How Apache Kafka Works

Anshuman Pattnaik
9 min read · Apr 25, 2023

Kafka is a distributed streaming platform and one of the industry's most widely used frameworks. Its primary use case is building real-time streaming data pipelines that combine messaging, storage, and stream processing to analyze both historical and real-time data. This article walks through the architecture behind Kafka and its use cases.

What is Apache Kafka?

Apache Kafka is an open-source streaming data platform originally developed at LinkedIn. To further the project's development, LinkedIn donated it to the Apache Software Foundation, and the framework is now used by many companies for mission-critical projects.

Kafka is designed for high-throughput distributed systems: it runs as a cluster, scales to handle large-scale applications, and also serves as a storage system that keeps data as long as necessary. This makes it different from a traditional message queue, which removes a message as soon as a consumer reads it. The amount of time Kafka stores data is called retention. The retention period is configurable through the log retention policy; the default is 168 hours (seven days), after which Kafka automatically drops old log segments even if a segment is not full. You can change this period, or set it to "forever" to keep the log indefinitely.
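
As a rough illustration, these retention settings live in the broker's server.properties file (values here are examples; exact names and defaults can vary slightly between Kafka versions):

```properties
# server.properties (broker-level log retention; example values)
log.retention.hours=168        # keep log segments for 7 days (the default)
# log.retention.ms=-1          # uncomment to disable time-based retention ("forever")
log.segment.bytes=1073741824   # roll to a new segment after roughly 1 GB
```

Retention can also be overridden for a single topic with the retention.ms topic configuration.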

Can Apache Kafka Be Considered a Database?

As we've seen, Kafka can also serve as a storage system that replicates data across the brokers in a cluster, but can it be used as a database? That depends on your requirements, but in theory the answer is yes: Kafka offers add-ons such as ksqlDB and Tiered Storage for data processing and event streaming that you can use in your application. However, many other types of databases are available, such as relational databases and NoSQL stores. The question then becomes how to choose between Kafka and a traditional database system for your application.

Trade-Offs

  1. What is the structure of the data (e.g., JSON, CSV)?
  2. How long does the data need to be retained in the database?
  3. Do I have to run any complex aggregation queries to retrieve the data?
  4. Do I require an ACID guarantee?
  5. What is the expected data volume?

These trade-offs require some analysis, and once you have answered these questions you can decide which database suits your application. Remember that every database architecture is different and has its own characteristics.

What are the architecture and core concepts behind Apache Kafka?

This section covers some of Apache Kafka's core concepts to give a more in-depth understanding of how Kafka works.

Topics

A topic is a category in Kafka: producers publish a stream of data to a particular topic, and consumers read that data by subscribing to it. Once data is published to a topic, it is immutable, and you can create any number of topics within a Kafka cluster. Topics are similar to tables in a database, but without constraints; each topic name must be unique so it can identify the topic within the cluster.

YouTube is a useful analogy for Kafka topics. Imagine the many creators who make videos on different subjects and publish them on YouTube under a unique channel name. A user subscribes to these channels to watch the videos and gets a notification when a new video is uploaded. The channel name is essentially a Kafka topic: each creator pushes content to a specific channel, and the videos are organized by YouTube's system.

Partitions

In Kafka, partitions are the smallest storage units; the messages of a topic are stored in its partitions. Note that when you create a topic, you must specify the number of partitions. Each message in a partition gets an incremental id called an offset, which guarantees that messages are stored in the partition in sequence. There is no hard limit: a topic can have any number of partitions.
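
As a minimal sketch, here is how a topic with three partitions could be created with the kafka-python client; the broker address and topic name are placeholders:

```python
from kafka.admin import KafkaAdminClient, NewTopic

# Connect to the cluster (assumes a broker is listening on localhost:9092)
admin = KafkaAdminClient(bootstrap_servers="localhost:9092")

# The partition count (and replication factor) must be declared at creation time
topic = NewTopic(name="demo-topic", num_partitions=3, replication_factor=1)
admin.create_topics([topic])

admin.close()
```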

In the above diagram, the topic contains three partitions. Each partition has its own sequence of offsets holding individual messages, and the relevant partition is selected when a message is written to the topic. In this example, Partition 0 starts at offset 0 and increments by one for each message; with offsets 0 to 9 filled, its next message will be written at offset 10. Partition 1 holds offsets 0 to 5, so its next message will be written at offset 6. Similarly, Partition 2 holds offsets 0 to 9, so its next message will be written at offset 10. Each of the three partitions stores its messages independently, with its own offsets.
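
To make this offset bookkeeping concrete, a consumer can ask each partition which offset its next message will receive. This is a rough sketch with kafka-python, assuming the hypothetical demo-topic from above already exists:

```python
from kafka import KafkaConsumer, TopicPartition

consumer = KafkaConsumer(bootstrap_servers="localhost:9092")

# Build one TopicPartition handle per partition of the topic
partitions = [TopicPartition("demo-topic", p)
              for p in consumer.partitions_for_topic("demo-topic")]

# end_offsets() reports, for each partition, the offset the next message will get
for tp, next_offset in consumer.end_offsets(partitions).items():
    print(f"partition {tp.partition}: next write goes to offset {next_offset}")

consumer.close()
```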

Brokers

A Kafka broker is a container that holds topics and their partitions. Each broker within a cluster is also known as a bootstrap broker, because any broker holds metadata about all the other brokers, topics, and partitions in the cluster. Brokers store all their data in a server directory, and each topic partition gets its own directory named after the topic. The goal of this design is to achieve high throughput and scalability: topics and their partitions are distributed evenly among the brokers in a cluster.
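
As an illustration, two of the relevant per-broker settings in server.properties look roughly like this (the id and path are example values):

```properties
# server.properties: per-broker settings (example values)
broker.id=1                     # unique id for this broker within the cluster
log.dirs=/var/lib/kafka/data    # directory where topic-partition logs are stored
```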

In the above diagram, each broker holds partitions of two topics (A and B), and the partitions are distributed across the brokers. It's essential to understand that a single broker does not hold all the partitions of a topic. Topic B, for example, has three partitions, but they are distributed among the brokers in the cluster and have no relationship with Topic A's partitions.

Kafka is highly efficient at distributing partitions among brokers. Under high traffic, a Kafka administrator can move partitions to different brokers to load balance the cluster.
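
Such a move is typically driven by the kafka-reassign-partitions.sh tool that ships with Kafka. The exact flags depend on your Kafka version, and the topic, partition, and broker ids below are hypothetical:

```
# reassignment.json: move partition 0 of demo-topic onto broker 2
# {"version":1,"partitions":[{"topic":"demo-topic","partition":0,"replicas":[2]}]}

bin/kafka-reassign-partitions.sh --bootstrap-server localhost:9092 \
    --reassignment-json-file reassignment.json --execute
```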

Producers

Kafka producers are the clients that publish data to topics, spread across partitions. Producers automatically determine which partition and broker each piece of data should be written to; the user does not need to set any configuration to specify the broker or partition. In some cases, however, the user attaches a message key so that messages go to a specific partition and are stored in a specific order.

The message key lets a producer send data to a specific partition. If the producer does not set a key, the data is written to the partitions in round-robin fashion. This acts as load balancing, which is helpful under high traffic because it lets the producer spread its writes across the distributed partitions.

In the above diagram, the producer writes data to the Kafka cluster without specifying a key. The data gets distributed among the partitions of Topic-A across the brokers (Broker 01, Broker 02, and Broker 03).

In the above diagram, the producer writes data with a key such as msg_id: messages keyed msg_id_1 go to partition 0 under Broker 1, and messages keyed msg_id_2 go to partition 1 under Broker 2. The message key ensures that all messages with the same key land on the same partition, so consumers read them back in the order they were written for that key.
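
Here is a minimal sketch of both behaviours with the kafka-python producer; the topic name and key values are illustrative:

```python
from kafka import KafkaProducer

producer = KafkaProducer(bootstrap_servers="localhost:9092")

# No key: the producer spreads these messages across the topic's partitions
for i in range(5):
    producer.send("demo-topic", value=f"event {i}".encode())

# With a key: every message with the same key hashes to the same partition,
# so the ordering of messages for that key is preserved
producer.send("demo-topic", key=b"msg_id_1", value=b"first payload for msg_id_1")
producer.send("demo-topic", key=b"msg_id_1", value=b"second payload for msg_id_1")

producer.flush()
producer.close()
```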

Consumers and Consumer Groups

To read data from Kafka, it's important to understand consumers and consumer groups, as reading from Kafka works a bit differently than in other messaging systems. An application creates a KafkaConsumer object, subscribes to the appropriate Kafka topics, and starts receiving messages from them.

The application creates one consumer object within a consumer group, subscribes to the appropriate topic, and starts receiving messages. This works when traffic is low, but under higher traffic the producers write more and more messages to the topic, and a single consumer cannot keep up with the incoming rate. To resolve this, we need to scale out and allow multiple consumers to read from the same topic. When multiple consumers in the same consumer group subscribe to the same topic, each consumer receives messages from a different subset of the topic's partitions.

In the above diagram, there are three consumers within a single consumer group. Adding consumers to a consumer group is the main way to scale data consumption from a Kafka topic: under high traffic, the consumers share the load, keeping throughput high and latency low. It's also important to understand that there is no point in adding more consumers than there are partitions, or when traffic is low; the extra consumers will simply sit idle.
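
As a rough sketch with kafka-python, running several copies of the script below with the same group_id makes Kafka split the topic's partitions among them (broker address, topic, and group name are placeholders):

```python
from kafka import KafkaConsumer

consumer = KafkaConsumer(
    "demo-topic",
    bootstrap_servers="localhost:9092",
    group_id="demo-consumer-group",  # consumers sharing this id split the partitions
    auto_offset_reset="earliest",    # start from the beginning if no offset is committed
)

# Each record carries the partition and offset it was read from
for record in consumer:
    print(f"partition={record.partition} offset={record.offset} value={record.value}")
```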

I'll cover Kafka consumer architecture in more depth in future articles.

What is Zookeeper?

Zookeeper is free and open-source distributed coordination software developed by the Apache Software Foundation. It acts as a centralized service that coordinates and manages the Kafka cluster nodes and keeps track of Kafka brokers, topics, and partitions.

Zookeeper distributes its data over a collection of nodes, providing high availability and consistency within the cluster. If a node fails, Zookeeper can instantly perform a failover and bring up a replacement node in real time.
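
On the Kafka side, each broker only needs to be pointed at the Zookeeper ensemble; a typical server.properties entry looks roughly like this (the hostnames are placeholders):

```properties
# server.properties: point this broker at the Zookeeper ensemble (example hosts)
zookeeper.connect=zk1:2181,zk2:2181,zk3:2181
```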

The services provided by Zookeeper:

  1. Naming Service — In a cluster there are multiple nodes, and identifying a node requires a naming service similar to DNS.
  2. Configuration Management — Managing the configuration of the system throughout its life cycle requires a configuration management system.
  3. Cluster Management — Monitoring nodes joining or leaving the cluster, and their status in real time, requires cluster management.
  4. Leader Election — The server selected by the ensemble is called the leader. A node is elected as leader for coordination purposes.
  5. Locking and Synchronization Service — A locking mechanism locks data while it is being modified and helps with automatic failure recovery when working with other distributed applications.
  6. Highly Reliable Data Registry — Even if one or a few nodes are down, the data remains available.

The Zookeeper framework is highly efficient at managing distributed applications: it handles data inconsistency with atomic operations and provides mechanisms, such as fail-safe synchronization, to overcome challenges like race conditions and deadlocks.

I hope you enjoyed reading this article and that it gave you an insight into Kafka's architecture and how it works. If you found it helpful, feel free to share it with your friends.

