This article will get you started with Apache Kafka by talking about its characteristics, components and use cases.
Messaging systems play an important role in any enterprise architecture because they enable reliable integration without tightly coupling the applications. In stream processing, messaging systems are even more crucial because they also act as a message store for stream processing engines. This buffering is desirable, and often required, because the processing engine may not be fast enough to keep up with the messages published by producers.
While traditional messaging systems such as JMS brokers, ActiveMQ and RabbitMQ have been quite useful in traditional scenarios, they are not efficient at handling Big Data scenarios. A typical Big Data scenario requires handling hundreds of thousands of messages per second. This is where Apache Kafka comes to the rescue, providing the capability to handle a huge number of messages per second.
Although Apache Kafka provides the performance needed to handle Big Data scenarios, it does not have as rich a feature set as traditional messaging systems. Some of the functionality that Apache Kafka lacks includes message selectors, bridging and routing capabilities.
Apache Kafka was developed at LinkedIn and is basically a commit log service that provides messaging system capabilities. Here are some of the basic characteristics of Apache Kafka -
- Distributed - Apache Kafka is a distributed framework and can use multiple commodity machines to store and manage messages. This lets us handle a huge volume of messages faster, because the I/O load is spread across machines instead of being limited by what a single machine can do. It is also cost-effective, as procuring multiple commodity machines is much cheaper than procuring one very high-end machine.
- Scalability - Apache Kafka can scale horizontally by adding nodes to its cluster with no downtime. This means that we can increase and decrease the size of the cluster dynamically without impacting the application.
- Durability - Apache Kafka provides message durability by persisting messages on disk rather than keeping them only in memory. This ensures that messages stored in a Kafka broker are not lost if the broker goes down. The persisted messages can also be used for batch consumption.
- Message Replication - Apache Kafka lets us configure the replication factor for the messages published on a Topic. This lets us create multiple copies of messages on different machines to avoid message loss if any broker goes down.
- Fault Tolerance - Apache Kafka can survive the failure of broker machines in a cluster. If a broker goes down, Kafka starts serving messages from the other available machines, provided message replication is properly configured.
- High Throughput - Apache Kafka provides high throughput to both producers and consumers. It can easily handle hundreds of thousands of messages per second and can be scaled further to achieve higher throughput by adding more machines to the cluster.
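To make the replication and fault tolerance points above more concrete, here is a minimal sketch in plain Python. It is a toy in-memory model, not Kafka's actual replication protocol: the `Broker` and `ReplicatedPartition` classes and all of their methods are hypothetical names invented for this illustration, and real brokers persist their logs to disk and track which replicas are up to date.

```python
from dataclasses import dataclass, field


@dataclass
class Broker:
    """Hypothetical stand-in for a broker holding one replica of a partition."""
    broker_id: int
    alive: bool = True
    log: list = field(default_factory=list)  # append-only list of messages


class ReplicatedPartition:
    """Toy partition: one leader plus followers, sized by the replication factor."""

    def __init__(self, brokers, replication_factor):
        if replication_factor > len(brokers):
            raise ValueError("replication factor cannot exceed broker count")
        # Assumption: the first N brokers hold the replicas, the first one leads.
        self.replicas = brokers[:replication_factor]
        self.leader = self.replicas[0]

    def append(self, message):
        """Copy the message to every live replica and return its offset."""
        self._ensure_leader()
        for replica in self.replicas:
            if replica.alive:
                replica.log.append(message)
        return len(self.leader.log) - 1

    def read(self, offset):
        """Serve a read from whichever replica currently leads."""
        self._ensure_leader()
        return self.leader.log[offset]

    def _ensure_leader(self):
        # If the leader is down, promote the first live replica.
        # This sketch never lets a recovered replica catch up on missed messages.
        if not self.leader.alive:
            live = [r for r in self.replicas if r.alive]
            if not live:
                raise RuntimeError("no live replica left for this partition")
            self.leader = live[0]


brokers = [Broker(0), Broker(1), Broker(2)]
partition = ReplicatedPartition(brokers, replication_factor=2)
first = partition.append("page-view")
second = partition.append("click")

brokers[0].alive = False             # simulate the leader going down
recovered = partition.read(second)   # now served by the surviving replica
```

With a replication factor of 2, the message written before the failure is still readable afterwards because the second broker holds a copy. The third broker holds nothing, which is why a replication factor of 1 would offer no protection here.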
Apache Kafka consists of various components, as shown in the following diagram -
- Producers - Producers are any applications/programs that publish messages to Kafka brokers. These can be front-end applications, batch jobs, Apache Flume agents, stream-based applications and background processes.
- Consumers - Consumers are the applications that consume messages from Kafka brokers. A consumer can be a simple application, a real-time stream processing engine or a Hadoop pipeline. Apache Kafka provides two types of consumer APIs - the High Level Consumer API and the Simple Consumer API. The High Level API is very easy to use and reads messages from the partitions of a Topic in sequence.
On the other hand, the Simple Consumer API is a lower-level API that offers more flexibility, such as consuming messages from arbitrary offsets, rolling back, and re-processing the same message.
- Topics and Partitions - Apache Kafka supports the concept of message Topics, which allow you to categorize messages. This lets us create different Topics for different types of messages and have different consumers consume them.
Apache Kafka further allows a Topic to be split into multiple partitions to enable parallel consumption, since separate consumers can consume from different partitions at the same time. Note that Kafka preserves message order only within a partition, so multiple partitions should be used only if ordering across all messages of a Topic is not important. Each partition has a leader node that is responsible for accepting read/write requests from consumers/producers for that partition.
By default, every Topic has one partition, but this behaviour is configurable while setting up the message brokers using the num.partitions property.
- Kafka Broker - A Kafka broker typically refers to a machine with Kafka installed on it. However, it is possible to set up more than one broker on a single machine in a non-production setting. A Kafka broker is responsible for managing the message logs and accepting requests from producers/consumers.
- Kafka Cluster - A Kafka cluster is a collection of Kafka brokers. All the Kafka brokers in a cluster work collectively to manage the messages and their copies as configured.
- Apache ZooKeeper - Apache Kafka brokers use ZooKeeper to coordinate with each other and form a cluster. All the metadata, such as Topic information, partition information and the leader node of each partition, is stored in ZooKeeper. The Kafka cluster also relies on ZooKeeper to watch for broker failures and to choose new leaders for partitions.
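The relationship between Topics, partitions and consumer offsets described above can be sketched with a small in-memory model. This is an illustration under stated assumptions, not Kafka client code: the `Topic` and `Consumer` classes are hypothetical, and `zlib.crc32` merely stands in for whatever hash a real producer uses to map a message key to a partition.

```python
import zlib
from collections import defaultdict


class Topic:
    """Toy Topic: a fixed number of partitions, each an append-only list."""

    def __init__(self, name, num_partitions=1):
        self.name = name
        self.partitions = [[] for _ in range(num_partitions)]

    def publish(self, key, value):
        """Append to the partition chosen by the key; return (partition, offset)."""
        # Assumption: a stable hash of the key selects the partition, so all
        # messages with the same key land in the same partition, in order.
        index = zlib.crc32(key.encode("utf-8")) % len(self.partitions)
        self.partitions[index].append((key, value))
        return index, len(self.partitions[index]) - 1


class Consumer:
    """Toy consumer that tracks its own read position per partition."""

    def __init__(self, topic):
        self.topic = topic
        self.offsets = defaultdict(int)  # partition index -> next offset to read

    def poll(self, partition):
        """Return the next (key, value) in sequence, or None if caught up."""
        log = self.topic.partitions[partition]
        offset = self.offsets[partition]
        if offset >= len(log):
            return None
        self.offsets[partition] = offset + 1
        return log[offset]

    def seek(self, partition, offset):
        """Move the read position, for example to re-process old messages."""
        self.offsets[partition] = offset


topic = Topic("orders", num_partitions=3)
placed = [topic.publish("customer-42", f"order-{i}") for i in range(3)]
partition = placed[0][0]

consumer = Consumer(topic)
in_order = [consumer.poll(partition)[1] for _ in range(3)]  # sequential read
consumer.seek(partition, 1)                                  # rewind
replayed = consumer.poll(partition)[1]                       # read order-1 again
```

Sequential `poll` calls loosely mirror the in-sequence reading attributed to the High Level Consumer API, while `seek` hints at the kind of control the Simple Consumer API exposes. Because messages stay in the log after being read, rewinding an offset is all it takes to re-process them.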
Apache Kafka is a reliable and mature project that is used by industry leaders such as LinkedIn, Twitter, Yahoo and Netflix. Here are some of the use cases of Apache Kafka -
- Messaging
- Stream Processing
- Website Activity Tracking
- Log Aggregation
- Time based Message Storage
Thank you for reading through the tutorial. If you have any feedback, questions or concerns, please share them in the comments and we will get back to you as soon as possible.