Guide on Kafka – An open-source distributed streaming system

When we talk about Big Data, we are looking at enormous amounts of data, and with it come two main challenges. The first is storing this huge amount of unstructured data, and the second is analysing it in real time for better business decisions. Overcoming these challenges requires a system that can queue, store, transmit, and produce real-time results.

Kafka is designed to handle large amounts of data in distributed, high-throughput systems, and it tends to perform well compared with traditional message brokers. For large-scale message-processing applications, Kafka offers better throughput, built-in partitioning, replication, and inherent fault tolerance than other messaging systems.

Let’s walk through Kafka together!

What is Kafka?

Kafka is an open-source distributed streaming platform used to build real-time, event-driven applications. Specifically, it allows developers to build applications that continuously produce and consume streams of data records.

It runs as a cluster that can span multiple servers or even data centers. The records it handles are replicated and partitioned, so many users can use an application simultaneously without perceptible performance lags.

Kafka is super-fast! It also maintains a very high level of accuracy with data records and preserves the order of their occurrence. On top of that, Kafka is highly fault-tolerant, with support for data replication and partitioning to ensure that data is always available and can be processed in parallel.

Together, these characteristics add up to a compelling platform.

Some of the key components and related terminology of Kafka are listed here (a short producer sketch follows the list):

  1. Topics: A topic is a category or feed name to which messages are published by producers. It is a logical abstraction of the data streams that are being processed and can be thought of as a stream of records.
  2. Producers: Producers are applications that publish data to Kafka topics. They can be thought of as a source of data that sends messages to a specific topic.
  3. Consumers: Consumers are applications that subscribe to Kafka topics and consume data from them. They can be thought of as a sink of data that receives messages from a specific topic.
  4. Brokers: Brokers are servers that manage and store the Kafka topics. They receive messages from producers and send messages to consumers. They can be thought of as a mediator between producers and consumers.
  5. Partitions: A partition is a unit of parallelism in Kafka. Each topic is divided into one or more partitions, which are distributed across the Kafka brokers. Messages are stored in partitions, and consumers can read messages from one or more partitions.
  6. Offsets: An offset is a unique identifier for each message in a partition. Consumers can track their progress in reading messages from a partition by storing the offset of the last message they have consumed.
  7. Consumer Groups: A consumer group is a group of consumers that jointly consume messages from a topic. Each message in a partition can be consumed by only one consumer in a consumer group. This enables scaling the consumption of messages across multiple consumers.
  8. Replication: Replication is a mechanism by which Kafka ensures high availability and durability of messages. Each partition is replicated across multiple brokers, and one broker is designated as the leader for each partition. The leader is responsible for handling all read and write requests for a partition, and the replicas synchronize with the leader to ensure that they have the same copy of the data.
  9. ZooKeeper: ZooKeeper is a distributed coordination service that Kafka uses for cluster coordination, management, and configuration. It maintains the metadata about the Kafka brokers, topics, partitions, and consumer groups.
  10. Streams: Kafka Streams is a library for building real-time stream processing applications on top of Kafka. It enables developers to write stream processing applications using a simple, high-level API.
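
To make these terms concrete, here is a minimal producer sketch using Kafka's Java client. The topic name ("page-views"), the keys and values, and the broker address are placeholder assumptions for illustration:

```java
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class PageViewProducer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // placeholder broker address
        props.put("key.serializer",
                  "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer",
                  "org.apache.kafka.common.serialization.StringSerializer");

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            // Records with the same key always land in the same partition,
            // which preserves per-key ordering.
            producer.send(new ProducerRecord<>("page-views", "user-42", "/home"));
            producer.send(new ProducerRecord<>("page-views", "user-42", "/checkout"));
        }
    }
}
```

Each record the broker stores receives an offset within its partition; the consumer sketches later in this guide use those offsets to track progress.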

Why would you use Kafka?

Kafka is used to build data pipelines and streaming applications: a streaming application consumes streams of data, and a data pipeline reliably moves data between systems. You can use Kafka to ingest and store streaming data while serving read requests in parallel from applications that drive the pipeline. For example, suppose you are creating a data pipeline that takes in user activity data to track how users navigate your website. Kafka can also be utilized as a message broker to process and mediate communication among applications.

  • Real-time data processing: Kafka is designed to handle large volumes of data in real time. It can process millions of events per second and can support applications that require low-latency data processing.
  • Scalability: Kafka is designed to be highly scalable, both in terms of data volume and processing throughput. It can handle large volumes of data and can be easily scaled by adding more brokers to the cluster.
  • Fault tolerance: Kafka is designed to be highly fault-tolerant. It can tolerate the failure of individual brokers and can recover from failures by replicating data across multiple brokers.
  • Data integration: Kafka can integrate data from multiple sources and systems. It can be used to collect data from various sources and store it in a central location for processing and analysis.
  • Streaming analytics: Kafka can be used for real-time streaming analytics. It can process data as it flows through the system and generate real-time insights and alerts.
  • Microservices architecture: Kafka is well-suited to a microservices architecture. It can be used to decouple data producers and consumers and provide a scalable and reliable communication layer between microservices.
  • Data management: Kafka can be used for managing large volumes of data in distributed systems. It can be used to store data in a scalable and fault-tolerant manner and provide a unified interface for data access and management.

How does Kafka work?

By combining queuing and publish-subscribe, Kafka provides consumers with the key advantages of both messaging models. With queuing, data processing can be distributed across multiple consumer instances, enabling high scalability; the traditional queue, however, is not multi-subscriber. Publish-subscribe is a multi-subscriber approach, but since messages are sent to every subscriber, it cannot be used to distribute work among multiple worker processes.

To stitch together these two solutions, Kafka uses a partitioned log model. Each log is an ordered sequence of records, and logs are broken up into partitions that can be assigned to different subscribers. Consequently, multiple subscribers can each be assigned a partition of the same topic, allowing for enhanced scalability. An additional feature of Kafka’s model is replayability, which enables different applications to read data streams independently of one another.
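
Here is a rough sketch of the consumer side in Java, assuming the same hypothetical "page-views" topic. Consumers sharing a group.id split a topic's partitions among themselves (queue semantics), while a second group with a different group.id independently receives every record (publish-subscribe semantics):

```java
import java.time.Duration;
import java.util.List;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;

public class PageViewConsumer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // placeholder address
        // Consumers with the same group.id divide the partitions between them;
        // a different group.id receives its own full copy of the stream.
        props.put("group.id", "analytics");
        props.put("key.deserializer",
                  "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer",
                  "org.apache.kafka.common.serialization.StringDeserializer");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(List.of("page-views"));
            while (true) {
                ConsumerRecords<String, String> records =
                        consumer.poll(Duration.ofSeconds(1));
                for (ConsumerRecord<String, String> r : records) {
                    System.out.printf("partition=%d offset=%d key=%s value=%s%n",
                                      r.partition(), r.offset(), r.key(), r.value());
                }
            }
        }
    }
}
```

Running a second copy of this program with the same group.id splits the partitions between the two instances; changing the group.id instead gives the new instance its own complete view of the topic.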

Kafka Architecture –

Kafka APIs

Kafka’s architecture has four core APIs: the Producer API, Consumer API, Streams API, and Connector API. Let’s look at them one by one:

  • Producer API
    An application can use it to publish a stream of records to one or more Kafka topics.
  • Consumer API
    Applications using this API can subscribe to one or more topics and process the stream of records produced to them.
  • Streams API
    The Streams API enables applications to act as stream processors, consuming input streams from one or more topics and producing output streams (see the sketch after this list).
  • Connector API
    The Connector API establishes connections between Kafka topics and existing applications and data systems.
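
As an illustration of the Streams API, here is a minimal sketch that reads one topic, transforms each value, and writes the result to another topic. The application id and the topic names ("input-events", "output-events") are assumptions for the example:

```java
import java.util.Properties;
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.kstream.KStream;

public class UppercaseStream {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "uppercase-demo"); // assumed id
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG,
                  Serdes.String().getClass());
        props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG,
                  Serdes.String().getClass());

        // Consume from one topic, transform each record, produce to another.
        StreamsBuilder builder = new StreamsBuilder();
        KStream<String, String> input = builder.stream("input-events");
        input.mapValues(value -> value.toUpperCase())
             .to("output-events");

        KafkaStreams streams = new KafkaStreams(builder.build(), props);
        streams.start();
        Runtime.getRuntime().addShutdownHook(new Thread(streams::close));
    }
}
```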

Kafka Cluster

  • Kafka Broker
    In general, Kafka clusters consist of multiple brokers for load balancing. Since brokers are stateless, ZooKeeper is used to maintain the cluster state. A single Kafka broker instance can handle hundreds of thousands of reads and writes per second, and each broker can handle terabytes of messages without impacting performance. ZooKeeper also performs the election of Kafka broker leaders.
  • Kafka – ZooKeeper
    Kafka brokers use ZooKeeper to manage and coordinate the cluster. ZooKeeper notifies producers and consumers when a new Kafka broker appears or when a broker fails, and the producers and consumers then coordinate their work with another broker.
  • Kafka Producers
    Kafka producers push data to brokers. When a new broker starts, the producers discover it and can begin sending messages to it. It is pertinent to note that a producer configured not to wait for broker acknowledgments (acks=0) sends messages as fast as the broker can handle them; waiting for acknowledgments (acks=1 or acks=all) trades some speed for stronger delivery guarantees.
  • Kafka Consumers
    Due to the statelessness of Kafka brokers, the Kafka consumer keeps track of how many messages it has consumed using the partition offset. Once the consumer acknowledges a particular message offset, you can be assured that all previous messages have been consumed. The consumer issues asynchronous pull requests to the broker to obtain a buffer of bytes ready to consume, and by supplying an offset value it can rewind or skip to any point in a partition (see the sketch after this list). In older Kafka versions, consumer offset values were tracked in ZooKeeper; modern versions store them in an internal Kafka topic.
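
Because progress is tracked by offsets, a consumer can rewind or skip to any point in a partition. A minimal sketch, assuming a "page-views" topic whose partition 0 holds at least 100 records:

```java
import java.time.Duration;
import java.util.List;
import java.util.Properties;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.TopicPartition;

public class ReplayConsumer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // placeholder address
        props.put("group.id", "replay-demo");             // assumed group id
        props.put("key.deserializer",
                  "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer",
                  "org.apache.kafka.common.serialization.StringDeserializer");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            TopicPartition partition = new TopicPartition("page-views", 0);
            consumer.assign(List.of(partition)); // manual assignment, no group rebalance
            consumer.seek(partition, 100L);      // rewind to offset 100 and re-read
            consumer.poll(Duration.ofSeconds(1))
                    .forEach(r -> System.out.println(r.offset() + ": " + r.value()));
        }
    }
}
```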

Kafka Pros and Cons –

Pros of Kafka

  • High-throughput 
    Kafka can handle high-velocity, high-volume data without requiring particularly large hardware, and it supports message throughput of thousands of messages per second.
  • Low Latency 
    The platform can handle these messages with very low latency, in the range of milliseconds, which most new application scenarios require.
  • Fault-Tolerant
    A key advantage of this product is its fault tolerance. Within a cluster, Kafka has the inherent capability of being resistant to node and machine failures.
  • Durability
    Data/messages are durable because they are persisted on disk. Furthermore, messages are never lost, as message replication ensures this, which is one of the factors contributing to durability.
  • Scalability
    To scale Kafka, additional nodes can be added on the fly without incurring any downtime. Furthermore, message handling within a Kafka cluster is fully transparent and seamless.
  • Message Broker Capabilities
    When used in place of a more traditional message broker, Kafka works very well. Message broker refers to an intermediary program, which translates messages from the publisher’s formal messaging protocol to the receiver’s formal messaging protocol.
  • High Concurrency
    Because of its low latency and high throughput, Kafka is capable of handling thousands of messages per second. In addition, it permits the reading and writing of messages into it at high concurrency.
  • Consumer Friendly 
    Kafka allows integration with a wide variety of consumers. It has the advantage of being able to behave or act differently depending on the consumer it integrates with, because each consumer can handle messages differently. Further, Kafka integrates easily with consumers written in a variety of languages.
  • Batch Handling Capability (ETL like functionality) 
    Since Kafka is capable of persisting messages, it can also be used for batch-like use cases and perform ETL-like functions.
  • Variety of Use Cases
    Kafka solutions can manage a wide variety of use cases commonly required for a Data Lake. A few examples of this would be log aggregation, tracking of web activity, and so on.
  • Real-Time Handling
    Kafka can handle real-time data pipelines and real-time messages from applications.

Cons of Kafka

  • No Complete Set of Monitoring Tools 
    This system lacks a full set of tools for monitoring and managing it. Therefore, enterprise support staff are often hesitant about adopting Kafka and supporting it in the long run.
  • Issues with Message Tweaking 
    Brokers deliver messages to consumers using efficient system-level capabilities, so Kafka performs very well when messages remain unchanged. If a message requires modification, however, Kafka suffers significant performance degradation.
  • Reduced Performance with Large Messages
    Sending large numbers of messages is generally fine. However, as message size increases, brokers and consumers begin compressing them, and when the data is decompressed it consumes the node’s memory. Since compression and decompression occur as data flows through the pipeline, both performance and throughput are affected (see the sketch after this list).
  • Lack of some Messaging Paradigms 
    Kafka lacks messaging paradigms such as request/reply, point-to-point queues, and so on. This is not always a problem, but it may be for certain applications.
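
For reference, compression is a single producer-side setting, so the trade-off above is easy to experiment with. A minimal sketch; the topic name and payload are assumptions:

```java
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class CompressedProducer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // placeholder address
        props.put("key.serializer",
                  "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer",
                  "org.apache.kafka.common.serialization.StringSerializer");
        // Batches are compressed on the producer, stored compressed on the
        // broker, and decompressed by consumers: the CPU/memory trade-off
        // described in the list above.
        props.put("compression.type", "gzip");

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            producer.send(new ProducerRecord<>("events", "a large payload ..."));
        }
    }
}
```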

Types of Kafka Tools

The Kafka Tools are categorized into two types:

  • System tools
  • Replication tools

Kafka System Tools

Kafka’s system tools can be run from the command line using the kafka-run-class.sh script.

Types of Kafka system tools:

  • Kafka Migration Tool
    The Kafka Migration Tool is used to migrate a Kafka broker from one version to another.
  • Mirror Maker
    Mirror Maker enables mirroring of one Kafka cluster to another.
  • Consumer Offset Checker
    The Consumer Offset Checker tool displays the topic, offset, log size, consumer group, partitions, and owner for a specified set of topics and consumer groups.

Kafka Replication Tools

Replication tools are higher-level design tools, added to enhance durability and availability.

The Kafka replication tools are:

  • Create Topic Tool
    The Create Topic Tool creates topics with a default number of partitions and a replication factor, using Kafka’s default scheme for replica assignment.
  • List Topic Tool
    This tool provides information about a set of topics. If no topics are supplied on the command line, it queries ZooKeeper for all topics. It can display several fields, including topic name, partition, leader, replicas, and ISR (in-sync replicas).
  • Add Partition Tool
    When a topic is created, the number of partitions must be specified. As volume grows, however, we may need to divide the topic into more partitions (a programmatic sketch follows this list).
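
The same operations are also exposed programmatically. Here is a sketch using Kafka's Java AdminClient to create a topic with an explicit partition count and replication factor and later expand it; the topic name and the counts are assumptions:

```java
import java.util.List;
import java.util.Map;
import java.util.Properties;
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.NewPartitions;
import org.apache.kafka.clients.admin.NewTopic;

public class TopicAdmin {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // placeholder address

        try (AdminClient admin = AdminClient.create(props)) {
            // Create "orders" with 3 partitions, each replicated on 2 brokers.
            admin.createTopics(List.of(new NewTopic("orders", 3, (short) 2)))
                 .all().get();

            // Later, as volume grows, expand the topic to 6 partitions.
            // Note that partition counts can only be increased, never decreased.
            admin.createPartitions(Map.of("orders", NewPartitions.increaseTo(6)))
                 .all().get();
        }
    }
}
```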

Kafka Use Cases

Before we go through some use cases, let’s talk about how applications were built before event streaming was on the scene. If a developer wanted to make a retail application, they might create a checkout, and they might want that checkout to trigger a shipment when it happens.

So, when a user checks out, eventually the order gets shipped. Considering the shape of the data, the way it is transported, and its format, this is only one integration that needs to be developed, so it’s not a huge deal.

But, as the application grows, we may want to add an automated email receipt when a checkout happens, or we may wish to add an update to the inventory when a checkout happens.

As front-end and back-end services get added and the application grows, more and more integrations need to be built, and things eventually get very messy. Not only that, the teams in charge of each service now depend on one another before they can make any changes, which slows development.

So, one excellent use case for Kafka is decoupling system dependencies. With Kafka, all the challenging integrations go away; instead, the checkout itself streams events. Every time a checkout happens, it gets streamed, and the checkout is not concerned with who is listening to that stream.

The checkout simply broadcasts all those events. The other services (email, shipment, inventory) subscribe to that stream, listen for the events they care about, get the information they need, and act accordingly. This is how Kafka can decouple your system dependencies, and it is also a good example of how Kafka can be used for messaging. Even if this application were built from the ground up as a cloud-native application, it could still be built this way, using messaging to elevate the checkout experience.
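
In Kafka terms, each downstream service is simply its own consumer group on the checkout stream. A schematic sketch, with the topic name ("checkout") and the service group ids assumed for illustration:

```java
import java.time.Duration;
import java.util.List;
import java.util.Properties;
import org.apache.kafka.clients.consumer.KafkaConsumer;

public class CheckoutSubscribers {

    // Build a consumer for one downstream service. Because each service uses
    // its own group.id, every service independently receives every checkout
    // event; the checkout producer never needs to know who is listening.
    static KafkaConsumer<String, String> serviceConsumer(String groupId) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // placeholder address
        props.put("group.id", groupId);
        props.put("key.deserializer",
                  "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer",
                  "org.apache.kafka.common.serialization.StringDeserializer");
        KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props);
        consumer.subscribe(List.of("checkout")); // assumed topic name
        return consumer;
    }

    public static void main(String[] args) {
        // The email, shipment, and inventory services would each run their own
        // consumer; here we show just the email service reading the stream.
        try (KafkaConsumer<String, String> email = serviceConsumer("email-service")) {
            email.poll(Duration.ofSeconds(1))
                 .forEach(r -> System.out.println("send receipt for order " + r.value()));
        }
    }
}
```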

Gist of the Best Kafka Use Cases:

What is Kafka used for?

  • Activity tracking
  • Real-time data processing
  • Messaging
  • Operational Metrics/KPIs
  • Log Aggregation

When ‘Not’ to use Kafka

  • “Little” data, since Kafka is designed for large volumes
  • Streaming ETL

Summing up

One of the main reasons for Kafka’s popularity is its power and flexibility. The system has proven itself to be reliable, scalable, and fault-tolerant. For scenarios requiring real-time data processing, application activity tracking, and messaging, Kafka is an invaluable tool.

Want to know how Kafka can help you achieve your goals?

Talk to our Kafka experts.

Parijat Sengupta
Senior Content Strategist

Parijat works as a Senior Content Strategist at Enhops. Her expertise lies in converting technical content into easy-to-understand pieces that help decision-makers in selecting the right technologies to enable digital transformation. She also enjoys supporting demand-generation and sales functions by creating and strategizing content for email campaigns, social media, blogs, video scripts, newsletters, and public relations. She has written content on Oracle, Cloud, and Salesforce.