RC: W12 D2 — What does Kafka even do?

April 30, 2024

My RC batch is finishing at the end of the week. I have spent most of my batch deepening my understanding of databases by re-implementing some from scratch. For the few days I have left, I wanted to tackle another project that would still be related to databases but would confront me with some completely new aspects. I thus decided to try re-implementing Apache Kafka: it has some database-like features (such as data persistence) even though it is not a database. Before today, I had only a very vague idea of what it was, having never used it myself. So I spent the day reading about what exactly it does and what its core components are, so that I could decide which parts to re-implement in the 3 days I have left. Sun shared with me what is surely the cutest (and very clear) explanation of Kafka for absolute beginners like me: the “Gently down the stream” comic. Tim Berglund’s Kafka Fundamentals videos are also a very clear way to get a broad overview of what it consists of.

Here is what I understood. Kafka is an event-streaming platform: its job is to manage and process a continuous flow of events (a.k.a. data records). Events enter the system through producers and are later processed by consumers. The diagram below shows the main components of a Kafka system:


+---------------+    +-------------------------------------+     +----------------+
|   Producers   |    |            Kafka Cluster            |     |   Consumers    |
|               +--->|                                     +---->|                |
|  (Send data)  |    |  +-------------------------------+  |     | (Receive data) |
+---------------+    |  |            Topic A            |  |     +----------------+
                     |  +-------------------------------+  |     
                     |  |+-------------+ +-------------+|  |
                     |  || Partition 1 | | Partition 2 ||  |       +------------+
                     |  || (Broker #1) | | (Broker #2) ||  +------>| Consumers  |
                     |  |+-------------+ +-------------+|  |       | (Group #1) |
                     |  || Record [0]  | | Record [0]  ||  |       +------------+
                     |  || Record [1]  | | Record [1]  ||  |
                     |  ||     ...     | |     ...     ||  |       +------------+
                     |  || Record [n]  | | Record [n]  ||  +------>| Consumers  |
                     |  |+-------------+ +-------------+|  |       | (Group #2) |
                     |  +-------------------------------+  |       +------------+    
                     |                                     |    
                     +-------------------------------------+
                     

In between producers (which send data) and consumers (which process it) sits the Kafka cluster. The cluster is divided into multiple “topics”, i.e. logical collections of “related” data: producers are usually configured to send each kind of event to a particular topic (which events count as “related” and should go to a given topic depends on business requirements and is configured freely by the user or administrator of the Kafka cluster).
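To make this concrete, here is roughly what the producing side looks like with the kafka-python client (I have not used it myself, and the topic name, key, and server address below are made up for illustration):

    from kafka import KafkaProducer

    # Connect to the cluster (address is just a placeholder).
    producer = KafkaProducer(bootstrap_servers="localhost:9092")

    # Send an event to a hypothetical "page-views" topic.
    producer.send("page-views", key=b"user-42", value=b'{"page": "/home"}')
    producer.flush()  # block until the event has actually been handed to the cluster

The producer only needs to know the topic; which partition and broker the event lands on is decided by the cluster (or by a partitioning rule based on the key).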

Each topic consists of one or more “partitions”, i.e. logs of ordered events. Partitions store the events durably: events are written to disk and can be retained for any period of time, from seconds to forever, depending on the configuration. This period is called the “retention time”. Since events are stored durably on disk, multiple groups of consumers can consume the same events, each in their own way to fit their particular purpose. In addition, because events are physically ordered within a partition, every consumer group is guaranteed to see the events of that partition in the same order (the ordering guarantee is per partition, not across a whole topic).
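On the other side, a consumer could look something like this (again a sketch with kafka-python; the topic, group name, and address are made up). Each group keeps track of its own offsets, so a second group with a different group_id would independently re-read the same events:

    from kafka import KafkaConsumer

    # Consumers join a "consumer group"; groups with different group_ids
    # each receive every retained event of the topic, independently.
    consumer = KafkaConsumer(
        "page-views",                       # same hypothetical topic as above
        bootstrap_servers="localhost:9092",
        group_id="analytics",               # made-up group name
        auto_offset_reset="earliest",       # start from the oldest retained event
    )

    for record in consumer:
        # Within one partition, records arrive in the order they were written.
        print(record.partition, record.offset, record.value)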

Physically, the Kafka cluster is made up of one or more machines (called “brokers”), each of which stores and manages one or more partitions.

Now that I have a broad understanding of Kafka’s core components, I have a clearer view of what an MVP could look like in 3 days. As a first pass, I will skip the serialization/deserialization needed to store events on disk, as it would be similar to what I already did for my previous database project. Instead, I will focus on having components that can communicate events to one another (a very rough sketch of what I have in mind is below). I will start tomorrow and see how far I can get before my batch is over!
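To make the plan a bit more concrete for tomorrow, here is the rough shape I am imagining as a starting point: everything in memory, one partition per topic, and all names are placeholders.

    from collections import defaultdict

    class Broker:
        """Toy in-memory 'cluster': one broker, one partition per topic."""

        def __init__(self):
            self._logs = defaultdict(list)    # topic -> append-only list of events
            self._offsets = defaultdict(int)  # (group, topic) -> next offset to read

        def produce(self, topic, event):
            self._logs[topic].append(event)   # append at the end, like a partition

        def consume(self, group, topic):
            offset = self._offsets[(group, topic)]
            events = self._logs[topic][offset:]
            self._offsets[(group, topic)] = len(self._logs[topic])
            return events                     # each group reads independently

    broker = Broker()
    broker.produce("page-views", {"page": "/home"})
    print(broker.consume("analytics", "page-views"))  # [{'page': '/home'}]
    print(broker.consume("billing", "page-views"))    # same events, different group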