
Introduction to Kafka in Production – Part 1

Introduction

Does anyone know what Kafkaesque means? I didn’t until I started working with Apache Kafka and discovered its nightmarish complexity 🙂 As with any technology, gaining knowledge makes the complexity go away, or at least makes it less nightmarish.

In this (very) opinionated introduction we are going to dive into a few important aspects of Kafka. In addition, we’ll deploy a basic Kafka demo using Docker. Of course, this post doesn’t replace the existing great courses, books and official documentation.

It does, however, give you a basic understanding of what Kafka is and how it works.

Kafka introduction

I’ve come across many definitions of what Kafka is.

For instance:

  • a high-throughput distributed messaging system
  • a distributed commit log

Regardless of the exact definition, all of them address a common enterprise scenario: moving data between applications and datastores. 

Kafka started as an internal project at LinkedIn in 2009, built to cope with the company’s exponentially growing data volumes.

It was designed with the following properties in mind:

  • high throughput
  • horizontal scalability
  • reliability and durability
  • loose coupling between producers and consumers
  • publish-subscribe semantics

Therefore, it’s no coincidence that Kafka has become a perfect fit for:

  • microservices-based architectures
  • event-driven architectures

Today at LinkedIn, Kafka handles more than 1 trillion messages per day.

[Figure: Kafka basic use case architecture]

Kafka small demo

To demonstrate a basic usage scenario, we’ll bring up a small Kafka demo using docker and docker-compose.

Prerequisites (use the latest versions if in doubt)

  • docker
  • docker-compose
  • git

Quickstart

Follow the steps below to bring up the sample Kafka multi-container application:

  • clone the repository from my GitHub and cd into it.
  • run the application stack: docker-compose -p kafka-demo up -d
  • view stack containers:
$ docker-compose -p kafka-demo ps
     Name                   Command                  State                       Ports                 
-------------------------------------------------------------------------------------------------------
kafka            /opt/bitnami/scripts/kafka ...   Up (healthy)   9092/tcp                              
kafka-consumer   kafkacat -b kafka:9092 -C  ...   Up                                                   
kafka-producer   kafkacat -b kafka:9092 -t  ...   Exit 0                                               
zookeeper        /opt/bitnami/scripts/zooke ...   Up (healthy)   2181/tcp, 2888/tcp, 3888/tcp, 8080/tcp
  • restart Kafka to let the produced messages persist to disk: $ docker-compose -p kafka-demo restart kafka
  • inspect the logs of the Kafka consumer: docker-compose -p kafka-demo logs -f kafka-consumer
  • you should see the messages produced by the Kafka producer from my_msgs.txt:

kafka-consumer | Key (-1 bytes):
kafka-consumer | Value (92 bytes): {1:{"order_id":1,"order_ts":1534772501276,"total_amount":10.50,"customer_name":"Bob Smith"}}
kafka-consumer | Partition: 0 Offset: 0
kafka-consumer | --
kafka-consumer |
kafka-consumer | Key (-1 bytes):
kafka-consumer | Value (93 bytes): {2:{"order_id":2,"order_ts":1534772605276,"total_amount":3.32,"customer_name":"Sarah Black"}}
kafka-consumer | Partition: 0 Offset: 1
kafka-consumer | --
kafka-consumer |
kafka-consumer | Key (-1 bytes):
kafka-consumer | Value (94 bytes): {3:{"order_id":3,"order_ts":1534772742276,"total_amount":21.00,"customer_name":"Emma Turner"}}
kafka-consumer | Partition: 0 Offset: 2
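
You can also push another message through the same pipeline yourself. Below is a minimal sketch that reuses the kafka-consumer container (it already has kafkacat installed); the topic name my_orders is an assumption, so check the docker-compose.yml for the real one:

$ echo '{4:{"order_id":4,"order_ts":1534772900000,"total_amount":7.25,"customer_name":"John Doe"}}' | \
    docker exec -i kafka-consumer kafkacat -b kafka:9092 -t my_orders -P
# -P puts kafkacat in producer mode; each line of stdin becomes one message.
# The kafka-consumer container should log the new message at the next offset.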

Quickstart deep dive 

So what happened here? Let’s go over the most interesting points.

ZooKeeper 

In short, ZooKeeper is the Kafka cluster coordinator. It is required regardless of whether Kafka is deployed as a single node or as a cluster of nodes. It’s important to note that Kafka is working on removing ZooKeeper from its deployments, but at the time of writing this is still an experimental feature.

In our basic demo we have a single Kafka node. Run the commands below in the ZooKeeper container to view its metadata.

docker exec -it zookeeper zkCli.sh

This opens the ZooKeeper shell. Inside it, run:


[zk: localhost:2181(CONNECTED) 0] ls /

[admin, brokers, cluster, config, consumers, controller, controller_epoch, feature, isr_change_notification, latest_producer_id_block, log_dir_event_notification, zookeeper]

[zk: localhost:2181(CONNECTED) 1] ls /brokers

[ids, seqid, topics]

[zk: localhost:2181(CONNECTED) 2] ls /brokers/ids

[1]
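
You can also inspect what the broker registered about itself. For example (the znode path follows from the listing above; the exact JSON printed depends on your configuration):

[zk: localhost:2181(CONNECTED) 3] get /brokers/ids/1

This prints the broker’s registration data, including its host, port and endpoints.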

We’ll dive deeper into ZooKeeper and its role in a Kafka cluster deployment in the next blog posts about Kafka.

A few important points:

  • we used the ZooKeeper CLI client to connect to the ZooKeeper server inside the container.
  • the ALLOW_ANONYMOUS_LOGIN environment variable is set to yes, so there was no need to authenticate against ZooKeeper (you can verify this as shown below). Of course, this is not recommended for production.
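
As a quick sanity check, you can confirm the setting from the host (assuming the container is named zookeeper, as in the demo):

$ docker exec zookeeper env | grep ALLOW_ANONYMOUS_LOGIN
ALLOW_ANONYMOUS_LOGIN=yes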

Kafka

Inspect the Kafka log to see its configuration and ZooKeeper-related log entries:

$ docker-compose -p kafka-demo logs -f kafka
...

Note that both Kafka and ZooKeeper data are persisted on the Docker host using Docker volumes.
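
You can list these volumes from the host; docker-compose prefixes them with the project name, so something along these lines should work (the exact names depend on the volume definitions in docker-compose.yml):

$ docker volume ls | grep kafka-demo
# Each listed volume maps to a data directory of Kafka or ZooKeeper,
# which is why messages survive the 'restart kafka' step above.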

Kafka Producer and Kafka Consumer

Both the Kafka producer and the Kafka consumer are based on Confluent’s Docker image of kcat (formerly known as kafkacat).
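
Their commands are truncated in the docker-compose ps output above; stripped of container plumbing, they boil down to something like the following (the topic name my_orders and the message file path are assumptions):

# Producer: send each line of the file as a separate message, then exit
$ kafkacat -b kafka:9092 -t my_orders -P -l /data/my_msgs.txt
# Consumer: read the topic from the beginning and keep waiting for new messages
$ kafkacat -b kafka:9092 -t my_orders -C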

How Kafka data is stored on disk

Basically, each topic partition is a commit log: Kafka appends data to the end of the log, which guarantees ordering at the level of a single topic partition (our demo has one topic with one partition). Each appended message is assigned an offset, and consumers read the log starting from the last offset they know about. What’s great about this is that any application can read the data from the last “commit” known to it, at its own pace.
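
To make the offset idea concrete, here is a small experiment you can run against the demo: re-read the partition starting from offset 1, i.e. skipping the first “commit” (the topic name my_orders is again an assumption):

$ docker exec kafka-consumer kafkacat -b kafka:9092 -t my_orders -p 0 -o 1 -C -e
# -p 0 : read partition 0
# -o 1 : start consuming at offset 1 instead of the beginning
# -e   : exit once the end of the partition is reached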

Kafka basic terms

We’ll finish by exploring Kafka terms.

Broker – the server where the data resides.

Consumer – receives messages from a topic.

Producer – sends messages to a topic.

Message – a byte array with no specific format, optionally with an attached key. Messages can be sent with a schema (Avro, JSON, XML) so that they can be validated.

Topic – a way of classifying and structuring messages; roughly the equivalent of a DB table.
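
These terms map directly onto the kafkacat flags used throughout this demo (the topic name is an assumption, as before):

# -b : the broker to connect to (the server where the data resides)
# -t : the topic to read from (how the messages are classified)
# -C : act as a consumer (-P would act as a producer instead)
$ kafkacat -b kafka:9092 -t my_orders -C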

Summary

That’s all for now about Kafka basics. In the next post we’ll explore how to deploy and monitor a highly available Kafka cluster 🙂

Feel free to share.

Bonus: recommended Kafka courses on Pluralsight that I learned from.

Sign up using this link to get exclusive discounts (like 50% off your first month or 15% off an annual subscription).