
Introduction to Kafka in Production – Part 1

Introduction

Does anyone know what Kafkaesque means? I didn’t until I started working with Apache Kafka and discovered its nightmarish complexity 🙂 As with any technology, gaining knowledge makes the complexity go away, or at least makes it less nightmarish.

In this (very) opinionated introduction we are going to dive into a few important aspects of Kafka. In addition, we’ll deploy a basic Kafka demo using Docker. Of course, this post doesn’t replace the existing great courses, books and official documentation.

It does, however, give you a basic understanding of what Kafka is and how it works.

Kafka introduction

I’ve come across many definitions of what Kafka is.

For instance:

  • a high-throughput distributed messaging system
  • a distributed commit log

Regardless of the exact definition, all of them address a common enterprise scenario: moving data between applications and datastores. 

Kafka started as an internal project at LinkedIn in 2009, built to cope with the company’s exponentially growing data volumes.

It was designed with the following properties in mind:

  • high throughput
  • horizontal scalability
  • reliability and durability
  • loose coupling between producers and consumers
  • publish-subscribe semantics

Therefore, it’s no coincidence that Kafka has become a perfect fit for:

  • microservices-based architectures
  • event-driven architectures

Today at LinkedIn, Kafka handles more than 1 trillion messages per day.

[Figure: Kafka basic use case architecture]

Kafka small demo

To demonstrate a basic usage scenario, we’ll bring up a small Kafka demo using docker and docker-compose.

Prerequisites (use the latest versions if in doubt)

  • docker
  • docker-compose
  • git

Quickstart

Follow the steps below to bring up the sample Kafka multi-container application:

  • clone the repository from my GitHub and cd into it.
  • run the application stack: docker-compose -p kafka-demo up -d
  • view stack containers:
$ docker-compose -p kafka-demo ps
     Name                   Command                  State                       Ports                 
-------------------------------------------------------------------------------------------------------
kafka            /opt/bitnami/scripts/kafka ...   Up (healthy)   9092/tcp                              
kafka-consumer   kafkacat -b kafka:9092 -C  ...   Up                                                   
kafka-producer   kafkacat -b kafka:9092 -t  ...   Exit 0                                               
zookeeper        /opt/bitnami/scripts/zooke ...   Up (healthy)   2181/tcp, 2888/tcp, 3888/tcp, 8080/tcp
  • restart Kafka to let the produced messages persist to disk: $ docker-compose -p kafka-demo restart kafka
  • inspect the logs of the Kafka consumer: docker-compose -p kafka-demo logs -f kafka-consumer
  • you should see the messages produced by the Kafka producer from my_msgs.txt:

kafka-consumer | Key (-1 bytes):
kafka-consumer | Value (92 bytes): {1:{"order_id":1,"order_ts":1534772501276,"total_amount":10.50,"customer_name":"Bob Smith"}}
kafka-consumer | Partition: 0 Offset: 0
kafka-consumer | --
kafka-consumer |
kafka-consumer | Key (-1 bytes):
kafka-consumer | Value (93 bytes): {2:{"order_id":2,"order_ts":1534772605276,"total_amount":3.32,"customer_name":"Sarah Black"}}
kafka-consumer | Partition: 0 Offset: 1
kafka-consumer | --
kafka-consumer |
kafka-consumer | Key (-1 bytes):
kafka-consumer | Value (94 bytes): {3:{"order_id":3,"order_ts":1534772742276,"total_amount":21.00,"customer_name":"Emma Turner"}}
kafka-consumer | Partition: 0 Offset: 2
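
You can also push another message through the same pipeline yourself. Below is a minimal sketch that reuses the kafka-consumer container (it already has kafkacat installed); the topic name my_orders is an assumption, so check the docker-compose.yml for the real one:

$ echo '{4:{"order_id":4,"order_ts":1534772900000,"total_amount":7.25,"customer_name":"John Doe"}}' | \
    docker exec -i kafka-consumer kafkacat -b kafka:9092 -t my_orders -P
# -P puts kafkacat in producer mode; each line of stdin becomes one message.
# The kafka-consumer container should log the new message at the next offset.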

Quickstart deep dive 

So what happened here? Let’s go over the most interesting points.

ZooKeeper 

In short, ZooKeeper is the Kafka cluster coordinator. It is required regardless of whether Kafka is deployed as a single node or as a cluster of nodes. It’s important to note that Kafka is working on removing ZooKeeper from its deployments, but at the time of writing this is still an experimental feature.

In our basic demo we have a single Kafka node. Run the commands below in the ZooKeeper container to view its metadata.

docker exec -it zookeeper zkCli.sh

This opens the ZooKeeper shell. Inside it, run:


[zk: localhost:2181(CONNECTED) 0] ls /

[admin, brokers, cluster, config, consumers, controller, controller_epoch, feature, isr_change_notification, latest_producer_id_block, log_dir_event_notification, zookeeper]

[zk: localhost:2181(CONNECTED) 1] ls /brokers

[ids, seqid, topics]

[zk: localhost:2181(CONNECTED) 2] ls /brokers/ids

[1]
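
You can also inspect what the broker registered about itself. For example (the znode path follows from the listing above; the exact JSON printed depends on your configuration):

[zk: localhost:2181(CONNECTED) 3] get /brokers/ids/1

This prints the broker’s registration data, including its host, port and endpoints.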

We’ll dive deeper into ZooKeeper and its role in a Kafka cluster deployment in the next blog posts about Kafka.

A few important points:

  • we used the ZooKeeper CLI client to connect to the ZooKeeper server inside the container.
  • the ALLOW_ANONYMOUS_LOGIN environment variable is set to yes, so there was no need to authenticate against ZooKeeper (you can verify this as shown below). Of course, this is not recommended for production.
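
As a quick sanity check, you can confirm the setting from the host (assuming the container is named zookeeper, as in the demo):

$ docker exec zookeeper env | grep ALLOW_ANONYMOUS_LOGIN
ALLOW_ANONYMOUS_LOGIN=yes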

Kafka

Inspect the Kafka log to see its configuration and ZooKeeper-related log entries:

$ docker-compose -p kafka-demo logs -f kafka
...

Note that both Kafka and ZooKeeper data are persisted on the Docker host using Docker volumes.
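
You can list these volumes from the host; docker-compose prefixes them with the project name, so something along these lines should work (the exact names depend on the volume definitions in docker-compose.yml):

$ docker volume ls | grep kafka-demo
# Each listed volume maps to a data directory of Kafka or ZooKeeper,
# which is why messages survive the 'restart kafka' step above.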

Kafka Producer and Kafka Consumer

Both the Kafka producer and the Kafka consumer are based on Confluent’s Docker image of kcat (formerly known as kafkacat).
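
Their commands are truncated in the docker-compose ps output above; stripped of container plumbing, they boil down to something like the following (the topic name my_orders and the message file path are assumptions):

# Producer: send each line of the file as a separate message, then exit
$ kafkacat -b kafka:9092 -t my_orders -P -l /data/my_msgs.txt
# Consumer: read the topic from the beginning and keep waiting for new messages
$ kafkacat -b kafka:9092 -t my_orders -C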

How Kafka data is stored on disk

Basically, each topic partition is a commit log: Kafka appends data to the end of the log, which guarantees ordering at the level of a single topic partition (our demo has one topic with one partition). Each appended message is assigned an offset, and consumers read the log starting from the last offset they know about. What’s great about this is that any application can read the data from the last “commit” known to it, at its own pace.
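
To make the offset idea concrete, here is a small experiment you can run against the demo: re-read the partition starting from offset 1, i.e. skipping the first “commit” (the topic name my_orders is again an assumption):

$ docker exec kafka-consumer kafkacat -b kafka:9092 -t my_orders -p 0 -o 1 -C -e
# -p 0 : read partition 0
# -o 1 : start consuming at offset 1 instead of the beginning
# -e   : exit once the end of the partition is reached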

Kafka basic terms

We’ll finish by exploring Kafka terms.

Broker – the server where the data resides.

Consumer – receives messages from a topic.

Producer – sends messages to a topic.

Message – a byte array with no specific format, optionally with an attached key. Messages can be sent with a schema (Avro, JSON, XML) so that they can be validated.

Topic – a way of classifying and structuring messages; roughly the equivalent of a DB table.
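
These terms map directly onto the kafkacat flags used throughout this demo (the topic name is an assumption, as before):

# -b : the broker to connect to (the server where the data resides)
# -t : the topic to read from (how the messages are classified)
# -C : act as a consumer (-P would act as a producer instead)
$ kafkacat -b kafka:9092 -t my_orders -C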

Summary

That’s all for now about Kafka basics. In the next post we’ll explore how to deploy and monitor a highly available Kafka cluster 🙂

Feel free to share.

Bonus: recommended Kafka courses on Pluralsight that I learned from.

Sign up using this link to get exclusive discounts (like 50% off your first month or 15% off an annual subscription).