When the big data movement started it was mostly focused on batch processing. Distributed data storage and querying tools like MapReduce, Hive, and Pig were all designed to process data in batches rather than continuously. Businesses would run multiple jobs every night to extract data from a database, then analyze, transform, and eventually store the data. More recently enterprises have discovered the power of analyzing and processing data and events as they happen, not just once every few hours. Most traditional messaging systems don't scale up to handle big data in realtime, however. So engineers at LinkedIn built and open-sourced Kafka: a distributed messaging framework that meets the demands of big data by scaling on commodity hardware.
Over the past few years, Kafka has emerged to solve a variety of use cases. In the simplest case, it could be a simple buffer for storing application logs. Combined with a technology like Spark Streaming, it can be used to track data changes and take action on that data before saving it to a final destination. Kafka's predictive mode makes it a powerful tool for detecting fraud, such as checking the validity of a credit card transaction when it happens, and not waiting for batch processing hours later.
This two-part tutorial introduces Kafka, starting with how to install and run it in your development environment. You'll get an overview of Kafka's architecture, followed by an introduction to developing an out-of-the-box Kafka messaging system. Finally, you'll build a custom producer/consumer application that sends and consumes messages via a Kafka server. In the second half of the tutorial you'll learn how to partition and group messages, and how to control which messages a Kafka consumer will consume.
What is Kafka?
Apache Kafka is messaging system built to scale for big data. Similar to Apache ActiveMQ or RabbitMq, Kafka enables applications built on different platforms to communicate via asynchronous message passing. But Kafka differs from these more traditional messaging systems in key ways:
- It's designed to scale horizontally, by adding more commodity servers.
- It provides much higher throughput for both producer and consumer processes.
- It can be used to support both batch and real-time use cases.
- It doesn't support JMS, Java's message-oriented middleware API.
Before we explore Kafka's architecture, you should know its basic terminology:
- A producer is process that can publish a message to a topic.
- a consumer is a process that can subscribe to one or more topics and consume messages published to topics.
- A topic category is the name of the feed to which messages are published.
- A broker is a process running on single machine.
- A cluster is a group of brokers working together.
Sign up for CIO Asia eNewsletters.