To process data arriving at rates from tens of thousands to millions of events per second, you need two technologies: first, a streaming system capable of delivering events as fast as they come in; and second, a data store capable of processing each item as fast as it arrives.
Delivering the fast data
Two popular streaming systems have emerged over the past few years: Apache Storm and Apache Kafka. Originally created by Nathan Marz at BackType and open-sourced after Twitter acquired the company, Storm can reliably process unbounded streams of data at rates of millions of messages per second. Kafka, developed by the engineering team at LinkedIn, is a high-throughput distributed message queue. Both systems address the need to process fast data. Kafka, however, stands apart.
Kafka was designed to be a message queue and to solve the perceived problems of existing technologies. It's sort of an über-queue with unlimited scalability, distributed deployments, multitenancy, and strong persistence. An organization could deploy one Kafka cluster to satisfy all of its message queueing needs. Still, at its core, Kafka delivers messages. It doesn't support processing or querying of any kind.
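To make the message-delivery guarantee concrete, the sketch below models Kafka's key-based partitioning in plain Python: a "topic" is a set of append-only logs, and a hash of the message key picks the log. The partition count, the `order-42` key, and the use of MD5 (Kafka's default partitioner actually uses murmur2) are all illustrative assumptions, not Kafka's API.

```python
import hashlib
from collections import defaultdict

NUM_PARTITIONS = 4  # hypothetical partition count for this toy topic

def partition_for(key: str) -> int:
    """Map a message key to a partition by hashing it, roughly as
    Kafka's default partitioner does (md5 here purely for brevity)."""
    digest = hashlib.md5(key.encode()).digest()
    return int.from_bytes(digest[:4], "big") % NUM_PARTITIONS

# A toy "topic": one append-only log per partition.
topic = defaultdict(list)

def produce(key: str, value: str) -> None:
    topic[partition_for(key)].append((key, value))

# Every event for a given key lands in the same partition, in send order.
for event in ["created", "paid", "shipped"]:
    produce("order-42", event)
```

The point of the model: ordering is guaranteed per partition, not across the topic, which is why producers route related events through a common key.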
Processing the fast data
Messaging is only part of a solution. Traditional relational databases tend to be limited in write throughput. Some can store data at high rates, but fall over when expected to validate, enrich, or act on data as it is ingested. NoSQL systems have embraced clustering and high performance, but sacrifice much of the power and safety that traditional SQL-based systems offer. For basic fire hose processing, NoSQL solutions may satisfy your business needs. If you are executing complex queries and business logic against each event, however, in-memory NewSQL solutions can deliver both performance and transactional complexity.
Like Kafka, some NewSQL systems are built around shared-nothing clustering. Load is distributed among cluster nodes for performance. Data is replicated among cluster nodes for safety and availability. To handle increasing loads, nodes can be transparently added to the cluster. Nodes can be removed — or fail — and the rest of the cluster will continue to function. Both the database and the message queue are designed without single points of failure. These features are the hallmarks of systems designed for scale.
In addition, Kafka and some NewSQL systems have the ability to leverage clustering and dynamic topology to scale without sacrificing strong guarantees. Kafka provides message-ordering guarantees, while some in-memory processing engines provide serializable consistency and ACID semantics. Both systems use cluster-aware clients to deliver more features or to simplify configuration. Finally, both achieve redundant durability through disks on different machines, rather than RAID or other local storage schemes.
Big data plumber's toolkit
What do you look for in a system for processing the big data fire hose?
- Look for a system with the redundancy and scalability benefits of native shared-nothing clustering.
- Look for a system that leans on in-memory storage and processing to achieve high per-node throughput.
- Look for a system that offers processing at ingestion time. Can the system perform conditional logic? Can it query gigabytes or more of existing state to inform decisions?
- Look for a system that isolates operations and makes strong guarantees about its operations. This allows users to write simpler code and focus on business problems, rather than handling concurrency problems or data divergence. Beware of systems that offer strong consistency but at greatly reduced performance.
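The checklist's third point, processing at ingestion time, can be sketched as a per-event pipeline that validates, enriches against existing state, and applies conditional logic. The sensor names, the `reference` lookup table, and the alert rule below are hypothetical, standing in for whatever stored state and business logic your system would consult.

```python
from typing import Optional

# Assumed lookup table standing in for gigabytes of existing state.
reference = {"sensor-7": {"site": "plant-2", "max_temp": 90}}

def process(event: dict) -> Optional[dict]:
    # Validate: drop malformed or unknown events at the door.
    if "sensor" not in event or "temp" not in event:
        return None
    meta = reference.get(event["sensor"])
    if meta is None:
        return None
    # Enrich: join the event against stored state.
    enriched = {**event, "site": meta["site"]}
    # Conditional logic: flag readings over this sensor's limit.
    enriched["alert"] = event["temp"] > meta["max_temp"]
    return enriched

result = process({"sensor": "sensor-7", "temp": 95})
```

A system that can run this kind of logic inside the ingest path, against large existing state, spares you from bolting a separate processing tier onto the message queue.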