When you think of streaming analytics, Spark very likely comes to mind. Although Spark processes streams as a series of microbatches rather than event by event, the Structured Streaming functionality introduced in Spark 2.0 is a big step up. You might also think of Storm, a true streaming solution that, in version 1.0, is trying to buck its reputation for being hard to use.
Add Apache Apex, which debuted in June 2015, to your list of stream processing possibilities. Apex comes from DataTorrent and its impressive RTS platform, which includes a core processing engine; a suite of dashboarding, ingesting, and monitoring tools; and dtAssemble, a graphical flow-based programming system aimed at data scientists.
RTS may not have the buzz of Spark, but it has seen production deployments at GE and Capital One. It has demonstrated that it can scale to billions of events per second and respond to events in less than 20ms.
Apex, the processing engine at the core of RTS, is DataTorrent's submission to Apache. Apex is designed to run in your existing Hadoop ecosystem, using YARN to scale up or down as required and leveraging HDFS for fault tolerance. Although it doesn't provide all the bells and whistles of the full RTS platform, Apex delivers the major functionality you'd expect from a data processing platform.
A sample Apex application
Let's look at a very basic Apex pipeline to examine some of the core concepts. In this example, I'll read logging lines from Kafka, taking a count of the types of log lines seen and writing the counts out to the console. I'll include code snippets here, but you can also find the complete application on GitHub.
Apex's core concept is the operator, a Java class that implements methods for receiving input and generating output. (If you know Storm, operators are similar in concept to its bolts and spouts.) In addition, each operator defines a set of ports for either input or output of data. An operator's methods either read input from an InputPort or send data downstream through an OutputPort.
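The operator pattern can be sketched without any Apex dependencies. In real Apex code an operator extends the framework's BaseOperator class and declares DefaultInputPort and DefaultOutputPort fields; in this minimal, illustrative sketch (the class and method names are mine, not the Apex API) the output port is reduced to a plain callback so the shape of the pattern stands alone:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

// Dependency-free sketch of an Apex-style operator. The real framework
// provides DefaultInputPort/DefaultOutputPort; here a Consumer stands in
// for the output port.
class UppercaseOperator {
    // Stand-in for an OutputPort: downstream operators subscribe here.
    private final List<Consumer<String>> downstream = new ArrayList<>();

    void connectOutput(Consumer<String> consumer) {
        downstream.add(consumer);
    }

    // Stand-in for the InputPort's process() method: invoked once per
    // tuple as data arrives, not once per batch.
    void process(String tuple) {
        String result = tuple.toUpperCase();
        for (Consumer<String> c : downstream) {
            c.accept(result); // emit on the "output port"
        }
    }
}
```

Chaining operators by connecting one's output callback to another's process method is, in spirit, what Apex's stream wiring does for you at scale.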
The flow of data through an operator is modeled by breaking the stream down into time-based windows of data, but unlike Spark's microbatching, processing the input data does not have to wait until the end of the window.
In the example below, we need three operators, one for each of the three operator types Apex supports: an input operator for reading lines from Kafka, a generic operator for counting the logging types, and an output operator for writing to the console. For the first and last, we can turn to Apex's Malhar library of prebuilt operators, but we need to implement custom business logic for counting the different types of logging we're seeing.
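The counting logic itself is small. The sketch below is dependency-free: in a real Apex operator the map would live in a class extending BaseOperator, process() would be the handler on a DefaultInputPort&lt;String&gt;, and the counts would be emitted on a DefaultOutputPort at the end of each window. The log format assumed here (the level as the first whitespace-delimited token, e.g. "ERROR disk full") is an illustration, not a requirement:

```java
import java.util.HashMap;
import java.util.Map;

// Dependency-free sketch of the custom log-type counting logic.
class LogTypeCounter {
    private final Map<String, Long> counts = new HashMap<>();

    // Called once per incoming log line, as each tuple arrives.
    void process(String line) {
        String level = line.trim().split("\\s+", 2)[0];
        counts.merge(level, 1L, Long::sum);
    }

    // In Apex this emission would happen in endWindow(), so downstream
    // operators see one set of counts per streaming window.
    Map<String, Long> snapshotAndReset() {
        Map<String, Long> out = new HashMap<>(counts);
        counts.clear();
        return out;
    }
}
```

Note that counting happens tuple by tuple as lines arrive; only the emission of results is tied to the window boundary, which is the distinction from microbatching described above.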