A long-standing joke about the Hadoop ecosystem is that if you don't like an API for a particular system, wait five minutes and two new Apache projects will spring up with shiny new APIs to learn.
It's a lot to keep up with. Worse, it leads to a lot of work migrating between projects merely to stay current. "We've implemented our streaming solution in Storm! Now we've redone it in Spark! We're currently undergoing a rewrite of the core in Apache Flink (or Apex)! ... and we've forgotten what business case we were attempting to solve in the first place."
Enter Apache Beam, a new project that attempts to unify data processing frameworks with a core API, allowing easy portability between execution engines.
Now, I know what you're thinking about the idea of throwing another API into the mix. But Beam has a strong heritage. It comes from Google and its research in the MillWheel and FlumeJava papers, as well as operational experience in the years following their publication. It defines a somewhat familiar directed acyclic graph data processing engine capable of handling unbounded streams of data where out-of-order delivery is the norm rather than the exception.
But wait, I hear some of you cry. Isn't that Google Cloud Dataflow? Yes! And no. Google Cloud Dataflow is a fully managed service where you write applications using the Dataflow SDK and submit them to run on Google's servers. Apache Beam, on the other hand, is simply the Dataflow SDK and a set of "runners" that map the SDK primitives to a particular execution engine. Yes, you can run Apache Beam applications on Google Cloud Dataflow, but you can also use Apache Spark or Apache Flink with little to no change in your code.
Ride with Apache Beam
There are four principal concepts of the Apache Beam SDK:
- Pipeline: If you've worked with Spark, this is somewhat analogous to the SparkContext. All your operations will begin with the pipeline object, and you'll use it to build up data streams from input sources, apply transformations, and write the results out to an output sink.
- PCollection: PCollections are similar to Spark's Resilient Distributed Dataset (RDD) primitive, in that they contain a potentially unbounded stream of data. They are built by pulling data from input sources and then applying transformations.
- Transforms: A processing step that operates on a PCollection to perform data manipulation. A typical pipeline will likely have multiple transforms operating on an input source (for example, converting a set of incoming strings of log entries into key/value pairs, where the key is an IP address and the value is the log message). The Beam SDK comes with a series of standard aggregations built in, and you can define your own for custom processing needs.
- I/O sources and sinks: Lastly, sources and sinks provide input and output endpoints for your data.