Apache Beam wants to be uber-API for big data

Ian Pointer | April 14, 2016
New, useful Apache big data projects seem to arrive daily. Rather than relearning your way around each one, what if you could work through a unified API?

Let's look at a complete Beam program. For this, we'll use the still-quite-experimental Python SDK and the complete text of Shakespeare's "King Lear":

import re

import google.cloud.dataflow as df

# Construct a pipeline that uses the local test runner.
p = df.Pipeline('DirectPipelineRunner')

(p
 | df.Read('read',
           df.io.TextFileSource(
               'gs://dataflow-samples/shakespeare/kinglear.txt'))
 | df.FlatMap('split', lambda x: re.findall(r'\w+', x))   # break each line into words
 | df.combiners.Count.PerElement('count words')           # count each distinct word
 | df.Write('write', df.io.TextFileSink('./results')))

p.run()

After importing the regular expression and Dataflow libraries, we construct a Pipeline object and pass it the runner that we wish to use (in this case, we're using DirectPipelineRunner, which is the local test runner).

From there, we read from a text file (its location pointing to a Google Cloud Storage bucket) and perform two transformations. The first is FlatMap, to which we pass a regular expression that breaks each line into words, returning a PCollection of all the separate words in "King Lear." Then we apply the built-in Count operation to do our word count.
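
Count.PerElement emits a PCollection of (word, count) pairs, so the results file will hold one serialized pair per line. If you want friendlier output, a formatting step can be slotted in before the write -- a minimal sketch, assuming the SDK's Map transform (the 'format' label and the output format are illustrative, not part of the original example):

(p
 | df.Read('read',
           df.io.TextFileSource(
               'gs://dataflow-samples/shakespeare/kinglear.txt'))
 | df.FlatMap('split', lambda x: re.findall(r'\w+', x))
 | df.combiners.Count.PerElement('count words')
 # Turn each (word, count) pair into a readable line (Python 2 lambda unpacking).
 | df.Map('format', lambda (word, count): '%s: %d' % (word, count))
 | df.Write('write', df.io.TextFileSink('./results')))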

The final part of the pipeline writes the results of the Count operation to disk. Once the pipeline is defined, it is invoked with the run() method. In this case, the pipeline is submitted to the local test runner, but by changing the runner type, we could submit to Google Cloud Dataflow, Flink, Spark, or any other runner available to Apache Beam.

Runners dial zero

Once the application is ready, it can be submitted to run on Google Cloud Dataflow with no trouble, as it is simply using the Dataflow SDK.
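
In the pipeline above, that amounts to swapping the runner string when the Pipeline is built. A rough sketch, hedged: the runner name below follows the naming in the Dataflow SDK of the time, but a remote submission also needs pipeline options (a GCP project, a gs:// staging location, and so on) that are omitted here:

# Submit to the hosted service instead of running locally
# (runner name assumed from the Dataflow SDK of the time; check your SDK version).
p = df.Pipeline('DataflowPipelineRunner')
# A real run also needs options such as a project id and staging location,
# supplied through the SDK's pipeline-options mechanism.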

The idea is that runners will be provided for other execution engines. Beam currently includes runners supplied by data Artisans and Cloudera for Apache Flink and Apache Spark, respectively. This is where some of the current wrinkles of Beam come into play, because the Dataflow model does not always map easily onto other platforms.

A capability matrix available on the Beam website shows you which features are and are not supported by each runner. In particular, there are extra hoops you need to jump through in your code to get the application working on the Spark runner. It's only a few lines of extra code, but it isn't a seamless transition.

It's also interesting to note that the Spark runner is currently implemented using Spark's RDD primitive rather than DataFrames. As this bypasses Spark's Catalyst optimizer, it's almost certain right now that a Beam job running on Spark will be slower than running a DataFrame version. I imagine this will change when Spark 2.0 is released, but it's definitely a limitation of the Spark runner over and above what's presented in the capability matrix.
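
For a sense of what that "DataFrame version" looks like, here is a rough PySpark sketch of the same word count. It is illustrative only -- the sqlContext, input path, and column handling are assumptions -- but it is the style of job that Catalyst can optimize, which the RDD-based Beam runner currently forgoes:

from pyspark.sql import functions as F

# Read the play as a DataFrame with one line per row in a 'value' column.
lines = sqlContext.read.text('kinglear.txt')

# Split each line on non-word characters, one row per word, then count.
counts = (lines
          .select(F.explode(F.split(F.col('value'), r'\W+')).alias('word'))
          .where(F.col('word') != '')
          .groupBy('word')
          .count())

counts.write.json('./results')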

At the moment, Beam only includes runners for Google Cloud Dataflow, Apache Spark, Apache Flink, and a local runner for testing purposes -- but there's talk of creating runners for frameworks like Storm and MapReduce. In the case of MapReduce, any eventual runner will only be able to support a subset of what Apache Beam provides, as it can work only with what the underlying system offers. (No streaming for you, MapReduce!)

 
