Why Google Cloud Dataflow is no Hadoop killer

Serdar Yegulalp | June 30, 2014
Google's new data processing service may look like it's designed to lure users away from Hadoop, but its focus is more selective.

Unveiled earlier this week, Google's Cloud Dataflow service clearly competes against Amazon's streaming-data processing service Kinesis and big data products like Hadoop, particularly since Cloud Dataflow is built on technology that Google claims replaces the algorithms behind Hadoop.

But on closer look, Cloud Dataflow is better thought of as a way for Google Cloud users to enrich the applications they develop — and the data they deposit — with analytics components. A Hadoop killer? Probably not.

Google bills the service as "the latest step in our effort to make data and analytics accessible to everyone," with an emphasis on the application you're writing rather than the data you're manipulating.

Significantly, Google Cloud Dataflow is meant to replace MapReduce, the software at the heart of Hadoop and other big data processing systems. MapReduce was originally developed by Google and later open-sourced, but Urs Hölzle, senior vice president of technical infrastructure, declared in the Google I/O keynote on Wednesday that "we [at Google] don't really use MapReduce anymore."

In place of MapReduce, Google uses two other projects, Flume and MillWheel, that apparently influenced Dataflow's design. The former lets you manage parallel pipelines for data processing, which MapReduce didn't provide on its own. The latter is described as "a framework for building low-latency data-processing applications," and has apparently been in wide use at Google for some time.

Most prominently, Cloud Dataflow is touted as superior to MapReduce in the amount of data that can be processed efficiently. Hölzle claimed MapReduce's performance began to suffer once the amount of data reached the multipetabyte range. For perspective, Facebook claimed in 2012 it had a 100-petabyte Hadoop cluster, although the company did not go into detail about how much custom modification was used or even if MapReduce itself was still in operation.

Ovum analyst Tony Baer sees Google Cloud Dataflow as "part of an overriding trend where we are seeing an explosion of different frameworks and approaches for dissecting and analyzing big data. Where once big data processing was practically synonymous with MapReduce," he said in an email, "you are now seeing frameworks like Spark, Storm, Giraph, and others providing alternatives that allow you to select the approach that is right for the analytic problem."

Hadoop itself seems to be tilting away from MapReduce in favor of more advanced (if demanding) processing engines, such as Apache Spark. "Many problems do not lend themselves to the two-step process of map and reduce," explained InfoWorld's Andy Oliver, "and for those that do, Spark can do map and reduce much faster than Hadoop can."

Baer concurs: "From the looks of it, Google Cloud Dataflow seems to have a resemblance to Spark, which also leverages memory and avoids the overhead of MapReduce."
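
For readers unfamiliar with the "two-step process of map and reduce" that Oliver describes, the sketch below shows the general shape of the pattern using a word count, the customary introductory example. It is plain, single-machine Python written for illustration only; it is not Hadoop, Spark, or Cloud Dataflow code, and the function names (map_phase, reduce_phase, word_count) are invented for this example.

```python
from collections import defaultdict


def map_phase(documents):
    """Map step: emit a (word, 1) pair for every word in every document."""
    for doc in documents:
        for word in doc.lower().split():
            yield word, 1


def reduce_phase(pairs):
    """Reduce step: group the pairs by word and sum the counts per word."""
    totals = defaultdict(int)
    for word, count in pairs:
        totals[word] += count
    return dict(totals)


def word_count(documents):
    """Run the two steps back to back on an in-memory list of strings."""
    return reduce_phase(map_phase(documents))


if __name__ == "__main__":
    docs = ["big data big clusters", "data pipelines"]
    print(word_count(docs))
```

A job that fits this mold is easy to spread across many machines, since each map call and each per-key sum is independent. A job that needs several dependent passes over the data has to be expressed as a chain of such steps, which is the kind of poor fit Oliver is pointing to.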

