
16 for '16: What you must know about Hadoop and Spark right now

Andrew C. Oliver | Jan. 8, 2016
Amazingly, Hadoop has been redefined in the space of a year. Let's take a look at all the salient parts of this roiling ecosystem and what they mean.

Technologies I'd rather forget

Here's the stuff I am happily throwing under the bus. I have that luxury because new technologies have emerged to perform the same functions better.

Oozie: At All Things Open this year, Ricky Saltzer from Cloudera defended Oozie and said it was good for what it was originally intended to do -- that is, chain a couple of MapReduce jobs together -- and that dissatisfaction with Oozie stemmed from people overextending its purpose. I still say Oozie was bad at all of it.

Let's make a list: error-hiding, features that don't work or work differently than documented, totally incorrect documentation with XML errors in it, a broken validator, and more. Oozie simply blows. It was written poorly, and even elementary tasks become week-long travails when nothing works right. You can tell who actually works with Hadoop on a day-to-day basis versus who only talks about it because the professionals hate Oozie more. With NiFi and other tools taking over, I don't expect to use Oozie much anymore.

MapReduce: The processing heart of Hadoop is on the way out. A DAG (directed acyclic graph) execution model is a better use of resources than chaining separate MapReduce jobs. Spark does this in memory with a nicer API. The economic reasons that justified sticking with MapReduce recede as memory gets ever cheaper and the move to the cloud accelerates.
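
To make the contrast concrete, here is a minimal sketch in plain Python. It is not Hadoop or Spark code; the `Pipeline` class and both function names are hypothetical, invented only for illustration. The first function mimics chained jobs that each write out a full intermediate result, while the second records its steps as a small linear DAG and streams records through all of them in one pass when a result is requested.

```python
def mapreduce_style(records):
    """Two chained jobs; each job's full output is materialized before the next starts."""
    intermediates = []  # stand-in for results written out between jobs

    doubled = [r * 2 for r in records]            # job 1: map
    intermediates.append(doubled)

    kept = [r for r in doubled if r % 3 == 0]     # job 2: filter
    intermediates.append(kept)

    return sum(kept), len(intermediates)


class Pipeline:
    """Hypothetical lazy pipeline: steps are recorded first and run only on demand."""

    def __init__(self, source, steps=()):
        self.source = source
        self.steps = tuple(steps)

    def map(self, fn):
        # Record the step; nothing is computed yet.
        return Pipeline(self.source, self.steps + (("map", fn),))

    def filter(self, pred):
        return Pipeline(self.source, self.steps + (("filter", pred),))

    def sum(self):
        # Chain lazy iterators so each record flows through every step
        # without building an intermediate list between steps.
        stream = iter(self.source)
        for kind, fn in self.steps:
            stream = map(fn, stream) if kind == "map" else filter(fn, stream)
        return sum(stream)


def dag_style(records):
    """Same computation as mapreduce_style, expressed as one planned pipeline."""
    return Pipeline(records).map(lambda r: r * 2).filter(lambda r: r % 3 == 0).sum()
```

For the numbers 1 through 10, both versions arrive at 36, but the first builds two intermediate lists along the way and the second builds none. A real engine does far more (partitioning, shuffles, fault tolerance), so treat this only as an intuition for why planning the whole graph up front saves resources.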

Tez: To some degree, Tez is a road not taken -- or a Neanderthal branch of the evolutionary tree of distributed computing. Like Spark, it uses a DAG execution model, although one of its developers described it as an assembly language.

As with MapReduce, the economic rationale (disk versus memory) for using Tez is receding. The main reason to continue using it: The Spark bindings for some popular Hadoop tools are less mature or not ready at all. However, with Hortonworks joining the move to Spark, it seems unlikely Tez will have a place by the end of the year. If you don't know Tez by now, don't bother.

Now's the time

The Hadoop/Spark realm changes constantly. Despite some fragmentation, the core is about to become a lot more stable as the ecosystem coalesces around Spark.

The next big push will be around governance and application of the technology, along with tools to make cloudification and containerization more manageable and straightforward. Such progress presents a major opportunity for vendors that missed out on the first wave.

Good timing, then, to jump into big data technologies if you haven't already. Things evolve so quickly, it's never too late. Meanwhile, vendors with legacy MPP cube analytics platforms should prepare to be disrupted.

Source: InfoWorld

