The biggest thing you need to know about Hadoop is that it isn't Hadoop anymore.
Between Cloudera sometimes swapping out HDFS for Kudu while declaring Spark the center of its universe (thus replacing MapReduce everywhere it is found) and Hortonworks joining the Spark party, the only item you can be sure of in a "Hadoop" cluster is YARN. Oh, but Databricks, aka the Spark people, prefers Mesos over YARN -- and by the way, Spark doesn't require HDFS.
Yet distributed filesystems are still useful. Business intelligence is a great use case for Cloudera's Impala, and Kudu, a distributed columnar store, is optimized for it. Spark is great for many tasks, but sometimes you need an MPP (massively parallel processing) solution like Impala to do the trick -- and Hive remains a useful file-to-table management system. Even when you're not using Hadoop because you're focused on in-memory, real-time analytics with Spark, you may still end up using pieces of Hadoop here and there.
By no means is Hadoop dead, although I'm sure that's what the next Gartner piece will say. But by no means is it only Hadoop anymore.
What in this new big data Hadoopy/Sparky world do you need to know now? I covered this topic last year, but there's so much new ground, I'm pretty much starting from scratch.
Spark is as fast as you've heard it is -- and, more important, the API is much easier to use and requires less code than previous distributed computing paradigms. With IBM promising 1 million new Spark developers and a boatload of money for the project, Cloudera declaring through its One Platform initiative that Spark is the center of everything we know to be good, and Hortonworks giving its full support, we can safely say the industry has crowned its Tech Miss Universe (hopefully getting it right this time).
Economics is also driving Spark's ascendance. Processing data in memory was once costly, but with cloud computing and increased computing elasticity, the number of workloads that can't be loaded into memory (at least on a distributed computing cluster) is diminishing. Again, we're not talking about all your data, but the subset you need in order to calculate a result.
Spark is still rough around the edges -- we've really seen this when working with it in a production environment -- but the warts are worth it. It really is that much faster and altogether better.
The irony is that the loudest buzz around Spark relates to streaming, which is Spark's weakest point. Cloudera had that deficiency in mind when it announced its intention to make Spark Streaming work for 80 percent of use cases. Nonetheless, you may still need to explore alternatives for subsecond or high-volume data ingestion (as opposed to analytics).