Subscribe / Unsubscribe Enewsletters | Login | Register

Pencil Banner

16 for '16: What you must know about Hadoop and Spark right now

Andrew C. Oliver | Jan. 8, 2016
Amazingly, Hadoop has been redefined in the space of a year. Let's take a look at all the salient parts of this roiling ecosystem and what they mean

There are good reasons to use alternatives like Cassandra. But if you use Hadoop you already have HBase -- if you've purchased support from a Hadoop vendor, you already have HBase support -- so it's a good place to start. After all, it's a low-latency, persistent datastore that can provide a reasonable amount of ACID support. If Hive and Impala underwhelm you with their SQL performance, you'll find HBase and Phoenix faster for some datasets.

6. Impala

Teradata and Netezza use MPP to process SQL queries across distributed storage. Impala is essentially an MPP solution built on HDFS.

The biggest difference between Impala and Hive is that, when you connect your favorite BI tool, "normal stuff" will run in seconds rather than in minutes. Impala can replace Teradata and Netezza for many applications. Other structures may be necessary for different types of queries or analysis (for those, look toward stuff like Kylin and Phoenix). But generally, Impala lets you escape your hated proprietary MPP system, use one platform for structured and unstructured data analysis, and even deploy to the cloud.

There's plenty of overlap with using straight Hive, but Impala and Hive operate in different ways and have different sweet spots. Impala is supported by Cloudera and not Hortonworks, which supports Phoenix instead. While operating Impala is less complex, you can achieve some of the same goals with Phoenix, toward which Cloudera is now moving.

7. HDFS (Hadoop Distributed File System)

Thanks to the rise of Spark and ongoing migration to the cloud for so-called big data projects, HDFS is less fundamental than it was last year. But it's still the default and one of the more conceptually simple implementations of a distributed filesystem.

8. Kafka

Distributed messaging such as that offered by Kafka will make client-server tools like ActiveMQ completely obsolete. Kafka is used in many -- if not most -- streaming projects. It's also really simple. If you've used other messaging tools, it may feel a little primitive, but in the majority of cases, you don't need the granular routing options MQ-type solutions offer anyway.

9. Storm/Apex

Spark isn't so great at streaming, but what about Storm? It's faster, has lower latency, and uses less memory -- which is important when ingesting streaming data at scale. On the other hand, Storm's management tools are doggy-doo and the API isn't as nice as Spark's. Apex is newer and better, but it's not widely deployed yet. I'd still default to Spark for everything that doesn't need to be subsecond.

10. Ambari/Cloudera Manager

I've seen people try and monitor and manage Hadoop clusters without Ambari or Cloudera Manager. It isn't pretty. These solutions have taken management and monitoring of Hadoop environments a long way in a relatively short period of time. Compare this to the NoSQL space, which is nowhere near as advanced in this department -- despite simpler software with far fewer components -- and you gotta wonder where those NoSQL guys spent their massive funding. 


Previous Page  1  2  3  4  5  6  Next Page 

Sign up for CIO Asia eNewsletters.