* Extending functionality for specific tasks. Companies that offer Hadoop distributions are always looking to add modules that introduce new capabilities to the framework. Cloudera's distribution, for instance, contains full-text search and Impala, an engine for real-time processing of data stored in HDFS using SQL queries. Hortonworks has added support for SQL semantics in the Stinger Initiative and is developing Apache Tez, a new architecture that would help to accelerate iterative tasks by eliminating unnecessary tasks and improving reads from and writes to HDFS. WANdisco provides cross-data-center replication with its Non-Stop Hadoop technology.
Today, Hadoop is not only an integral part of the big data ecosystem but also a central force that has given a new start to a set of related tools. Though the adoption of Hadoop 1.0 for enterprise systems was limited to particular types of workloads, the situation will change with YARN.
The new architecture extends the range of cases that can be addressed with Hadoop. If used together with Storm, for instance, it would accelerate the processing of unbounded streams of data; in combination with Spark, it would foster data analytics initiatives; and with Tez, it would make iterative algorithms work much faster.
This article gives an overview of ecosystem trends only and does not compare performance. It is still hard to find performance results for real-life YARN clusters based on Hadoop distributions. Exhaustive figures are available for Hadoop 1.0 only: here (already updated with Hive on HDP 2.0) and here (Hortonworks vs. Cloudera vs. MapR).
The reason is simple. In the case of Cloudera, the new architecture is still in beta; MapR has scheduled its 2.0 release for March 2014. Most other vendors are also still in development. So, it will be interesting to compare the performance of Hadoop 1.0 vs. 2.0 in action and find out how the difference affects an overall cluster built on top of a Hadoop distribution.