In addition, we're expecting to see rapid growth of applications that run on YARN in the near future. Apache Giraph (for analyzing graphs, e.g., social connections on Facebook), Spark (machine learning and data mining), Apache Hama (machine learning and graph algorithms), Storm (real-time processing of unbounded data streams), and others are adjusting to the new architecture.
Hadoop distributions tomorrow
There are several trends shaping the evolution of Hadoop distributions:
* YARN adoption. Hadoop 2.0 supports larger clusters, which enables running more computations simultaneously. It received a new cluster management system that fits a broader range of tasks, including support for more flexible data processing and consolidation algorithms. As a result, Cloudera and Hortonworks were actively adopting it through 2013. Since MapR used some proprietary components in its distribution, it needed a bit more time. The release of MapR 2.0 that supports YARN is scheduled for March 2014. While MapR still uses its own file system instead of the default HDFS, the vendor appears to be shifting toward wider use of open source Hadoop code, since it now offers more support for various open Hadoop components.
* Third-party integration for data consolidation. Hadoop distributions are integrating with third-party solutions for analyzing data. Cloudera, for instance, added connectors for binding CDH (Cloudera's Distribution Including Apache Hadoop) with data analysis and reporting systems, such as Oracle, Tableau, and Teradata. CDH supports Talend Open Studio for Big Data, an easy-to-use graphical environment that allows developers to visually map big data sources and targets without the need to learn and write complicated code. This tool contains 450+ connectors for getting data from a variety of data sources.
* Significant performance improvements. Cloudera recently announced Spark support. Because Spark performs computations in memory, it can greatly speed up data processing, by up to 100x in some cases. Hortonworks is also working on improving computing speed. The company initiated Stinger, a project aimed at making Apache Hive queries up to 100x faster. It is also working to optimize how data is stored in order to speed up its processing.
Apache Drill, a project backed by MapR, aims to solve similar problems. It is based on the model published by Google in the white paper "Dremel: Interactive Analysis of Web-Scale Datasets." However, the project is quite new and may not be ready for production deployments.
Pivotal Software delivered Pivotal HD, a Hadoop distribution that features HAWQ, a proprietary component that the company says can process SQL-like queries 318x faster than Hive. Unfortunately, no independent evaluation has confirmed these results.
If you are interested in third-party performance benchmarks of similar systems with massively parallel processing (MPP) architectures, you can check the figures published by UC Berkeley's AMPLab.
* Data security. Hadoop vendors can be expected to work harder on securing data access, restricting permissions, and addressing a wider range of data protection issues.