Comparing the top Hadoop distributions

Kirill Grigorchuk | June 30, 2014
Hadoop introduced a new way to simplify the analysis of large data sets, and in a very short time reshaped the big data market. In fact, today Hadoop is often synonymous with the term big data.

Since Hadoop is an open source project, a number of vendors have developed their own distributions, adding new functionality or improving the code base. This article by Altoros, a big data specialist, provides an overview of the major distributions, describing how they differ from the standard edition.

A standard open source Hadoop distribution (Apache Hadoop) includes:

The Hadoop MapReduce framework for running computations in parallel

The Hadoop Distributed File System (HDFS)

Hadoop Common, a set of libraries and utilities used by other Hadoop modules

This is only a basic set of Hadoop components. Other solutions, such as Apache Hive, Apache Pig, and Apache ZooKeeper, are widely used alongside them to solve specific tasks, speed up computations, and automate routine work.
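To make the MapReduce component listed above more concrete, here is a minimal, single-machine sketch of the map, shuffle, and reduce steps, using word counting as the task. It is not Hadoop code and does not use any Hadoop API: the function names (`map_phase`, `shuffle`, `reduce_phase`, `word_count`) are hypothetical, and a local thread pool stands in for the cluster nodes that would run map tasks in parallel.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor


def map_phase(line):
    """Emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(mapped):
    """Group emitted values by key, the step a framework performs between phases."""
    groups = defaultdict(list)
    for pairs in mapped:
        for key, value in pairs:
            groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Combine all values collected for one key into a single result."""
    return key, sum(values)


def word_count(lines, workers=4):
    # Assumption: a local thread pool stands in for parallel map tasks on a cluster.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        mapped = list(pool.map(map_phase, lines))
    groups = shuffle(mapped)
    return dict(reduce_phase(key, values) for key, values in groups.items())


if __name__ == "__main__":
    sample = ["big data big clusters", "data nodes store data"]
    print(word_count(sample))
```

In a real cluster the input lines would be read from blocks stored in HDFS, and the map and reduce tasks would be scheduled across many machines rather than threads in one process.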

Vendor distributions are, of course, designed to overcome issues with the open source edition and provide additional value to customers, with a focus on things such as:

Reliability. The vendors react faster when bugs are detected. They promptly deliver fixes and patches, which makes their solutions more stable.

Support. A variety of companies provide technical assistance, which makes it possible to adopt the platforms for mission-critical and enterprise-grade tasks.

Completeness. Very often Hadoop distributions are supplemented with other tools to address specific tasks.

In addition, vendors participate in improving the standard Hadoop distribution by giving back updated code to the open source repository, fostering the growth of the overall community.

Three of the top Hadoop distributions are provided by Cloudera, MapR and Hortonworks. The chart below illustrates the results of the market research Big Data Vendor Revenue and Market Forecast 2012-2017. It compares the revenue of these major Hadoop vendors in 2012.

While Cloudera and Hortonworks claim they are 100% open source, MapR adds some proprietary components to the M3, M5, and M7 Hadoop distributions to improve the framework's stability and performance.

Along with Cloudera, MapR and Hortonworks, Hadoop distributions are available from IBM, Intel, Pivotal Software, and others. These distributions may even be shipped as a part of a software suite (e.g., IBM's distribution), or designed to solve specific tasks (e.g., Intel's distribution optimized for the Xeon microprocessor).

Key features of three popular Hadoop distributions

The values in the cells of the table refer to the versions of the corresponding components available in a particular Hadoop distribution. For performance comparisons, see our Hadoop Distributions: Cloudera vs. Hortonworks vs. MapR study.

In earlier Hadoop releases, a single NameNode managed the whole cluster. It handled all metadata operations and kept the metadata in RAM. Scalability was limited to approximately 4,000 nodes and 40,000 tasks, and the NameNode was a single point of failure.
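The single-NameNode design described above can be pictured with a toy model. The sketch below is purely illustrative: `ToyNameNode` and its methods are hypothetical names, not part of any Hadoop API, and the model ignores on-disk persistence entirely. It only shows why every metadata request depends on one process holding the namespace in memory.

```python
class ToyNameNode:
    """Hypothetical single metadata server: maps file paths to block IDs, held in RAM."""

    def __init__(self):
        self.namespace = {}  # path -> list of block IDs, kept only in memory here
        self.alive = True

    def _check(self):
        # Every metadata operation goes through this one process.
        if not self.alive:
            raise RuntimeError("NameNode unavailable: metadata cannot be served")

    def create(self, path, block_ids):
        """Record which blocks make up a file."""
        self._check()
        self.namespace[path] = list(block_ids)

    def locate(self, path):
        """Return the block IDs for a file."""
        self._check()
        return self.namespace[path]

    def crash(self):
        """Simulate a failure of the only metadata server."""
        self.alive = False


if __name__ == "__main__":
    name_node = ToyNameNode()
    name_node.create("/logs/a.txt", ["blk_1", "blk_2"])
    print(name_node.locate("/logs/a.txt"))
    name_node.crash()
    try:
        name_node.locate("/logs/a.txt")
    except RuntimeError as error:
        print(error)
```

Once the single server stops, no client can resolve a file to its blocks, even though the data blocks themselves are untouched on the other nodes.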

 
