
Comparing the top Hadoop distributions

Kirill Grigorchuk | June 30, 2014
Hadoop introduced a new way to simplify the analysis of large data sets, and in a very short time reshaped the big data market. In fact, today Hadoop is often synonymous with the term big data.

Hadoop 1.0, however, had a number of limitations:

It was impossible to update Hadoop components on only some of the nodes.

The MapReduce paradigm could be applied to only a limited range of tasks.

There were no data processing models other than MapReduce.

Cluster resources were not utilized in the most effective way.

While most distributions were developed to address these limitations, they did not introduce any significant architectural changes compared to the open source version. That's what made Hadoop 2.0 a real breakthrough when it emerged in 2013. In particular, it features YARN (Yet Another Resource Negotiator), a new cluster management system that turns Hadoop from a batch data processing solution into a real multi-application platform. The updated version brought the following improvements:

The vulnerability of a system with a single NameNode (a single point of failure) was eliminated.

The possible number of nodes in a cluster was greatly increased.

YARN extended the range of tasks that can be successfully solved with Hadoop.

The figure below illustrates the multi-application principle implemented in Hadoop 2.0, and shows that YARN is actually a layer between HDFS and data processing applications.

The main idea of YARN is to split up two major tasks, resource management and scheduling, into two separate concepts. YARN has a central ResourceManager and an ApplicationMaster, which is created for each application separately. This approach allows for running batch, interactive, in-memory, streaming, online, graph, and other types of applications simultaneously. Figures 3 and 4 below demonstrate the architectural differences between the two Hadoop versions.

Hadoop 1.0 had a single JobTracker, which had to deal with thousands of TaskTrackers and MapReduce tasks. This architecture limited scalability and allowed a cluster to run only a single application at a time.

Hadoop 2.0 has a single ResourceManager and multiple ApplicationMasters. Since each application is managed by a separate ApplicationMaster, application management is no longer a bottleneck in a cluster. As stated in the notes from the Hortonworks development team, they were able to simulate 10,000-node clusters composed of modern hardware without significant issues. Separating cluster management tasks from the application life cycle resulted in greatly improved cluster scalability.
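To make the division of labor concrete, here is a minimal toy model in Python. It is not the YARN API: the class names only echo the article's terms, the methods (request, release, start, finish) are hypothetical, and containers are reduced to a simple count. The sketch assumes one global ResourceManager that only hands out capacity, while each application gets its own ApplicationMaster that tracks that application's life cycle.

```python
from dataclasses import dataclass, field


@dataclass
class ResourceManager:
    """Toy global manager: it only tracks how many containers each app holds."""
    total_containers: int
    allocated: dict = field(default_factory=dict)

    def free(self) -> int:
        return self.total_containers - sum(self.allocated.values())

    def request(self, app_id: str, wanted: int) -> int:
        # Grant as many containers as are currently free, up to the request.
        granted = min(wanted, self.free())
        self.allocated[app_id] = self.allocated.get(app_id, 0) + granted
        return granted

    def release(self, app_id: str) -> int:
        # Return all containers held by the application to the pool.
        return self.allocated.pop(app_id, 0)


@dataclass
class ApplicationMaster:
    """Toy per-application manager: one instance per running application."""
    app_id: str
    kind: str  # e.g. "batch" or "streaming"; purely illustrative
    rm: ResourceManager
    containers: int = 0

    def start(self, wanted: int) -> int:
        self.containers = self.rm.request(self.app_id, wanted)
        return self.containers

    def finish(self) -> None:
        self.rm.release(self.app_id)
        self.containers = 0


rm = ResourceManager(total_containers=10)
batch = ApplicationMaster("app-1", "batch", rm)
stream = ApplicationMaster("app-2", "streaming", rm)

granted_batch = batch.start(6)    # 6 containers granted
granted_stream = stream.start(6)  # only 4 are left, so 4 are granted
free_while_running = rm.free()    # 0
batch.finish()
free_after_batch = rm.free()      # 6

print(granted_batch, granted_stream, free_while_running, free_after_batch)
```

In this sketch two applications of different kinds run side by side on one cluster, and the ResourceManager never needs to know what either of them is doing internally, which is the property the article credits for the improved scalability.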

At the same time, with a global ResourceManager, YARN provides much better resource utilization, which also helps speed up a cluster. YARN allows for running different applications that share a common pool of resources. There are no pre-defined Map and Reduce slots, which helps to better utilize resources inside a cluster.
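The effect of dropping pre-defined slots can be illustrated with simple arithmetic. The sketch below is a simplification under stated assumptions: the function names and the numbers (a 10-unit cluster split into 6 map slots and 4 reduce slots, and a workload of 12 map tasks) are made up for illustration, and real YARN containers are sized by resources such as memory rather than counted as interchangeable units.

```python
def running_with_fixed_slots(map_tasks, reduce_tasks, map_slots, reduce_slots):
    """Slot-style toy model: each slot type can serve only its own task type."""
    return min(map_tasks, map_slots) + min(reduce_tasks, reduce_slots)


def running_with_shared_pool(map_tasks, reduce_tasks, containers):
    """Pool-style toy model: any free container can serve any task."""
    return min(map_tasks + reduce_tasks, containers)


capacity = 10  # hypothetical cluster capacity, in task-sized units

# A map-heavy phase: 12 map tasks waiting, no reduce tasks yet.
fixed = running_with_fixed_slots(12, 0, map_slots=6, reduce_slots=4)
shared = running_with_shared_pool(12, 0, containers=capacity)

print(f"fixed slots: {fixed}/{capacity} busy")   # fixed slots: 6/10 busy
print(f"shared pool: {shared}/{capacity} busy")  # shared pool: 10/10 busy
```

With fixed slots, the four reduce slots sit idle during the map-heavy phase. With a shared pool, the same capacity is fully used.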

The ability to run non-MapReduce tasks inside Hadoop turned YARN into a next-generation data processing tool. Hadoop 2.0 features additional programming models, such as graph processing and iterative modeling, which extended the range of tasks that can be solved using this tool.

 
