The latest release of Apache Hadoop code includes a new workload management tool that backers of the project say will make it easier for developers to build applications for the big data platform.
Hadoop has proven itself as a powerful way for some of the world's leading technology companies, such as Yahoo and Google, to manage large amounts of data. Hadoop systems have thus far relied on MapReduce to process data, but the latest iteration of the open source code includes Yarn, a platform for running other applications within Hadoop alongside MapReduce. Yarn monitors the resources applications need and then provisions that capacity within the distributed computing system.
Hadoop enthusiasts say this is an important feature to let more applications run within the big data open system and could lead to a wave of new analytics apps for Hadoop. "Yarn is on the critical path to Hadoop having better resource management and supporting mixed workloads and usages," says Gartner information management analyst Merv Adrian, who tracks Hadoop. "It fixes some major gaps and will enable some exciting developments in the years ahead."
The 2.0 version adds a number of capabilities, including an architecture designed for high availability and greater scale for individual clusters, which can now grow to 4,000 machines (a Hadoop deployment can consist of multiple clusters). The biggest change, though, is the addition of Yarn, which has been in planning for four years and under development for two, and which some have described as a next-generation MapReduce architecture.
Yarn separates two major functions that MapReduce previously combined into one: resource management and job scheduling/monitoring. It works by monitoring what resources applications need, then creating containers of CPU and RAM on cluster nodes to serve those apps. "Yarn is fundamentally simple, but extremely scalable," says Arun Murthy, co-founder of Hadoop distribution company Hortonworks, who has been in charge of developing Yarn within the Apache open source community. Blogger Brian Proffitt at ReadWrite notes that Yarn removes the "one-at-a-time" limitation of apps running on Hadoop, allowing Hadoop systems to run multiple applications at once.
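To make the container idea concrete, here is a toy sketch in Python of a resource manager that hands out slices of CPU and memory to several applications sharing one cluster. It is not Yarn's actual API: the names `Node`, `Container`, `ToyResourceManager`, `request` and `release`, the first-fit placement rule, and the node sizes are all assumptions made up for illustration.

```python
from dataclasses import dataclass


@dataclass
class Node:
    """A cluster machine with some free CPU cores and memory (hypothetical model)."""
    name: str
    free_vcores: int
    free_mem_mb: int


@dataclass
class Container:
    """A slice of one node's CPU and RAM granted to an application."""
    app: str
    node: str
    vcores: int
    mem_mb: int


class ToyResourceManager:
    """Illustrative stand-in for a cluster-wide resource manager, not Yarn itself."""

    def __init__(self, nodes):
        self.nodes = list(nodes)
        self.containers = []

    def request(self, app, vcores, mem_mb):
        """Grant a container on the first node with enough free capacity, else None."""
        for node in self.nodes:
            if node.free_vcores >= vcores and node.free_mem_mb >= mem_mb:
                node.free_vcores -= vcores
                node.free_mem_mb -= mem_mb
                container = Container(app, node.name, vcores, mem_mb)
                self.containers.append(container)
                return container
        return None

    def release(self, container):
        """Return a finished container's resources to the node it ran on."""
        for node in self.nodes:
            if node.name == container.node:
                node.free_vcores += container.vcores
                node.free_mem_mb += container.mem_mb
        self.containers.remove(container)


if __name__ == "__main__":
    rm = ToyResourceManager([Node("node-1", 4, 8192), Node("node-2", 4, 8192)])
    batch = rm.request("mapreduce-job", vcores=2, mem_mb=4096)   # fits on node-1
    stream = rm.request("streaming-app", vcores=3, mem_mb=4096)  # needs node-2
    etl = rm.request("etl-app", vcores=4, mem_mb=8192)           # no node has room yet
    print(batch, stream, etl, sep="\n")
```

In this sketch two different applications hold containers at the same time, and a third has to wait until capacity is released, which is the kind of mixed workload the article describes. A real scheduler would add queues, priorities and per-application negotiation that are left out here.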
The advantages are multifold. For one, Hadoop is adding functionality to run multiple applications at once. Second, developers can now write apps to Yarn specifications and be assured that they'll work in a Hadoop system. MapReduce can also now focus on its core functionality instead of managing resources for bolt-on apps.
Hadoop backers expect that the advent of Yarn could open the floodgates for new applications built to run on Hadoop. Already some projects, like Apache Tez, have been created to do more advanced data processing than what MapReduce specializes in. Tez uses real-time analytics and in-memory processing for higher-speed queries, for example. Many more applications are expected for streaming analytics. Twitter Storm is one, and ETL (extract, transform and load) apps could be integrated as well.