The updated version of its Hadoop distribution represents the culmination of its Stinger Initiative for interactive SQL query, and it layers on a host of new enterprise features such as data governance, security, stream processing and search capabilities.
"For Hadoop to function as a true enterprise data platform, there are certain requirements that must be met," says Jim Walker, director of product marketing at Hortonworks. "There's a very clear definition among practitioners on what that requires: data governance, data access, data management, security and operations. HDP 2.1 packages all this together into a single package, which is enterprise Hadoop."
HDP 2.1 packages the latest stable builds of a whole host of Apache open source projects. For interactive SQL query in Hadoop, it delivers Apache Hive 0.13, the final phase of the Stinger Initiative community effort to deliver interactive SQL query at petabyte scale in Hadoop.
For the past 13 months, the Apache Hive community has been focused on the initiative, adding more than 390,000 lines of code to Hive from 145 developers hailing from more than 45 unique organizations, including Microsoft, Teradata and SAP.
Walker notes that Apache Hive 0.13 delivers a 100X increase in SQL query performance, enables interactive query at petabyte scale across a wide range of complex queries and joins, and supports an expanded range of SQL semantics for analytic applications running in Hadoop.
In the data governance and security spheres, HDP 2.1 draws on Apache Falcon and Apache Knox. Falcon offers a data processing framework for governing and orchestrating data flows in and around Hadoop. It provides the key governance framework for acquiring and processing data sets, replicating and retaining data sets, redirecting data sets to non-Hadoop extensions, and maintaining an audit trail and lineage.
Knox extends perimeter security for Hadoop and is fully integrated with frameworks like LDAP and Active Directory for credential management. It provides a common place to perform authentication across Hadoop and all related projects.
In data processing, the updated platform includes two new processing engines: Apache Storm and Apache Solr. Storm provides real-time event processing for sensor and business activity monitoring. It's a key component of building a data lake architecture because it allows you to ingest millions of events per second, enabling fast query on petabytes of data.
Meanwhile, Solr, integrated with HDP as part of a deep technical partnership with LucidWorks, provides open source enterprise search: high-performance indexing and sub-second search times over billions of documents.
Finally, Apache Ambari, the framework for provisioning, managing and monitoring Apache Hadoop clusters, has been updated to version 1.5.1 in HDP 2.1, adding support for new data access engines, stack extensibility, pluggable views, rolling restarts, maintenance mode and more.