Databricks this week became the first company to make Apache Spark 2.0 generally available on its data platform.
The company, founded out of the UC Berkeley AMPLab by the team that created Apache Spark, says this latest release builds on what the community has learned in the past two years. It marks the first major release of open source Spark since the Spark 1.6 release in 2015.
"Since the release of Spark 1.0, we've spent countless hours listening to members of the Spark community and Databricks users to learn from a mix of praises and complaints," Reynold Xin, Databricks' chief architect and co-founder, said in a statement Tuesday. "Spark 2.0 builds on what the community has learned, doubling down on what users love and improving on what users lament."
A MapReduce option
Spark, a top-level Apache project that has become an increasingly popular alternative compute engine to MapReduce for powering big data applications, leverages in-memory primitives to improve performance over MapReduce for certain applications. It is well-suited to machine learning algorithms and interactive analytics.
The company launched a preview release of Apache Spark 2.0 on Databricks two months ago, and says 10 percent of clusters on the platform are already using the latest release.
The company outlined some of the major new features:
- Speed. Databricks says Spark 2.0 is five to 10 times faster than Spark 1.6 for some operators, thanks to whole-stage code generation in Tungsten Phase 2 and code optimization in Catalyst.
- Simplicity. The new release unifies developer APIs across Spark libraries, including DataFrames and Datasets.
- Structured streaming. Spark 2.0 lays the foundation for continuous applications by providing high-level declarative streaming APIs that work on real-time data. The APIs are based on DataFrames and Datasets and are built atop Spark SQL.
- Machine learning model persistence. The release supports saving and loading pipelines and models across all programming languages supported by Spark.
- DataFrame-based machine learning APIs. Databricks says that with Spark 2.0, the spark.ml package, with its "pipeline" APIs, will emerge as the primary machine learning API. The original spark.mllib package is preserved in the new release, but Databricks says future development will focus on the DataFrame-based API.
- Standard SQL support. Spark 2.0 expands Spark's SQL capabilities with SQL:2003 features, introduces a new ANSI SQL parser and supports scalar and predicate-type subqueries.
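To give a sense of what whole-stage code generation means in the speed item above, here is a minimal Python sketch. It is not Spark's implementation. It contrasts a pipeline in which each operator is a separate iterator with a single fused loop of the kind a code generator could emit for the same query. The function names and the sample query (keep even values, double them, sum the result) are hypothetical and chosen only for illustration.

```python
from typing import Iterable, Iterator


# Operator-at-a-time pipeline: each operator is its own iterator, so every
# row is handed from one generator to the next before it reaches the sum.
def scan(rows: Iterable[int]) -> Iterator[int]:
    for row in rows:
        yield row


def filter_even(child: Iterator[int]) -> Iterator[int]:
    for row in child:
        if row % 2 == 0:
            yield row


def project_double(child: Iterator[int]) -> Iterator[int]:
    for row in child:
        yield row * 2


def run_iterator_pipeline(rows: Iterable[int]) -> int:
    return sum(project_double(filter_even(scan(rows))))


# Fused version: the same scan, filter and projection collapsed into one
# loop. This is the general shape whole-stage code generation aims for
# (an illustrative assumption, not generated Spark code).
def run_fused(rows: Iterable[int]) -> int:
    total = 0
    for row in rows:
        if row % 2 == 0:
            total += row * 2
    return total


data = range(10)
iterator_result = run_iterator_pipeline(data)
fused_result = run_fused(data)
```

Both functions compute the same answer. The fused loop avoids the per-row hand-offs between operators, which is the kind of overhead the approach is meant to reduce.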
"One of the things that's really exciting for me as a developer of Apache Spark is seeing how quickly users start to use new features and APIs we introduce, and in turn, offer almost instantaneous feedback, so that we can improve on them," Matei Zaharia, CTO and co-founder of Databricks and creator of Apache Spark, said in a statement Tuesday.