Apache Spark 2.0 is almost upon us. If you have an account on Databricks' cloud offering, you can get access to a technical preview today; for the rest of us, it may be a week or two, but by Spark Summit next month, I expect Apache Spark 2.0 to be out in the wild. [Editor's note: The preview release can now be downloaded from the Apache Spark site.] What should you look forward to?
During the 1.x series, the development of Apache Spark was often at a breakneck pace, with all sorts of features (ML pipelines, Tungsten, the Catalyst query planner) added along the way during minor version bumps. Given this, and that Apache Spark follows semantic versioning rules, you can expect 2.0 to make breaking changes and add major new features.
Unify DataFrames and Datasets
One of the main reasons for the new version number won't be noticed by many users: In Spark 1.6, DataFrames and Datasets are separate classes; in Spark 2.0, a DataFrame is simply an alias for a Dataset of type Row.
This may mean little to most of us, but such a big change in the class hierarchy means we're looking at Spark 2.0 instead of Spark 1.7. You can now get compile-time type safety for DataFrames in Java and Scala applications and use both the typed methods (map, filter) and the untyped methods (select, groupBy) on DataFrames and Datasets alike.
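As a quick sketch of what that unification looks like in practice (the Person case class and the data here are hypothetical, purely for illustration):

```scala
import org.apache.spark.sql.SparkSession

// Hypothetical case class for illustration
case class Person(name: String, age: Int)

val spark = SparkSession.builder
  .appName("typed-untyped-demo")
  .master("local[*]")
  .getOrCreate()
import spark.implicits._

// A Dataset[Person] carries compile-time type information...
val people = Seq(Person("Ada", 36), Person("Grace", 45)).toDS()

// ...so typed methods like filter and map are checked at compile time
val adults = people.filter(_.age >= 40)

// Untyped, column-based methods work on the same object;
// a DataFrame is just an alias for Dataset[Row]
val names = people.select("name")

adults.show()
names.show()
```

Because a DataFrame is now literally a Dataset[Row], the same object can flow through both styles of API without conversion.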
The all-new and improved SparkSession
A common question when working with Spark: "So, we have a SparkContext, a SQLContext, and a HiveContext. When should I use one and not the others?" Spark 2.0 introduces a new SparkSession object that reduces confusion and provides a consistent entry point for computation with Spark. Here's what creating a SparkSession looks like:
val sparkSession = SparkSession.builder
  .master("local")
  .appName("my-spark-app")
  .getOrCreate()
If you use the REPL, a SparkSession is automatically set up for you as spark. Want to read data into a DataFrame? Well, it should look somewhat familiar:
spark.read.json("JSON URL")
In another sign that operations using Spark's initial abstraction of Resilient Distributed Datasets (RDDs) are being de-emphasized, you'll need to get at the underlying spark.sparkContext to create RDDs. Once again, RDDs aren't going away, but the preferred DataFrame paradigm is becoming more and more prevalent, so if you haven't worked with them yet, you will soon.
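For instance, a minimal sketch (assuming a SparkSession named spark, as you'd get in the REPL):

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder
  .appName("rdd-demo")
  .master("local[*]")
  .getOrCreate()

// The SparkContext is now reached through the session
val rdd = spark.sparkContext.parallelize(Seq(1, 2, 3, 4))
val doubled = rdd.map(_ * 2).collect()
// doubled: Array(2, 4, 6, 8)
```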
For those of you who have jumped into SparkSQL with both feet and discovered that sometimes you had to fight the query engine, Spark 2.0 has some extra goodies for you as well. There's a new SQL parsing engine that includes support for subqueries and many SQL:2003 features (though it doesn't claim full support yet), which should make porting legacy SQL applications to Spark a much more pleasant affair.
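As a sketch of the sort of query the new parser handles, here's an uncorrelated scalar subquery in a WHERE clause, run against a small hypothetical sales table:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder
  .appName("subquery-demo")
  .master("local[*]")
  .getOrCreate()
import spark.implicits._

// Hypothetical table for illustration
Seq(("widgets", 100), ("gadgets", 250), ("gizmos", 75))
  .toDF("product", "revenue")
  .createOrReplaceTempView("sales")

// A scalar subquery in the WHERE clause -- the kind of standard SQL
// construct the new parser accepts
val aboveAverage = spark.sql(
  """SELECT product, revenue
    |FROM sales
    |WHERE revenue > (SELECT AVG(revenue) FROM sales)""".stripMargin)

aboveAverage.show()
```

Queries like this previously had to be rewritten as joins or computed in two steps; being able to paste them in as-is is what makes porting legacy SQL less painful.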