Structured Streaming is likely to be the new feature that everybody is excited about in the weeks and months to come. With good reason! I went into a lot of detail aboutwhat Structured Streaming is a few weeks ago, but as a quick recap, Apache Spark 2.0 brings a new paradigm for processing streaming data, moving away from the batched processing of RDDs to a concept of a DataFrame without bounds.
This will make certain types of streaming scenarios like change-data-capture and update-in-place much easier to implement -- and allow windowing on time columns in the DataFrame itself instead of when new events enter the streaming pipeline. This has been a long-running thorn in Spark Streaming's side, especially in comparison to competitors like Apache Flink and Apache Beam, so this addition alone will make many happy to upgrade to 2.0.
Much effort has been spent on making Spark run faster and smarter in 2.0. The Tungsten engine has been augmented with bytecode optimizers that borrow techniques from compilers to reduce function calls and keep the CPU occupied efficiently during processing.
Parquet support has been improved, resulting in a 10-fold speed-up in some cases, and the use of Encoders over Java or Kryo serialization, first seen in Spark 1.6, continues to reduce memory usage and increase throughput in your cluster.
If you're expecting big changes in the machine learning and graphing side of Spark, you might be a touch disappointed. The important change to Spark's machine learning offerings is that development in the
spark.mlliblibrary is frozen. You should instead use the DataFrame-based API in
spark.ml, which is where development will be concentrated going forward.
Spark 2.0 brings full support for model and ML pipeline persistence across all of its supported languages and makes more of the MLLib API available to Python and R for all of your data scientists who recoil in terror from Java or Scala.
As for GraphX, it seems to be a bit unloved in Spark 2.0. Instead, I'd urge you to keep an eye on GraphFrames. Currently a separate release from the main distribution, this builds a graph processing framework on top of DataFrames that is accessible from Java, Scala, Python, and R. I wouldn't be surprised if this UC Berkeley/MIT/Databricks collaboration finds its way into Spark 3.0.
Say hello, wave good-bye
Of course, a new major version number is a great time to make breaking changes. Here are a couple of changes that may cause issues:
- Dropping support for versions of Hadoop prior to 2.2
- Removing the Bagel graphing library (the pre-cursor to GraphX)
An important deprecation that you will almost certainly run across is the renaming of
registerTempTable in SparkSQL. You should use
createTempView instead, which makes it clearer that you're not actually materializing any data with the API call. Expect a gaggle of deprecation notices in your logs from this change.
Sign up for CIO Asia eNewsletters.