Apache Spark, the extremely popular data analytics execution engine, was initially released in 2012. Support did not really pick up until 2015, but by November of that year Spark was seeing 50 percent more activity than the core Apache Hadoop project itself, with more than 750 contributors from hundreds of companies participating in its development in one form or another.
Spark is a hot new commodity for a reason. Its performance, general-purpose applicability, and programming flexibility combine to make it a versatile execution engine. Yet that versatility also leads to varying levels of support for Spark and to different ways of delivering solutions built on it.
When evaluating analytics software that supports Spark, customers should look closely under the hood and examine four key facets of how that support is implemented:
- How Spark is utilized inside the platform
- What you get in a packaged product that includes Spark
- How Spark is exposed to you and your team
- How you perform analytics with the different Spark libraries
Spark can be used as a developer tool via its APIs, or it can be used by BI tools via its SQL interface. Spark can also be embedded in an application, which gives business users access without requiring programming skills and without restricting Spark's capabilities to what a SQL interface can express. I examine each of these options below and explain why not all Spark support is the same.
Programming on Spark
If you want the full power of Spark, you can program directly against its processing engine, which exposes APIs in Java, Python, Scala, and R. In addition to stream and graph processing components, Spark offers a machine-learning library (MLlib) as well as Spark SQL, which lets data tools connect to a Spark engine and query structured data, and lets programmers access data through SQL queries they write themselves.
A number of vendors offer standalone Spark implementations; the major Hadoop distribution suppliers also offer Spark within their platforms. Access is exposed through either a command line or a notebook interface.
But performing analytics on core Spark with its APIs is a time-consuming, programming-intensive process. While Spark offers an easier programming model than, say, native Hadoop, it still requires developers.
Even for organizations with developer resources, tying those developers up in lengthy data analytics projects may amount to an intolerable hidden cost. For many organizations, programming on Spark is not a practical option for this reason.
BI on Spark
Spark SQL is a standards-based way to access data in Spark. It has been relatively easy for BI products to add support for Spark SQL to query tabular data in Spark.