If this sounds a lot like Jupyter for Python, that is one of the metaphors IBM had in mind -- and in fact, Jupyter notebooks are a supported format. What's new is that IBM is trying to expose Spark (and the rest of its service mix) in a way that complements Spark's vaunted qualities: its overall ease of use and the lower barrier to entry it offers prospective data scientists.
Cloud data warehouse startup Snowflake is making Spark a standard-issue component as well. Its original mission was to provide analytics and data warehousing that spared the user the hassle of micromanaging setup and administration. Now, it's giving Spark the same treatment: Skip the setup hassles and enjoy a self-managing data repository that can serve as a source for, or recipient of, Spark processing. Data can be streamed into Snowflake by way of Spark or extracted from Snowflake and processed by Spark.
Spark lets Snowflake users interact with their data in the form of a software library rather than a specification like SQL. This plays to Snowflake's biggest selling point -- automated management of scaling data infrastructure -- rather than merely providing another black-box SQL engine.
With Databricks, the commercial outfit that spearheads Spark development and offers its own hosted platform, the question has always been how it can distinguish itself from other platforms where Spark is a standard-issue element. The current strategy: Hook 'em with convenience, then sell 'em on sophistication.
Thus, Databricks recently rolled out the Community Edition, a free tier for those who want to get to know Spark but don't want to monkey around with provisioning clusters or tracking down a practice data set. Community Edition provides a 6GB microcluster (it times out after a certain period of inactivity), a notebook-style interface, and several sample data sets.
Once people feel like they have a leg up on Spark's workings, they can graduate to the paid version and continue using whatever data they've already migrated into it. In that sense, Databricks is attempting to capture an entry-level audience -- a pool of users likely to grow with Spark's popularity. But the hard part, again, is fending off competition. And as Spark is open source, it's inherently easier for someone with far more scale and a far greater existing customer base to take all that away.
If there's one consistent theme among these moves, especially as Spark 2.0 looms, it's that convenience matters. Spark caught on because it made working with gobs of data far less ornery than the MapReduce systems of yore. The platforms that offer Spark as a service all have to assume their mission is twofold: Realize Spark's promise of convenience in new ways -- and assume someone else is also trying to do the same, only better.