In his keynote at Spark Summit 2014 in San Francisco today, Databricks CEO Ion Stoica unveiled Databricks Cloud, a cloud platform built around the Apache Spark open source processing engine for big data.
Spark, which reached its version 1.0 release just one month ago, is a cluster computing framework designed to sit on top of the Hadoop Distributed File System (HDFS) in place of Hadoop MapReduce. With support for in-memory cluster computing, Spark can achieve performance up to 100x faster than Hadoop MapReduce when working in memory, or 10x faster on disk.
Spark can be an excellent compute engine for data processing workflows, advanced analytics, stream processing and business intelligence/visual analytics. But Spark clusters can be difficult beasts to manage, Stoica says. Databricks hopes to change that with Databricks Cloud, a hosted, turnkey platform.
"Getting the full value out of their big data investments is still very difficult for organizations," Stoica says. "Clusters are difficult to set up and manage, and extracting value from your data requires you to integrate a hodgepodge of disparate tools, which are themselves hard to use. Our vision at Databricks is to dramatically simplify big data processing and free users to focus on turning data into value. Databricks Cloud delivers on this vision by combining the power of Spark with a zero-management hosted platform and an initial set of applications built around common workflows."
Databricks Cloud provides support for interactive queries (via Spark SQL), streaming data (Spark Streaming), machine learning (MLlib) and graph computation (GraphX) natively, with a single API across the entire data pipeline. Stoica says that provisioning new Spark clusters is a snap: just specify the desired capacity of the cluster, and the platform handles everything else: provisioning servers on the fly, streamlining the import and caching of data, managing security, and patching and updating Spark.
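To give a sense of what that single API looks like, here is a minimal sketch in Spark 1.0-era Scala, as one might run it in a Spark shell. It combines the core RDD API with a Spark SQL query over the same data; the file path, table name and `WordCount` case class are illustrative assumptions, not part of Databricks' announcement.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

val conf = new SparkConf().setAppName("PipelineSketch")
val sc = new SparkContext(conf)

// Batch processing with the core RDD API (hypothetical HDFS path)
val lines = sc.textFile("hdfs:///data/events.txt")
val counts = lines.flatMap(_.split(" ")).map((_, 1)).reduceByKey(_ + _)

// Interactive queries with Spark SQL over the same in-memory data
val sqlContext = new SQLContext(sc)
import sqlContext._  // brings in the implicit RDD-to-SchemaRDD conversion

case class WordCount(word: String, count: Int)
val table = counts.map { case (w, c) => WordCount(w, c) }
table.registerAsTable("word_counts")

val top = sql("SELECT word, count FROM word_counts ORDER BY count DESC LIMIT 10")
```

The point of the unified API is that `counts` never leaves the cluster: the same dataset feeds batch transformations, SQL queries, and (not shown) MLlib or GraphX computations without an export/import step between tools.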
The platform comes with three built-in applications:
Notebooks. A rich interface for performing data discovery and exploration, Notebooks can plot results interactively, execute entire workflows as scripts, and support advanced collaboration features.
Dashboards. Dashboards lets users create and host dashboards by picking any outputs from previously created notebooks, then assembles those outputs into a one-page dashboard with a WYSIWYG editor that can be published to a broader audience.
Job Launcher. The Job Launcher application lets anyone run arbitrary Apache Spark jobs and trigger their execution, simplifying the process of building data products.
"One of the common complaints we heard from enterprise users was that big data is not a single analysis; a true pipeline needs to combine data storage, ETL, data exploration, dashboards and reporting, advanced analytics and creation of data products," Stoica says. "Doing that with today's technology is incredibly difficult. We built Databricks Cloud to enable the creation of end-to-end pipelines out of the box while supporting a full spectrum of Spark applications for enhanced and additional functionality. It was designed to appeal to a whole new class of users who will adopt big data now that many of the complexities of using it have been eliminated."