Aiming to provide data engineers with new and better tools for creating production data pipelines, Databricks yesterday released Databricks for Data Engineering, a new version of its Apache Spark-based cloud platform optimized specifically for data engineering workloads.
Databricks, founded by the creators of Apache Spark, already provides a version of the cloud platform geared toward supporting data science workloads. But Databricks CEO and Co-founder Ali Ghodsi says the overwhelming majority of the company's nearly 500 enterprise customers and 50,000 community edition users are seeking to combine SQL, structured streaming, ETL and machine learning workloads running on Spark to deploy data pipelines into production.
Cleaning fuzzy data
"What they really are doing is taking data that is maybe skewed, fuzzy, maybe has errors in it, and they're using Spark to create a pipeline that cleans the data and puts it in structured form," Ghodsi says. "That's really the main use case that we saw. They're using the interactive APIs to explore their data sets, but once they explore it, they're turning it into production data pipelines where there's no human in the loop."
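To make that use case concrete, here is a minimal sketch of a cleaning step of the kind Ghodsi describes. It is written in plain Python rather than Spark, and it is not Databricks code: the field names (`name`, `amount`), the helper functions and the sample rows are all hypothetical, chosen only to illustrate turning messy input into structured records with no human in the loop.

```python
from typing import Dict, Iterable, List, Optional


def clean_record(raw: Dict) -> Optional[Dict]:
    """Return a structured record, or None if the row cannot be repaired."""
    # Normalize whitespace and capitalization in the (hypothetical) name field.
    name = (raw.get("name") or "").strip().title()

    # Strip common formatting noise from the (hypothetical) amount field.
    amount_text = str(raw.get("amount", "")).replace(",", "").replace("$", "").strip()
    try:
        amount = float(amount_text)
    except ValueError:
        return None  # unparseable amount: drop the row

    if not name:
        return None  # missing name: drop the row

    return {"name": name, "amount": amount}


def run_pipeline(rows: Iterable[Dict]) -> List[Dict]:
    """Clean every row and keep only those that could be repaired."""
    cleaned = (clean_record(row) for row in rows)
    return [row for row in cleaned if row is not None]


# Illustrative input containing the kinds of errors a pipeline must handle.
raw_rows = [
    {"name": "  alice smith ", "amount": "1,200.50"},
    {"name": "BOB JONES", "amount": "$300"},
    {"name": "", "amount": "42"},
    {"name": "carol", "amount": "n/a"},
]

structured = run_pipeline(raw_rows)
print(structured)
```

In a Spark pipeline the same logic would typically be expressed as DataFrame transformations and run on a schedule; the sketch only shows the shape of the work.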
Ghodsi says building these pipelines with Databricks for Data Engineering costs 50 percent to 75 percent less than building them with the existing Databricks offering.
Features of the new Databricks for Data Engineering offering include the following:
- Performance optimization. Databricks I/O (DBIO) technology provides a tuned and optimized version of Spark for a wide variety of instance types, in addition to an optimized AWS S3 access layer. Databricks says DBIO accelerates data exploration by up to 10x.
- Cost management. Cluster management capabilities, such as auto-scaling and AWS Spot instances, reduce operational costs by eliminating the time-consuming work of building, configuring and maintaining complex Spark infrastructure. "It automatically determines the best number of machines to compute your workload," Ghodsi says. "We've seen a lot of people have a lot of machines on all the time. They have a hard time figuring out how many machines they should be using for their workloads."
- Optimized integration. The platform provides a set of REST APIs to programmatically launch clusters and jobs and integrate tools or services ranging from Amazon Redshift and Amazon Kinesis to machine learning frameworks like Google's TensorFlow. An integrated data sources catalog makes the data sources immediately available to Databricks users without duplicating data ingest work.
- Enterprise security. Databricks for Data Engineering includes turnkey security features: SOC 2 Type 1 certification and HIPAA compliance; end-to-end data encryption; detailed logs accessible in AWS S3 for debugging; and IT admin capabilities such as Single Sign-On with SAML 2.0 support and role-based access controls for clusters, jobs and notebooks.
- Collaboration with data science. The platform is integrated with the data science workspaces in Databricks, enabling a seamless transition between data engineering and interactive data science workloads.
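The auto-scaling idea in the cost management item above can be illustrated with a short sketch. This is not Databricks' algorithm, which the article does not describe; it is a simple, assumed sizing rule in Python that picks a worker count from the amount of queued work and clamps it to configured limits. The function name and all parameters are hypothetical.

```python
import math


def target_workers(pending_tasks: int, tasks_per_worker: int,
                   min_workers: int, max_workers: int) -> int:
    """Pick a cluster size for the current backlog (assumed rule, for illustration)."""
    if tasks_per_worker <= 0:
        raise ValueError("tasks_per_worker must be positive")
    if min_workers > max_workers:
        raise ValueError("min_workers must not exceed max_workers")

    # Workers needed to cover the backlog, rounded up.
    needed = math.ceil(pending_tasks / tasks_per_worker)

    # Never go below the floor or above the ceiling.
    return max(min_workers, min(max_workers, needed))


# An idle cluster shrinks to the floor; a large backlog is capped at the ceiling.
print(target_workers(0, 8, 2, 20))
print(target_workers(100, 8, 2, 20))
print(target_workers(1000, 8, 2, 20))
```

The point of such a rule is the one Ghodsi makes: machines are added only while there is work for them, instead of being left on all the time.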