Do you have a data warehouse that stores more than 300 petabytes of data and struggle with the latency of queries? Well, few companies have data at Facebook's scale, but the performance of queries against your data warehouse is often still a serious productivity issue. Facebook has been developing a solution to that problem and today it offered that answer up to the open source community. It calls the answer Presto.
The code is available under the Apache 2.0 license.
"Facebook's warehouse data is stored in a few large Hadoop/HDFS-based clusters," writes Martin Traverso, software engineer at Facebook, in a blog post Wednesday.
"Hadoop MapReduce and Hive are designed for large-scale, reliable computation, and are optimized for overall system throughput. But as our warehouse grew to petabyte scale and our needs evolved, it became clear that we needed an interactive system optimized for low query latency."
The Magic of Presto
Presto supports standard ANSI SQL, including complex queries, aggregations, joins and window functions. The engine was designed with a simple storage abstraction that, Traverso says, makes it easy to provide SQL query capability against HDFS, other well-known data stores like HBase and even custom systems like the Facebook News Feed backend. Storage plugins, which Facebook calls connectors, provide interfaces for fetching metadata, getting data locations and accessing the data itself.
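To make the connector idea concrete, here is a minimal sketch of a storage abstraction with the three responsibilities described above: fetching metadata, getting data locations and accessing the data. This is not Presto's actual connector API. The names Connector, Split, InMemoryConnector and count_rows are hypothetical and chosen only for illustration.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass
from typing import Dict, Iterator, List, Tuple

Row = Tuple[object, ...]


@dataclass(frozen=True)
class Split:
    """Hypothetical unit of data location: one chunk of one table."""
    table: str
    location: str


class Connector(ABC):
    """Hypothetical storage plugin interface (not Presto's real API)."""

    @abstractmethod
    def get_columns(self, table: str) -> List[str]:
        """Metadata: return the column names of a table."""

    @abstractmethod
    def get_splits(self, table: str) -> List[Split]:
        """Data locations: return the chunks that make up a table."""

    @abstractmethod
    def read_split(self, split: Split) -> Iterator[Row]:
        """Data access: yield the rows stored in one chunk."""


class InMemoryConnector(Connector):
    """Toy backend that keeps tables in a dict instead of HDFS or HBase."""

    def __init__(self, tables: Dict[str, Tuple[List[str], Dict[str, List[Row]]]]):
        # tables maps name -> (column names, {location: rows})
        self._tables = tables

    def get_columns(self, table: str) -> List[str]:
        return list(self._tables[table][0])

    def get_splits(self, table: str) -> List[Split]:
        return [Split(table, loc) for loc in sorted(self._tables[table][1])]

    def read_split(self, split: Split) -> Iterator[Row]:
        return iter(self._tables[split.table][1][split.location])


def count_rows(connector: Connector, table: str) -> int:
    """Engine-side code that only ever talks to the Connector interface."""
    return sum(
        1
        for split in connector.get_splits(table)
        for _ in connector.read_split(split)
    )


demo = InMemoryConnector(
    {"users": (["id", "name"], {"node-a": [(1, "ann"), (2, "bob")], "node-b": [(3, "cy")]})}
)
print(demo.get_columns("users"), count_rows(demo, "users"))
```

Because count_rows depends only on the interface, a different backend could in principle be swapped in by writing another Connector subclass, which is the property Traverso attributes to the storage abstraction.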
"Presto is 10x better than Hive/MapReduce in terms of CPU efficiency and latency for most queries at Facebook," Traverso says.
"Presto is amazing," says Chris Gutierrez, data scientist at Airbnb, which is among the small number of external companies with which Facebook has already shared the Presto code and binaries. "A lead engineer got it into production in just a few days. It's an order of magnitude faster than Hive in most of our use cases. It reads directly from HDFS, so unlike Redshift, there isn't a lot of ETL before you can use it. It just works."
"We're really excited about Presto," adds Fred Wulff, a software engineer at Dropbox, which has also been testing the engine. "We're planning on using it to quickly gain insight about the different ways our users use Dropbox, as well as diagnosing problems they encounter along the way. In our tests so far it's been rock solid and extremely fast when applied to some of our most important ad-hoc use cases."
Facebook currently has more than one billion active users who generate a never-ending stream of data. The company operates one of the largest data warehouses in the world. It stores more than 300 petabytes of data used for applications that range from traditional batch processing to graph analytics, machine learning and real-time interactive analytics.