According to a new benchmark, the three leading SQL-on-Hadoop engines — Apache Impala 2.3, Apache Spark 1.6 and Apache Hive 1.2 — all have unique strengths and weaknesses that make them well-suited to some Business Intelligence (BI) use cases and less suited to others.
"The conclusions really are that one engine does not meet all requirements," says Dave Mariani, CEO and founder of AtScale, a startup specializing in enabling BI on Hadoop. "What we have done in our deployments, for our customers, is plug in multiple engines."
For the Business Intelligence on Hadoop benchmark, AtScale set out to help technology evaluators select the best SQL-on-Hadoop technology for their BI use cases. AtScale's testing team used the Star Schema Benchmark (SSB) data set, based on the widely used TPC-H data and modified to more accurately represent a typical BI-oriented data layout. The data set allowed the team to run queries across large tables: the lineorder table contains close to 6 billion rows, and the large customer table contains over a billion rows.
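The SSB workload centers on exactly this kind of layout: a large fact table (lineorder) joined to smaller dimension tables such as customer. As a minimal illustration of the query shape being benchmarked, here is a toy star-schema aggregation using Python's built-in sqlite3; the column names and data are illustrative assumptions, not the actual SSB schema or scale.

```python
import sqlite3

# Toy star schema loosely modeled on SSB: a "lineorder" fact table
# joined to a "customer" dimension. Names and values are illustrative.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE customer (c_custkey INTEGER PRIMARY KEY, c_region TEXT)")
cur.execute("CREATE TABLE lineorder (lo_custkey INTEGER, lo_revenue INTEGER)")
cur.executemany("INSERT INTO customer VALUES (?, ?)",
                [(1, "AMERICA"), (2, "ASIA"), (3, "AMERICA")])
cur.executemany("INSERT INTO lineorder VALUES (?, ?)",
                [(1, 100), (2, 250), (3, 50), (1, 300)])

# A typical BI-style query: total revenue per customer region,
# i.e. a fact-to-dimension join followed by a grouped aggregation.
cur.execute("""
    SELECT c.c_region, SUM(lo.lo_revenue) AS revenue
    FROM lineorder lo
    JOIN customer c ON lo.lo_custkey = c.c_custkey
    GROUP BY c.c_region
    ORDER BY revenue DESC
""")
rows = cur.fetchall()
print(rows)  # [('AMERICA', 450), ('ASIA', 250)]
```

At benchmark scale the same join-and-aggregate pattern runs against billions of fact rows, which is what stresses the SQL-on-Hadoop engines under test.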
SQL-on-Hadoop engines have different 'sweet spot' workloads
Mariani explains that AtScale looked at three key requirements to evaluate the SQL-on-Hadoop engines and their fitness to satisfy BI workloads:
- Performs on big data. SQL-on-Hadoop engines must be able to consistently analyze billions or trillions of rows of data without generating errors, with response times on the order of tens or hundreds of seconds.
- Fast on small data. The engine must deliver interactive performance on known query patterns, returning results in no more than a few seconds on small data sets (on the order of thousands to millions of rows).
- Stable for many users. Enterprise BI user bases consist of hundreds or even thousands of data workers. The underlying SQL-on-Hadoop engine must perform reliably under highly concurrent analysis workloads.
Mariani, who led the effort to build what may have been the world's largest OLAP cube for BI at Yahoo!, says he believes these three criteria are representative of the primary requirements the average enterprise doing BI on Hadoop will have to meet. The criteria were drawn from the test team's experience working with a large number of companies in financial services, healthcare, retail, telecommunications and other industries.
"We used real-world enterprise experience to produce a document that every technical evaluator can use as part of their evaluation process," adds Josh Klahr, vice president of Product Management at AtScale.
The test team found that all three engines passed the tests and are stable enough to support BI workloads, but no single engine fits all needs. Each has its own "sweet spot," and enterprises are likely to find that blending multiple engines best meets their goals.