While Hive is generally considered the default for SQL-on-Hadoop, it was far and away the slowest of the engines in the benchmark, making it poorly suited to interactive queries.
"If you want to use Hive Tez as your interactive query engine exclusively, the best you're going to do is 2.4 seconds," Mariani says.
But while it may be slow, Hive is also the most stable of the three engines, with the best consistency across multiple query types.
"Hive Tez is the tortoise," Mariani adds. "It will always finish the race, but not in a spectacular, speedy fashion. It's the most reliable."
Impala and Spark, on the other hand, were at their best when it came to smaller data sets. Impala topped Spark across a gamut of workloads, but Mariani notes that Spark 1.6 was a vast performance improvement over Spark 1.5 and he expects that trend to continue as Spark has drawn a large open source community focused on its development. Cloudera recently proposed donating Impala to the Apache Software Foundation, which could also lend additional momentum to its development.
For now, Impala is king for use cases that involve large numbers of concurrent users.
"Impala kicks butt when it comes to concurrency," Mariani says. "If you're going to have a whole bunch of users running small, fast queries, Impala is a much better choice than Spark would be."
"If speed is not a priority, but stability and reliability is, I would choose to Use Hive Tez as my data pipeline engine," he adds. "For those big batch workloads I would choose Hive Tez. If I wanted my BI users to get access to my warehouse, I would choose to use Spark or Impala."
Mariani notes that while the team didn't benchmark other engines such as Apache Drill or Presto, they will next time.
"You never know between release and release who's going to be the better horse to bet on," he says.