The dialect of SQL used by Spark is similar to that of Apache Hive, making Spark SQL akin to earlier SQL-on-Hadoop technologies.
Although Spark SQL uses the Spark engine behind the scenes, it suffers from the same disadvantages as Hive and Impala: Data must be in a structured, tabular format to be queried.
This forces Spark to be treated as if it were a relational database, which undercuts many of the advantages of a big data engine. In short, putting BI on top of Spark requires transforming the data into a reasonable tabular format that BI tools can consume.
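To make the cost of that transformation concrete, here is a minimal sketch in plain Python (not Spark code) of what flattening a nested record into tabular rows involves. The `order` record, its field names, and the `flatten` and `to_rows` helpers are hypothetical, invented purely for illustration; a real pipeline would apply the same idea at scale inside whatever engine prepares the data.

```python
from typing import Any, Dict, Iterator


def flatten(record: Dict[str, Any], parent_key: str = "", sep: str = "_") -> Dict[str, Any]:
    """Collapse nested dictionaries into a single level of column names."""
    flat: Dict[str, Any] = {}
    for key, value in record.items():
        column = f"{parent_key}{sep}{key}" if parent_key else key
        if isinstance(value, dict):
            # Recurse so that {"customer": {"name": ...}} becomes "customer_name".
            flat.update(flatten(value, column, sep))
        else:
            flat[column] = value
    return flat


def to_rows(record: Dict[str, Any], list_field: str) -> Iterator[Dict[str, Any]]:
    """Emit one flat row per element of the nested list stored in `list_field`."""
    base = {k: v for k, v in record.items() if k != list_field}
    for item in record.get(list_field, []):
        row = flatten(base)
        row.update(flatten({list_field: item}))
        yield row


# Hypothetical semi-structured record, e.g. one JSON document from an order feed.
order = {
    "order_id": 1001,
    "customer": {"name": "Acme", "region": "APAC"},
    "items": [
        {"sku": "A-1", "qty": 2},
        {"sku": "B-7", "qty": 1},
    ],
}

rows = list(to_rows(order, "items"))
for row in rows:
    print(row)
```

Each resulting row has the same flat set of columns (`order_id`, `customer_name`, `customer_region`, `items_sku`, `items_qty`), which is the shape a BI tool expects. Note that the order-level fields are repeated on every row, one of the trade-offs of forcing nested data into a table.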
Another way to leverage Spark is to abstract away its complexity by embedding it deep into a product and taking full advantage of its power behind the scenes. This allows users to leverage the speed and power of Spark without needing developers.
This architecture brings up three key questions. First, does the platform truly hide all of the technical complexities of Spark? As a customer, you need to examine every aspect of how you would create each step of the analytic cycle -- integration, preparation, analysis, visualization, and operationalization.
A number of products offer self-service capabilities that abstract away Spark's complexities, but others force the analyst to dig down and code -- for example, in performing integration and preparation. These products may also require you to first ingest all your data into the Hadoop file system for processing.
This adds extra length to your analytic cycles, creates fragile and fragmented analytic processes, and requires specialized skills.
Second, how does the platform take advantage of Spark? It's critical to understand how Spark is used in the execution framework. Spark is sometimes embedded in a fashion that does not have the full scalability of a true cluster. This can limit overall performance as the volume of analytic jobs increases.
Third, how are you protected for the future? The strength of being tightly coupled with the Spark engine is also a weakness. The big data industry moves quickly. MapReduce was the predominant engine in Hadoop for six years. Apache Tez became mainstream in 2013, and now Spark has become a major engine.
Assuming the technology curve continues to produce new engines at the same rate, Spark will almost certainly be supplanted by a new engine within 18 months, forcing products tightly coupled to Spark to be reengineered -- a far from trivial undertaking. Even with that effort put aside, you must consider whether the redesigned product will be fully compatible with what you've built in the older version.
The first step to uncovering the full power of Spark is to understand that not all Spark support is created equal. It's crucial that organizations grasp the differences in Spark implementations and what each approach means for their overall analytic workflow. Only then can they make a strategic buying decision that will meet their needs over the long haul.