First, it's important to note that Hadoop and Spark are broadly different technologies with different use cases. The Apache Software Foundation, from which both projects emerged, even places the two in different categories: Hadoop is listed as a database technology, Spark as a big data tool.
In Apache's own words, Hadoop is a "distributed computing platform": "A framework that allows for the distributed processing of large data sets across clusters of computers using simple programming models. It is designed to scale up from single servers to thousands of machines, each offering local computation and storage. Rather than rely on hardware to deliver high-availability, the library itself is designed to detect and handle failures at the application layer."
In most cases, when someone talks about Hadoop they mean the Hadoop Distributed File System (HDFS), which is "a distributed file system that provides high-throughput access to application data". Alongside it sit Hadoop YARN, a job-scheduling and cluster resource-management tool, and Hadoop MapReduce, a framework for parallel processing of large data sets.
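To make the MapReduce model concrete, here is a minimal in-process sketch of its three phases — map, shuffle and reduce — applied to a word count. This is plain Python for illustration only: a real Hadoop job would implement mapper and reducer classes against the Hadoop Java API and run distributed across a cluster, with HDFS supplying the input splits.

```python
from collections import defaultdict

def map_phase(lines):
    # Map: emit a (key, value) pair for every word in every input line.
    for line in lines:
        for word in line.split():
            yield (word.lower(), 1)

def shuffle_phase(pairs):
    # Shuffle: group all values by key, as Hadoop does between map and reduce.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    # Reduce: combine each key's values into a single result (here, a sum).
    return {key: sum(values) for key, values in groups.items()}

lines = ["big data big clusters", "big data"]
counts = reduce_phase(shuffle_phase(map_phase(lines)))
print(counts)  # {'big': 3, 'data': 2, 'clusters': 1}
```

The point of the model is that each phase is independently parallelisable: mappers never talk to each other, and each reducer only needs the values for its own keys.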
Spark, on the other hand, is: "A fast and general engine for large-scale data processing. It offers high-level APIs in Java, Scala and Python as well as a rich set of libraries including stream processing, machine learning, and graph analytics."
So how do they come together?
Both are big data frameworks. Basically, if you're a company with a fast-growing pool of data, Hadoop is open source software that will allow you to store this data in a reliable and secure way. Spark is a tool for understanding that data. If Hadoop is the Bible written in Russian, Spark is a Russian dictionary and phrasebook.
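In Spark itself, "understanding the data" usually means chaining high-level transformations over a distributed dataset and then triggering an action to get a result. The snippet below mimics that style in plain Python with lazy generators — it is only an analogy to Spark's RDD/DataFrame API, not Spark code, and the sample records and threshold are invented for illustration.

```python
# A tiny, in-process analogy to Spark's chained transformations.
# Real Spark would partition the data and run these steps across a
# cluster; generators just reproduce the same lazy, composable style.

records = [
    {"user": "ana", "amount": 120},
    {"user": "bo", "amount": 35},
    {"user": "ana", "amount": 80},
]

# Transformations are lazy: nothing executes until an action asks for output.
big_orders = (r for r in records if r["amount"] >= 50)   # like .filter(...)
amounts = (r["amount"] for r in big_orders)              # like .map(...)

# The "action" that forces evaluation, analogous to .sum() or .collect().
total = sum(amounts)
print(total)  # 200
```

The lazy chaining matters in real Spark because it lets the engine plan the whole pipeline before touching any data, rather than materialising each intermediate step.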
You can run Spark for your big data projects on HDFS, or on another data store that offers a Spark connector, such as the NoSQL databases MongoDB and Couchbase, or the Teradata data warehouse.
Will Gardella, director of product management at Couchbase, says: "When it comes to Hadoop you don't have a single technology, it is a giant family of technologies and at the bottom you have the distributed file network that everyone loves. HDFS solves unglamorous, difficult problems well and it lets you store as much stuff as you want and not worry about stuff getting corrupted. That part people rely on."
The choice really comes down to what you want to do with your data and the skill set of your IT staff. Once your data is in Hadoop, there are lots of ways to extract value from it. You can go down the standard analytics route of plugging a tool into the data lake for data cleansing, querying and visualisation.
Big players in the analytics and business intelligence market like Splunk offer Hadoop-integrated products, and data-visualisation firms like Tableau will let you present this data back to non-data people.