Hadoop vs Spark: Which is right for your business? Pros and cons, vendors, customers and use cases

Scott Carey | July 7, 2016
It's important to note that Hadoop and Spark are broadly different technologies, with different use cases

Spark, on the other hand, is useful, as Gardella says, "if you want to give your staff access to genuine real-time data with the intention that they, or an algorithm, will make decisions off the back of that data."

If your data is simply a large amount of structured data, such as a database of medical records, then the streaming capabilities of Spark aren't strictly necessary.

Hadoop vs Spark: Pros and Cons

Reliability: One major benefit of Hadoop is that, because it is a distributed platform, it is less prone to failure, so the underlying data remains available. This is why many web companies choose it as their data platform: the internet never sleeps.
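To make the availability point concrete, here is a minimal toy sketch of how keeping several copies of each block on different machines lets data survive a node failure. It assumes a replication factor of three (the HDFS default) and a simple round-robin placement, which is a simplification for illustration and not HDFS's actual placement policy. The node names, block names and both functions are hypothetical.

```python
# Toy model of block replication across a cluster (illustrative only).
REPLICATION_FACTOR = 3  # assumption: matches the HDFS default setting


def place_blocks(block_ids, nodes, replication=REPLICATION_FACTOR):
    """Assign each block to `replication` distinct nodes, round-robin.

    Assumes replication <= len(nodes) so the copies land on distinct nodes.
    """
    placement = {}
    for i, block in enumerate(block_ids):
        placement[block] = [nodes[(i + r) % len(nodes)] for r in range(replication)]
    return placement


def readable_blocks(placement, failed_nodes):
    """Return the blocks that still have at least one copy on a live node."""
    return {
        block
        for block, holders in placement.items()
        if any(node not in failed_nodes for node in holders)
    }


nodes = ["node-a", "node-b", "node-c", "node-d"]      # hypothetical cluster
blocks = ["blk-1", "blk-2", "blk-3", "blk-4"]         # hypothetical data blocks
placement = place_blocks(blocks, nodes)

# Losing one machine leaves every block readable from another copy.
survivors = readable_blocks(placement, failed_nodes={"node-b"})
```

In this toy model, data only becomes unreadable once every node holding a given block has failed, which is why adding copies on separate machines improves availability.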

Cost: Hadoop and Spark are both projects from the Apache Software Foundation, so they are free and open source. The cost comes from how you choose to implement them: the total cost of ownership, the time and resources that implementation demands given the skills required, and the hardware. Being free to deploy also makes them highly scalable as your data lake grows.

The licensing model of traditional database providers like Oracle and SAP has long been the bane of many CIOs, so the Software-as-a-Service model provided by most of the Hadoop/Spark specialists gives greater flexibility while you figure out if the technology is useful.

Speed: Spark is reported to run up to 100 times faster than Hadoop MapReduce, according to the Apache Software Foundation. This is because Spark works in-memory rather than reading from and writing to hard drives. MapReduce reads data from the cluster, performs an operation and writes the results back to the cluster, which takes time, whereas Spark keeps the data in memory while it works through the process.
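The difference between the two processing styles can be sketched in plain Python. This is a toy illustration, not Spark or MapReduce code: the stage functions, file names and counters are hypothetical, and the example only shows where the extra disk round trips come from, not how large the speed-up would be on a real cluster.

```python
import json
import os
import tempfile


def map_stage(records):
    """First processing step: square every value."""
    return [x * x for x in records]


def filter_stage(records):
    """Second processing step: keep only even values."""
    return [x for x in records if x % 2 == 0]


STAGES = (map_stage, filter_stage)


def disk_pipeline(records, workdir):
    """MapReduce-style: every stage reads its input from disk and writes its output back."""
    path = os.path.join(workdir, "stage_0.json")
    with open(path, "w") as f:
        json.dump(records, f)

    round_trips = 0
    for i, stage in enumerate(STAGES, start=1):
        with open(path) as f:              # read the previous stage's output
            data = json.load(f)
        path = os.path.join(workdir, f"stage_{i}.json")
        with open(path, "w") as f:         # write this stage's output
            json.dump(stage(data), f)
        round_trips += 1

    with open(path) as f:
        return json.load(f), round_trips


def memory_pipeline(records):
    """In-memory style: intermediate results never leave memory."""
    data = records
    for stage in STAGES:
        data = stage(data)
    return data, 0


with tempfile.TemporaryDirectory() as workdir:
    disk_result, disk_trips = disk_pipeline(list(range(10)), workdir)
memory_result, memory_trips = memory_pipeline(list(range(10)))
```

Both pipelines produce the same answer. The disk-based version pays one read and one write per stage, and that overhead grows with the number of stages and the size of the data.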

Generality: Couchbase's Gardella says: "Spark can load data from every place: Couchbase, MySQL, Amazon S3, HDFS. All of the formats you expect load out of HBase. It makes Spark very versatile."

Skills: Whatever the vendors tell you, Spark is not an easy tool to use. It is intended for data analysts and experts and is generally applied to deeply complex and constantly changing streaming data sets.

Hadoop vs Spark: Use Cases

Because Hadoop can store ever-growing volumes of data, some classic Hadoop use cases include a 360-degree view of your customers, recommendation engines for retailers, and security and risk management.

Retailers and Internet of Things companies are also interested in Spark, however, because of its ability to conduct real-time interactive data analytics to deliver greater personalisation.

According to MongoDB's VP of strategy Kelly Stirman, Spark's growing popularity is due to its compatibility with one important use case: machine learning.

Stirman tells Computerworld UK: "Ten years into Hadoop and the hallmarks are still promising but most people have found it hard to use and not well suited to artificial intelligence and machine learning."

