When to use Hadoop (and when not to)

Chris Nerney | Aug. 14, 2014
Hadoop is a huge advancement in big data technology, but there are better choices for real-time analytics.

When enterprises interested in leveraging big data and analytics ask how to get started, they often are advised to begin with Hadoop, the Apache Software Foundation's open source data storage and processing framework.

There are a number of reasons why Hadoop is an attractive option. Not only does the platform offer both distributed storage and computational capabilities at a relatively low cost, it can also scale to meet the anticipated exponential increase in data generated by mobile technology, social media, the Internet of Things, and other emerging digital technologies.

These advantages, along with strong word of mouth and high-profile implementations by companies such as Facebook, Yahoo, and numerous Fortune 50 giants, are driving adoption of Hadoop.

Research firm Researchbeam in March forecast the global Hadoop market to grow to $50 billion in 2020 from $1.5 billion in 2012. Most of that money will be spent on services provided by commercial Hadoop specialists such as Cloudera, Hortonworks, and MapR Technologies.

But not all data scientists are climbing on board the Hadoop train. In fact, many have jumped off. In a recent survey of data scientists on the obstacles to big data analytics, vendor Paradigm4 reports that more than three-quarters (76%) of the scientists who said they have used Hadoop or Spark (a computational framework that can run on top of the Hadoop Distributed File System) cite "significant limitations" to their use.

Specifically, 39% of respondents said Hadoop takes too much effort to program, while 37% said it was "too slow for interactive, ad hoc queries." Another 30% knocked Hadoop as being too slow for real-time analytics. And more than one-third (35%) of data scientists surveyed who have used Hadoop and Spark said they have stopped using them.

Granted, this survey is from a vendor that's offering "more" than Hadoop. But the reasons given by respondents explaining their dissatisfaction with Hadoop are grounded in real issues rather than vendor hype.

Take response time. If you're looking to produce complex analytics or real-time analytics, Hadoop probably isn't the platform for you, explains Claudia Perlich, chief scientist for Dstillery, a marketing company that crunches web browsing data to help brands target ads.

For the part of Dstillery's business that delivers ads online, real-time analytics are essential. "That part," Perlich says, "we can't do with Hadoop."

"If I have 30 milliseconds to look up information in a database that has 300 million people, there's no way Hadoop can do it," she says. "It's not the technology for quick access."

However, Dstillery also performs analytical services for which response time takes a back seat to accuracy and long-term insights.

"All of our incoming data is dumped into Hadoop to use for building analytics," Perlich says. "We do a lot of predictive modeling, and this is where Hadoop is phenomenal, particularly the cost at which you can store everything and access it in reasonable time -- not real time, but reasonable time."

