Elephants never forget, they say, though I doubt pachyderms are the savants the proverb has led us to believe. I know a specific elephant, named Hadoop, that can't seem to remember the recent history of the EDW (enterprise data warehouse) market on which it's encroaching. Specifically, some in the Hadoop arena seem to be repeating the positioning overreach that long bedeviled that market.
I'm referring to the dubious notion that Hadoop can and should be the central consolidation hub for all your business's analytic data.
For years, before the big data era started in earnest, the EDW arena pushed this "all in one basket" notion. Though a single-version-of-the-truth repository for all analytical subject domains makes sense in the abstract, few customers saw any compelling need to spend the money, time, and resources to consolidate disparate analytic databases onto a single platform. Many companies have consolidated some core system-of-record data in EDWs, but it's still common to see enterprises dedicate tactical data warehouses, data marts, operational data stores, OLAP cubes, and other analytic databases to specific regions, lines of business, applications, and users.
Resistance to the concept of a single "enterprise data hub" will endure in the age of Hadoop. In fact, you can read that skepticism in the tone of Loraine Lawson's recent article on an equivalent dream — that of a Hadoop-centric "data lake." Lawson likens the concept to that of a "Big Rock Candy Mountain," a "data-centered architecture, where distributed computing comes trickling down the rock and they hung the jerk that invented data silos." Citing Edd Dumbill's "data lake" discussion, she says, "And to prove it's more than just a developer's dream, he points out that Google and Facebook developers 'live the dream fully.'"
I don't get the logic of Dumbill's statement. Doesn't pointing to developers confirm it is indeed just a developer's dream? And singling out developers at two firms closely tied to Hadoop's origins (Google's MapReduce and file system papers inspired it, and Facebook was among its earliest large-scale users), both of which have built their Web services on this style of platform, doesn't show that this dream lives outside Silicon Valley.
In fact, the zeitgeist among actual users in the big data era has begun to shift toward a "hybrid" deployment model that blends EDW, Hadoop, NoSQL, in-memory, and other data platforms within a heterogeneous, cloud-enabling infrastructure.
Within the context of a hybrid architecture, this "data lake" dream seems to be specific to one big data deployment role: an exploratory "sandbox" that is the data-consolidation and statistical-modeling hub for teams of data scientists who need to sift through petabytes of multi-structured data. Data scientists everywhere are flocking to Hadoop as their all-data "sandbox," as I previously discussed.