
Are you a data hoarder? Hadoop offers little choice

Andrew C. Oliver | Aug. 28, 2015
Data governance is one of the toughest, dreariest problems in computing. Sadly, the tools offered with the major Hadoop distributions aren't really up to the task.

Apache Atlas

Hortonworks is newer to the data governance game. It has proposed Apache Atlas, which was accepted into Apache's incubator -- sometimes, but not always, a sign of project maturity. The rise to a top-level Apache project is a very political process.

Atlas has high hopes, but it's pretty early on in its development. It integrates with Apache Ranger according to the README.txt, though that's the only use of the word "Ranger" in the whole source repository, and it isn't a lot of code. While Atlas is part of Hortonworks' recent 2.3 release, it's clearly an early cut, and probably not the core of your master-data-management or data governance initiative at this point.

The buyer's lament

With Sentry versus Ranger and Navigator versus Atlas, you're seeing a real split. On one hand, Cloudera offers a mature, more complete offering; on the other hand, it's proprietary and already diverging from the less mature, less-thought-out Sentry product. Hortonworks answers with an open source offering, but obviously, it integrates with its own preferred security implementation.

In other words, we're seeing a sort of Hadoop distribution lock-in with each new layer we add. Part of why we pick an open source technology is to put the choice back in the user's hands.

Neither Navigator nor Atlas is a particularly complete offering, and while it's nice that Navigator can work with existing data governance offerings such as Informatica, these have their own plug-ins, anyhow.

You have to ask: Do I need a Hadoop data governance solution or do I need a complete data governance solution that includes Hadoop? In many cases, I'd say the latter.

It would be nice to see full-on open source data governance software. But for now, if you look at a complete, mature, and proprietary tool like Collibra, which offers a complete vision, you're unlikely to be happy even with Navigator. It would probably be easier for Collibra to deepen its Hadoop integration and offer better data lineage than for Cloudera to make Navigator a more complete offering. If you're using a proprietary product anyhow, you might as well use a complete one that covers all of your data (and if you have a lot of it, you probably have Informatica anyhow).

Someday a complete open source data governance or master data management tool will emerge. But it can't be aligned with a single technology vertical. I mean, I don't really want Data Governance for Hadoop, Data Governance for MongoDB, Data Governance for Oracle and a freaking data lake project just to tie back together my metadata from my data governance tools.

The catch with many existing tools is that they are heavy duty and suited to bureaucratic organizations that hold long-winded data governance committee meetings. For organizations just getting into data governance, which simply need to stop digging, the implementation costs can be daunting.

 
