There's a bit of absurdity here. If you throw it away, you can't get it back; if you keep it, you can eventually organize and purge what you don't need. Those who store data now while getting their governance in place are not automatically "data hoarders." This is a false dilemma.
The idea that you need to come up with a perfect plan before keeping any data or bringing in any new sources is a little like saying we need perfect social justice for everyone before we can address police killings of African-Americans.
Instead, get started now. Stop throwing out the baby with the bathwater and begin finding your use cases. Meanwhile, make data the point rather than a side effect of your processes and govern it accordingly. These aren't "steps," but initiatives you need to undertake, usually in parallel.
That said, how do you go about planning? How do you start cataloging your data and establish some structure around its evolution? There are traditional solutions like those covered in last year's Forrester Wave report -- Informatica, various IBM offerings, SAS, and Collibra, among others -- but some of these come with a lot of baggage and form part of a vendor's overall platform play.
Meanwhile, a new class of data governance tools is being developed specifically for Hadoop. These tools have less of a legacy, but are also less mature. They are focused on the Hadoop ecosystem rather than your whole organization, allowing you to integrate them more closely with your new data architecture.
Navigator is Cloudera's closed-source offering for data governance. It combines security auditing with metadata management, integrates with traditional data governance products like Informatica, and tracks data lineage automatically.
At its core, it tracks where data came from, what transformations happened to it, where the data landed, and where the heck it's located. You can even set up rules (policies) for automatically tagging data based on its type and origin.
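To make the idea concrete, here is a minimal sketch of lineage tracking plus policy-based tagging. It is not Navigator's API; `DatasetEntry`, `TagPolicy`, `record_transformation`, and `apply_policies` are hypothetical names invented for illustration, and the paths and tags are made up.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Set


@dataclass
class DatasetEntry:
    """Hypothetical catalog entry: origin, what happened to the data, where it lives now."""
    name: str
    origin: str        # upstream system the data came from
    data_type: str     # e.g. "csv" or "parquet"
    location: str      # current storage path
    transformations: List[str] = field(default_factory=list)
    tags: Set[str] = field(default_factory=set)


@dataclass
class TagPolicy:
    """Hypothetical rule: attach `tag` when every non-None criterion matches."""
    tag: str
    origin: Optional[str] = None
    data_type: Optional[str] = None

    def matches(self, entry: DatasetEntry) -> bool:
        origin_ok = self.origin is None or entry.origin == self.origin
        type_ok = self.data_type is None or entry.data_type == self.data_type
        return origin_ok and type_ok


def record_transformation(entry: DatasetEntry, step: str, new_location: str) -> None:
    """Append one lineage step and update where the data landed."""
    entry.transformations.append(step)
    entry.location = new_location


def apply_policies(entry: DatasetEntry, policies: List[TagPolicy]) -> Set[str]:
    """Tag the entry according to every policy that matches its type and origin."""
    for policy in policies:
        if policy.matches(entry):
            entry.tags.add(policy.tag)
    return entry.tags


# Illustrative usage with made-up names and paths.
orders = DatasetEntry("orders", origin="crm", data_type="csv", location="/landing/orders")
record_transformation(orders, "dedupe", "/clean/orders")
policies = [
    TagPolicy("customer-data", origin="crm"),
    TagPolicy("columnar", data_type="parquet"),
]
apply_policies(orders, policies)
```

After this runs, `orders` carries the `customer-data` tag but not `columnar`, and its lineage shows one `dedupe` step ending at `/clean/orders`. A real governance tool would persist these records and capture the lineage automatically instead of relying on explicit calls.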
Navigator also lets you trigger actions based on these policies, although some of those actions, such as archiving or moving data, aren't necessarily best done in Navigator. Among my biggest concerns is that you can trigger auditing with or without Sentry, Cloudera's authorization module for Hadoop.
On the one hand, "choice is good," but on the other hand, if you go to the condiment counter at a fast food joint and find 15 brands of generic ketchup packets, which do you choose? I don't really need multiple paths for an audit implementation. I just want to log the stuff already, and I don't care about choice for that.