

Three incremental, manageable steps to building a “data first” data lake

Jack Norris, Senior VP Data and Applications, MapR Technologies | July 26, 2016
Instead of extracting, transforming and loading data into separate analytic clusters or data warehouses, converge data so all applications can use it in real time

Applications have always dictated the data. That has made sense historically and, to some extent, continues to be the case. But an “applications first” approach creates data silos that cause operational problems and prevent organizations from getting the full value from their business intelligence initiatives.

For the last few decades, the accepted best practice has been to keep operational and analytical systems separate in order to prevent data analysis workloads from disrupting business operations. With this approach, any holistic analysis of the data stored in operational systems requires extracting, transforming and loading that data into separate analytic clusters or data warehouses. This process requires additional resources, generates duplicate data and takes considerable time, making it difficult or impossible to achieve the operational agility or algorithmic business processes recommended by Ernst & Young and Gartner, respectively.

A “data first” approach, by contrast, holds the promise of creating an infrastructure capable of capturing and consolidating all data into a converged data store or “data lake,” where many different applications can access it simultaneously and securely, in real time, as it becomes available. Such a converged architecture simplifies data management and protection, supports new applications that combine operations and analytics, and avoids the dreaded “multiple versions of the truth” phenomenon inherent in data silos.

Outlined here are three incremental and manageable steps any organization can take to begin implementing a data first strategy.

Step #1: Create a data lake. Start by creating a data lake, and include as many data sets and sources as possible. To minimize duplication, endeavor to make the data lake serve as a system of record for as many applications as practical by fully migrating their data sets. Then “complete” the data lake by replicating, as needed or desired, data from those existing applications whose data sets cannot be migrated—at least initially—for whatever reason. In other words, migrate what you can, and replicate what you must. To enable more holistic analyses, also be sure to include in the data lake those sources of data that are currently unused, but hold potential value.
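The “migrate what you can, and replicate what you must” rule can be sketched as a simple planning pass over an inventory of data sources. The Python below is only an illustration under stated assumptions: the `DataSource` fields, the `Action` names and the sample source names are hypothetical, and a real inventory would carry far more detail than two yes/no flags.

```python
from dataclasses import dataclass
from enum import Enum


class Action(Enum):
    MIGRATE = "migrate"      # the data lake becomes the system of record
    REPLICATE = "replicate"  # the source application stays the system of record
    INGEST = "ingest"        # currently unused data that may hold value


@dataclass(frozen=True)
class DataSource:
    # Hypothetical inventory record; field names are assumptions.
    name: str
    in_use: bool        # does an application use this data today?
    can_migrate: bool   # can that application move onto the lake now?


def plan_action(source: DataSource) -> Action:
    """Apply the rule: migrate what you can, replicate what you must."""
    if not source.in_use:
        # Unused sources have no owning application to migrate,
        # so they are simply loaded into the lake.
        return Action.INGEST
    return Action.MIGRATE if source.can_migrate else Action.REPLICATE


def build_plan(sources):
    """Map each source name to its planned action."""
    return {s.name: plan_action(s) for s in sources}


if __name__ == "__main__":
    inventory = [
        DataSource("orders_db", in_use=True, can_migrate=True),
        DataSource("legacy_erp", in_use=True, can_migrate=False),
        DataSource("web_logs", in_use=False, can_migrate=False),
    ]
    for name, action in build_plan(inventory).items():
        print(f"{name}: {action.value}")
```

Sources planned for replication can be revisited later, since the article notes that some data sets cannot be migrated “at least initially.”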

While filling the data lake, be mindful of the requirements of any shared data environment: a global namespace, unified security, high availability, high performance, multi-tenancy and data protection (replication, backup/restore and disaster recovery). Of these requirements, the only one that might be new or substantially different with a data first data lake is the need for multi-tenancy. Because the consolidated and converged data will need to be shared simultaneously by different applications and users in different roles across different departments, it will be important to support the various “tenants” in a way that preserves data availability, security and integrity.
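One way to picture multi-tenancy over a global namespace is a per-tenant set of path prefixes that the tenant may read or write. The Python below is a minimal sketch, not a description of any particular product: the `Tenant` class, the `/lake/...` paths and the choice that write access implies read access are all assumptions made for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Tenant:
    # Hypothetical tenant record; prefixes are paths in one shared namespace.
    name: str
    readable: set = field(default_factory=set)
    writable: set = field(default_factory=set)


def _covers(prefixes, path):
    """True if path equals a prefix or sits below it in the namespace."""
    return any(
        path == p or path.startswith(p.rstrip("/") + "/")
        for p in prefixes
    )


def can_read(tenant, path):
    # Assumption: write access implies read access.
    return _covers(tenant.readable | tenant.writable, path)


def can_write(tenant, path):
    return _covers(tenant.writable, path)


if __name__ == "__main__":
    sales = Tenant("sales", writable={"/lake/sales"})
    marketing = Tenant(
        "marketing",
        readable={"/lake/sales"},
        writable={"/lake/marketing"},
    )
    path = "/lake/sales/2016/orders"
    print(can_write(sales, path))      # sales owns this data
    print(can_read(marketing, path))   # marketing may analyze it
    print(can_write(marketing, path))  # but may not change it
```

Here both departments work against the same copy of the sales data, which is the point of the converged store, while only the owning tenant can modify it.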


