Three incremental, manageable steps to building a “data first” data lake

Jack Norris, Senior VP Data and Applications, MapR Technologies | July 26, 2016
Instead of extracting, transforming and loading data into separate analytic clusters or data warehouses, converge data so all applications can use it in real time

To keep costs low while making the data lake scalable, consider using commodity hardware deployed in clusters. And to maximize the data lake’s ultimate potential, use open, standards-based software with published interfaces, plug-ins and other means for integrating with other applications, services and systems. Such an “open first” approach would give preference to technologies like Linux, KVM, Hadoop, Spark, Mesos and OpenStack, and would limit any extensions or enhancements to those based on applicable industry standards, such as SQL or NFS.

To avoid a setback, resist the temptation to take on too much data too soon. Even a partially full data lake (think: reservoirs in California) can provide immediate benefits by offloading at least some data from data warehouses, Web analytics, databases, mainframes and other enterprise storage systems that are orders of magnitude more expensive. So start small, but think big.

Step #2: Begin using the data lake. The second step is to begin achieving those immediate benefits by identifying one or more new applications or use cases that were previously impractical or impossible with disparate data sources. To maximize the potential for a successful first attempt, pick some low-hanging fruit that will be easy to implement and pose minimal risk to the business. But also consider use cases that will be able to leverage a wide and deep data lake.

Examples of initial projects include integrating analytics into some operations, taking advantage of the lake’s increased data variety, volumes and/or velocities, and mining newly available data sources. True, implementing a new application that utilizes new data sources will likely take more effort, but the rewards are likely to be more meaningful to the business.

A good example that is common across virtually all industries is a “Customer 360” application that leverages both existing and new data. Keep it simple, though, at least initially, by using the app only to support a marketing campaign or enhance a CRM application.
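As a rough illustration of what “leveraging both existing and new data” can mean, the sketch below merges existing CRM records with newly available web events into one profile per customer. It is a minimal, in-memory example and not a prescribed design: the function name `build_customer_360`, the field names (`customer_id`, `segment`, `timestamp`) and the sample data are all assumptions made for illustration.

```python
def build_customer_360(crm_records, web_events):
    """Combine existing CRM rows with new web events into one profile per customer.

    All field names here are hypothetical; a real data lake would read
    these records from its own tables or files.
    """
    profiles = {}
    for rec in crm_records:
        profiles[rec["customer_id"]] = {
            "name": rec["name"],
            "segment": rec["segment"],
            "page_views": 0,
            "last_seen": None,
        }

    for ev in web_events:
        profile = profiles.get(ev["customer_id"])
        if profile is None:
            # Event from a visitor with no CRM record; ignored in this sketch.
            continue
        profile["page_views"] += 1
        if profile["last_seen"] is None or ev["timestamp"] > profile["last_seen"]:
            profile["last_seen"] = ev["timestamp"]

    return profiles


# Made-up sample data: two known customers and one unknown visitor.
crm = [
    {"customer_id": "c1", "name": "Ada", "segment": "retail"},
    {"customer_id": "c2", "name": "Lin", "segment": "wholesale"},
]
events = [
    {"customer_id": "c1", "timestamp": 100},
    {"customer_id": "c1", "timestamp": 250},
    {"customer_id": "c9", "timestamp": 300},
]
view = build_customer_360(crm, events)
```

A marketing campaign could then filter `view` by segment and recent activity, which keeps the first use of the app as narrow as suggested above.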

After gaining some experience and competency with the data lake, give serious consideration to taking on some of the use cases that more fully leverage the breadth and depth of its data, especially those applications that enhance revenues, reduce costs, streamline operations, mitigate risk and/or address security needs.

Step #3: Make the data lake real-time. The third step involves putting the data lake to the test with real-time applications. Getting actionable insights in real time is something siloed architectures struggle to do, so this step holds the potential for maximizing the return on the investment in the Data First strategy.

Real-time functionality is at the core of the many new transformational applications that need to perform analytics directly on operational data as it becomes available. These applications are typically specific to each industry, with visible early adopters in the retail, financial services and telecommunications sectors. But what they all share is the need for speed, versatility and extensibility to accommodate diverse requirements, groups and business functions, which is exactly what a data lake is designed to provide.
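To make “analytics directly on operational data as it becomes available” a little more concrete, here is a minimal sketch of a sliding-window count that is updated as each event arrives, rather than after a batch load. The class name `SlidingWindowCounter` and the sample timestamps are hypothetical, and the sketch assumes events arrive in non-decreasing time order; a production system would more likely rely on a streaming engine than on hand-rolled code like this.

```python
from collections import deque


class SlidingWindowCounter:
    """Count events seen within the last `window_seconds` (illustrative only)."""

    def __init__(self, window_seconds):
        self.window_seconds = window_seconds
        self._timestamps = deque()

    def add(self, timestamp):
        """Record one event and return the number of events still in the window.

        Assumes timestamps are passed in non-decreasing order.
        """
        self._timestamps.append(timestamp)
        cutoff = timestamp - self.window_seconds
        # Drop events that have fallen out of the window.
        while self._timestamps and self._timestamps[0] <= cutoff:
            self._timestamps.popleft()
        return len(self._timestamps)


# Made-up event times in seconds, with a 60-second window.
counter = SlidingWindowCounter(window_seconds=60)
counts = [counter.add(t) for t in (0, 10, 50, 65, 130)]
```

Because the count is refreshed on every event, an application could react to it immediately, for example by flagging an unusual burst of transactions, instead of waiting for a periodic extract and load.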

 
