
How to get your mainframe's data for Hadoop analytics

Andrew C. Oliver | July 1, 2016
IT's mainframe managers don't want to give you access but do want the mainframe's data used. Here's how to square that circle

Many so-called big data -- really, Hadoop -- projects have patterns. Many are merely enterprise integration patterns that have been refactored and rebranded. Of those, the most common is the mainframe pattern.

Because most organizations run the mainframe and its software as a giant single point of failure, the mainframe team hates everyone. Its members hate change, and they don't want to give you access to anything.

However, there is a lot of data on that mainframe and, if it can be done gently, the mainframe team is interested in people learning to use the system rather than starting from scratch. After all, the company has only begun to scratch the surface of what the mainframe and the existing system have available.

Many otherwise great data integration techniques can't be used in an environment where new software installs are highly discouraged, as is the case in the mainframe pattern. Rest assured, though, that there are plenty of techniques for getting around these limitations.

Sometimes the goal of mainframe-Hadoop or mainframe-Spark projects is just to look at the current state of the world. More frequently, however, the goal is to do trend analysis and track changes in a way that the existing system doesn't. This requires techniques covered by change data capture (CDC).
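To make the difference between "current state" and "tracked changes" concrete, here is a minimal Python sketch. It assumes a made-up change-event shape (`ts`, `op`, `key`, `row`); real CDC tools define their own formats, and the `replay` function is purely illustrative.

```python
from collections import defaultdict
from typing import Any, Dict, List, Tuple

# Hypothetical event shape: {"ts": int, "op": "insert"|"update"|"delete",
# "key": Any, "row": dict or absent}. This is an assumption for illustration.
Event = Dict[str, Any]


def replay(events: List[Event]) -> Tuple[Dict[Any, dict], Dict[Any, list]]:
    """Replay change events in order.

    Returns (current, history):
      current -- latest row per key, which is all a plain snapshot gives you
      history -- every change per key, which is what trend analysis needs
    """
    current: Dict[Any, dict] = {}
    history: Dict[Any, list] = defaultdict(list)
    for event in events:
        key = event["key"]
        history[key].append((event["ts"], event["op"], event.get("row")))
        if event["op"] == "delete":
            current.pop(key, None)  # the row no longer exists in the source
        else:
            current[key] = event["row"]  # insert or update: keep latest row
    return current, dict(history)
```

A snapshot of the source table only ever yields `current`. The `history` side exists only if each change is captured as it happens, which is the job of CDC.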

Technique 1: Log replication

Database log replication is the gold standard, and many tools provide it. They require an install on the mainframe side and a receiver either on Hadoop or nearby.

All the companies that produce this software tell you that there is no impact on the mainframe. Do not repeat any of the salesperson's nonsense to your mainframe team, as they will begin to regard you with a very special kind of disdain and stop taking your calls.

After all, it is software, running on the mainframe, so it consumes resources and there is an impact.

The way log replication works is simple: DB2 (or your favorite database) writes redo logs as it writes to a table; the log-replication software reads and deciphers those logs, then sends a message (such as a JMS, Kafka, MQSeries, or Tibco-style message) to a receiver on the other end, which writes it to Hadoop (or wherever) in the appropriate format. Frequently, you can tune this anywhere from a single write per change to batches of writes.
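The receiving end of that pipeline can be sketched in a few lines of Python. This is not any vendor's API: the event dictionaries, the `batches` and `receive` functions, and the `write_batch` callback are all hypothetical stand-ins. In practice the events would arrive from a message broker and `write_batch` would append to a file on HDFS or a similar sink.

```python
import json
from typing import Any, Callable, Dict, Iterable, Iterator, List

# Hypothetical decoded log record; real products define their own formats.
ChangeEvent = Dict[str, Any]


def batches(events: Iterable[ChangeEvent], batch_size: int) -> Iterator[List[ChangeEvent]]:
    """Group a stream of change events into lists of at most batch_size."""
    batch: List[ChangeEvent] = []
    for event in events:
        batch.append(event)
        if len(batch) >= batch_size:
            yield batch
            batch = []
    if batch:  # flush the final, possibly short, batch
        yield batch


def receive(
    events: Iterable[ChangeEvent],
    write_batch: Callable[[str], None],
    batch_size: int = 1,
) -> int:
    """Write events to the sink as newline-delimited JSON.

    batch_size=1 means one write per change; larger values trade latency
    for fewer writes. Returns the number of writes performed.
    """
    writes = 0
    for batch in batches(events, batch_size):
        payload = "\n".join(json.dumps(event, sort_keys=True) for event in batch)
        write_batch(payload)  # assumed sink call, e.g. append to a file
        writes += 1
    return writes
```

For example, calling `receive(events, sink.append, batch_size=2)` with three events and a plain list as the sink performs two writes: one holding two events and one holding the remaining event.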

The advantage is that this technique gives you a lot of control over how much data gets written and when. It doesn't lock records or tables, but you get good consistency. You can also control the impact on the mainframe.

The disadvantage is that it is another software install, which usually takes a lot of time to negotiate with the mainframe team. Additionally, these products are almost always expensive and priced in part on a sliding scale (companies with more revenue get charged more even if their IT budget isn't big).

