The way that big data gets big is through a constant stream of incoming data. In high-volume environments, that data arrives at incredible rates, yet still needs to be analyzed and stored.
John Hugg, software architect at VoltDB, proposes that instead of simply storing that data to be analyzed later, perhaps we've reached the point where it can be analyzed as it's ingested while still maintaining extremely high intake rates using tools such as Apache Kafka.
— Paul Venezia
Less than a dozen years ago, it was nearly impossible to imagine analyzing petabytes of historical data using commodity hardware. Today, Hadoop clusters built from thousands of nodes are almost commonplace. Open source technologies like Hadoop reimagined how to efficiently process petabytes upon petabytes of data using commodity and virtualized hardware, making this capability available cheaply to developers everywhere. As a result, the field of big data emerged.
A similar revolution is happening with so-called fast data. First, let's define fast data. Big data is often created by data that is generated at incredible speeds, such as click-stream data, financial ticker data, log aggregation, or sensor data. Often these events occur thousands to tens of thousands of times per second. No wonder this type of data is commonly referred to as a "fire hose."
When we talk about fire hoses in big data, we're not measuring volume in the typical gigabytes, terabytes, and petabytes familiar to data warehouses. We're measuring volume in terms of time: the number of megabytes per second, gigabytes per hour, or terabytes per day. We're talking about velocity as well as volume, which gets at the core of the difference between big data and the data warehouse. Big data isn't just big; it's also fast.
The benefits of big data are lost if fresh, fast-moving data from the fire hose is dumped into HDFS, an analytic RDBMS, or even flat files, because the ability to act or alert right now, as things are happening, is lost. The fire hose represents active data, immediate status, or data with ongoing purpose. The data warehouse, by contrast, is a way of looking though historical data to understand the past and predict the future.
Acting on data as it arrives has been thought of as costly and impractical if not impossible, especially on commodity hardware. Just like the value in big data, the value in fast data is being unlocked with the reimagined implementation of message queues and streaming systems such as open source Kafka and Storm, and the reimagined implementation of databases with the introduction of open source NoSQL and NewSQL offerings.
Capturing value in fast data
The best way to capture the value of incoming data is to react to it the instant it arrives. If you are processing incoming data in batches, you've already lost time and, thus, the value of that data.
Sign up for CIO Asia eNewsletters.