How to work with firehose data

Brian Proffitt | Jan. 31, 2012
It goes without saying that big data is, well, big. But it's not just the size that's an obstacle when dealing with big data: it's the rate at which that data can be coming into your data storage infrastructure.

Not too long ago, before the age of data automation, data would typically come into an organization at predetermined, set times: a good example would be business hours. When data was entered from nine to five, it grew at perfectly predictable rates and was accessed and analyzed at equally predictable rates. Even better, the downtime that occurred when everyone went home at night would enable the night-owl DBAs to make updates and repairs to the database in question.

There may have even been - are you sitting? - overtime pay in it for them.

Many businesses still work with data in this manner, some even exclusively. (Gone, in many cases, is this strange word known as "overtime.") But more and more often, data is coming in from automated sources that have no downtime and could be firing data at an organization every second of every day - and in significant amounts, at that.

This, then, is what the data gurus call firehose data - a steady and powerful stream of data that your IT infrastructure may be required to manage and, when all is said and done, actually use for business decisions.

According to Josh Berkus, CEO of PostgreSQL Experts Inc., there are four inherent challenges of working with firehose data. Berkus addressed those characteristics in a talk Jan. 22 at the Southern California Linux Expo.

First, the firehose will have a lot of volume: anywhere from hundreds to thousands of facts per second. That volume may not arrive at a steady rate, Berkus added, as it can have spikes, come from multiple uncoordinated sources, and may grow over time.
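To make that concrete, here is a minimal sketch in Python of one common way to absorb that kind of bursty input: buffer incoming facts and flush them to the database in batches, so a spike in arrivals does not become a spike of single-row writes. The batch size, flush interval, and flush function here are placeholders chosen for illustration, not anything Berkus prescribed.

```python
import time
from collections import deque

# Hypothetical batching collector: absorbs bursts from uncoordinated
# sources by buffering facts and flushing them in fixed-size batches,
# so a spike in the arrival rate does not turn into a spike of
# single-row writes against the database.

BATCH_SIZE = 500        # flush after this many facts...
FLUSH_INTERVAL = 1.0    # ...or after this many seconds, whichever comes first

buffer = deque()
last_flush = time.monotonic()

def flush(batch):
    # Placeholder for a multi-row INSERT / bulk COPY into the target store.
    print(f"writing {len(batch)} facts in one round trip")

def on_fact(fact):
    """Called for every incoming fact, regardless of which source sent it."""
    global last_flush
    buffer.append(fact)
    now = time.monotonic()
    if len(buffer) >= BATCH_SIZE or (now - last_flush) >= FLUSH_INTERVAL:
        batch = list(buffer)
        buffer.clear()
        last_flush = now
        flush(batch)

# Simulate a burst of 1,200 facts arriving at once:
for i in range(1200):
    on_fact({"source": "sensor-17", "value": i})
```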

The second challenge is that, while the volume can vary, the flow itself will be nearly constant, arriving on a 24/7 cycle. This means DBAs can't stop their systems to process the data, nor bring down an entire infrastructure for maintenance. That, plus the fact that data can arrive out of order, means traditional extract, transform, and load (ETL) operations are pretty much not happening.
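As an illustration of what "no ETL window" can look like in practice, the sketch below appends each fact as it arrives, carrying its own event timestamp, and leaves ordering to query time. The sqlite3 store and the column names are stand-ins chosen for the example, not a recommendation from the talk.

```python
import sqlite3

# Illustrative only: sqlite3 stands in for whatever store actually receives
# the stream. The point is that each fact carries its own event timestamp,
# so rows can be appended in whatever order they arrive and re-ordered at
# query time -- there is no batch window in which to sort or transform them.

db = sqlite3.connect(":memory:")
db.execute("""
    CREATE TABLE facts (
        event_time  TEXT NOT NULL,   -- when the fact happened at the source
        received_at TEXT NOT NULL,   -- when it reached us (possibly much later)
        payload     TEXT NOT NULL
    )
""")
db.execute("CREATE INDEX facts_event_time ON facts (event_time)")

def ingest(event_time, received_at, payload):
    # Append-only write; no reordering, no transform step, no downtime window.
    db.execute("INSERT INTO facts VALUES (?, ?, ?)",
               (event_time, received_at, payload))

# Facts arriving out of order:
ingest("2012-01-22T10:00:05", "2012-01-22T10:00:09", "sensor A reading")
ingest("2012-01-22T10:00:01", "2012-01-22T10:00:10", "sensor B reading")  # late arrival

# Ordering is imposed when the data is read, not when it is loaded.
for row in db.execute("SELECT event_time, payload FROM facts ORDER BY event_time"):
    print(row)
```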

The third obstacle, Berkus told his audience, was that the database itself was going to be large - with multiple terabytes to petabytes of data to be handled.

"This means a lot of hardware," Berkus said, "because single-node database management systems aren't going to be enough." It's not just catching the data, either - analytic operations on data sets this big are also extremely resource-intensive. And, complicating this issue is that issue of database growth, "because no one ever wants to throw data away."

 
