How to work with firehose data

Brian Proffitt | Jan. 31, 2012
It goes without saying that big data is, well, big. But it's not just the size that's an obstacle when dealing with big data: it's also the rate at which that data comes into your storage infrastructure.

The final hurdle is dealing with component failure. "All components fail," Berkus declared, "but collection of data must continue - even if the network fails."

These are the challenges Berkus and his team meet head-on when they work on new projects. One such project PostgreSQL Experts tackled recently was for Upwind, a wind farm management company.

Windmills, it seems, generate a lot of data. You might say a hurricane's worth: speed, wind speed, heat, and vibration are just some of the data generated every second by those massive power generators that dot landscapes across the planet. In fact, Berkus explained, each turbine can push out 90 to 700 facts per second. With 100 turbines per farm, and Upwind managing more than 40 such farms, the company has upwards of 300,000 facts per second to contend with.
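
A quick back-of-the-envelope calculation with the figures above shows the range involved; the per-turbine rates and farm counts are the ones Berkus cited, and the rest is simple multiplication for illustration:

    # Rough ingest estimate using the figures quoted above. The 90-700
    # facts/second range, 100 turbines per farm, and 40 farms come from
    # the talk; the arithmetic itself is just for illustration.
    facts_per_turbine_low, facts_per_turbine_high = 90, 700
    turbines_per_farm = 100
    farms = 40

    low = facts_per_turbine_low * turbines_per_farm * farms     # 360,000
    high = facts_per_turbine_high * turbines_per_farm * farms   # 2,800,000
    print(f"Sustained ingest: {low:,} to {high:,} facts per second")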

Complicating matters, Upwind works with multiple turbine owners, each of whom wants their own windmills' data kept separate and may have different algorithms and analytics for measuring that data, techniques they assuredly do not want their competitors to have.
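
The talk did not spell out exactly how Upwind keeps each owner's data isolated, but one common approach in PostgreSQL is a schema per tenant. A minimal sketch of that idea; the tenant names, connection string, and table layout below are purely illustrative assumptions, not Upwind's actual design:

    import psycopg2

    # Hypothetical per-owner isolation: one schema per turbine owner, so each
    # owner's readings and custom analytics stay out of the others' reach.
    owners = ["owner_a", "owner_b"]  # illustrative tenant names

    conn = psycopg2.connect("dbname=turbines")  # assumed connection string
    with conn, conn.cursor() as cur:
        for owner in owners:
            cur.execute(f"CREATE SCHEMA IF NOT EXISTS {owner}")
            cur.execute(f"""
                CREATE TABLE IF NOT EXISTS {owner}.readings (
                    turbine_id  integer,
                    recorded_at timestamptz,
                    metric      text,
                    value       double precision
                )
            """)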

Berkus described to his audience a series of solutions featuring very elastic connections to handle the peaks and valleys of incoming data, plus a parallel infrastructure to meet the multi-tenant requirements. The systems he described also had to cope with out-of-order data (connections from the turbines to the datacenter can and do fail occasionally) and to be extremely fault-tolerant.
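
A minimal sketch of the buffering idea, assuming each reading carries its own measurement timestamp so late or out-of-order arrivals can still be slotted into the right place later; the queue, batch size, and retry logic here are illustrative, not Upwind's actual collector code:

    from collections import deque

    # Illustrative collector-side buffer: keep collecting even when the link
    # to the datacenter is down, and ship readings in batches when it is up.
    buffer = deque()

    def collect(reading):
        # Each reading should carry its own measurement timestamp, so the
        # datacenter can order it correctly no matter when it arrives.
        buffer.append(reading)

    def flush(send_batch, batch_size=1000):
        # Try to ship buffered readings; on failure, put the batch back so
        # nothing is lost and the flush can be retried later.
        while buffer:
            batch = [buffer.popleft() for _ in range(min(batch_size, len(buffer)))]
            try:
                send_batch(batch)
            except IOError:
                buffer.extendleft(reversed(batch))
                break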

Naturally, the solution Berkus described had multiple PostgreSQL nodes at its heart, but there were also Hadoop nodes in place to manage the multiple time-based rollup tables the system uses to work with aggregate data.
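
Time-based rollup tables of this kind typically aggregate raw readings into coarser buckets. A minimal sketch of an hourly rollup in PostgreSQL; the table and column names are hypothetical stand-ins, not the real Upwind schema:

    import psycopg2

    # Hypothetical hourly rollup: 'readings' and 'readings_hourly' stand in
    # for whatever the real raw and aggregate tables are called.
    ROLLUP_SQL = """
        INSERT INTO readings_hourly (turbine_id, metric, hour, avg_value, max_value)
        SELECT turbine_id,
               metric,
               date_trunc('hour', recorded_at) AS hour,
               avg(value),
               max(value)
        FROM readings
        WHERE recorded_at >= %s AND recorded_at < %s
        GROUP BY turbine_id, metric, date_trunc('hour', recorded_at)
    """

    def roll_up(conn, window_start, window_end):
        # Aggregate one time window of raw readings into the hourly table.
        with conn, conn.cursor() as cur:
            cur.execute(ROLLUP_SQL, (window_start, window_end))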

Berkus summarized his discussion with five key elements that a firehose data system has to have:

  • Queuing software to manage out-of-sequence data
  • Buffering techniques to deal with component outages
  • Materialized views that update data into aggregate tables
  • Configuration management for all the systems in the solution
  • Comprehensive monitoring to look for failures (see the sketch after this list)
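
As one concrete illustration of that last point, a monitoring check might simply flag turbines that have stopped reporting. A minimal sketch; the table name, columns, and five-minute threshold are assumptions, not details from the talk:

    import psycopg2

    # Hypothetical staleness check: list turbines whose latest reading is
    # older than five minutes, which usually means a failed link or sensor.
    STALE_SQL = """
        SELECT turbine_id, max(recorded_at) AS last_seen
        FROM readings
        GROUP BY turbine_id
        HAVING max(recorded_at) < now() - interval '5 minutes'
    """

    def stale_turbines(conn):
        # Return (turbine_id, last_seen) pairs for turbines that have gone quiet.
        with conn.cursor() as cur:
            cur.execute(STALE_SQL)
            return cur.fetchall()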

With these key principles in place, firehose data management becomes survivable for any business that needs to handle a constant flow of data without getting knocked down.

 
