Campisi equates this role to actuaries in the insurance industry, those "data scientists of their time" who analyzed data and came up with models or made predictions. "Now every industry is going to have that actuarial type of person that we now call data scientists, who just work at connecting and stitching together this information," he says. "They'll try and find some relationship that no one's thought of, or some curve that's very valuable to know but that no one else has found the formula for yet."

- Stacy Collett
"We have a column in the database called 'State' on every single person's record." But in a database of 300 million registered voters, "it only appears in our database 50 times," he says. "In [row-based open-source relational database management systems like] Postgres and MySQL, it appeared 300 million times. So if you replicate that level of compression on everything from street names to the last name Smith, that plus other compression algorithms buys you tremendous savings in terms of storage space. So your choice of database technology really does affect how much storage you need."
On the storage side, deduplication, compression and virtualization continue to help companies reduce the size of files and the amount of data that is stored for later analysis. And data tiering is a well-established option for bringing the most critical data to analytics tools quickly.
Solid-state drives (SSDs) are another popular storage medium for data that must be readily available. A flash-based technology that has become the top layer in data tiering, SSDs can return data with very short response times, Csaplar says. "SSDs hold the data very close to processors to enable the servers to have the I/O to analyze the data quickly," he says. Once considered too expensive for many companies, SSDs have come down in price to the point where "even midsize companies can afford layers of SSDs between their disks and their processors," says Csaplar.
Cloud-based storage is playing an increasingly important role in big data storage strategies. In industries where companies have operations around the world, such as oil and gas, data generated by sensors is being sent directly to the cloud and stored there, and in many cases the analytics are performed there as well.
"If you're gathering data from 10 or more sources, you're more than likely not backlogging it into a data center" because that isn't cost-effective with so much data, says IDC's Nadkarni.
GE, for instance, has for years analyzed data from machine sensors, using this "machine-to-machine" big data to plan aircraft maintenance. Campisi says the data collected in just a few hours from the blade of a power plant gas turbine can dwarf the amount of data that a social media site collects all day.