5. Slogging through logs
Whether you're intent on detecting intruders or you think the auditors are coming, you need to capture logs and make them searchable. Splunk has made a killing here, but there are other, more flexible options in big data.
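The core of any log-search tool, Splunk included, is making free-text lines searchable by keyword. As a rough illustration of the idea (not any particular product's implementation), here is a minimal inverted index over log lines; the sample log lines are invented for the example:

```python
import re
from collections import defaultdict

def index_logs(lines):
    """Build a simple inverted index: token -> set of line numbers."""
    index = defaultdict(set)
    for lineno, line in enumerate(lines):
        for token in re.findall(r"[a-z0-9]+", line.lower()):
            index[token].add(lineno)
    return index

def search(index, lines, query):
    """Return the log lines that contain every token in the query."""
    tokens = re.findall(r"[a-z0-9]+", query.lower())
    if not tokens:
        return []
    hits = set.intersection(*(index.get(t, set()) for t in tokens))
    return [lines[i] for i in sorted(hits)]

# Hypothetical sample logs
logs = [
    "2016-03-01 10:02:11 sshd: failed login for root from 10.0.0.5",
    "2016-03-01 10:02:15 sshd: accepted login for alice",
    "2016-03-01 10:03:02 kernel: disk /dev/sda1 nearing capacity",
]
idx = index_logs(logs)
print(search(idx, logs, "failed login"))  # matches only the sshd failure line
```

Real systems distribute the index and add time-range filtering, but the lookup itself is this simple at heart.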
6. Because I don't want to buy Teradata
This is not Teradata's year. Big data has eaten at the edges (including the growth frontier), and now Apache Kylin has made cubes available to everyone. MPP on distributed systems is becoming more credible, thanks to Impala and HAWQ/Greenplum. There's less room for a big, expensive appliance that does only one thing, doesn't fit in with other data analytics efforts, and can't go to the cloud unless it's a single vendor's weird, proprietary cloud.
7. Plain old extract, transform, load
ETL is still probably the most common Hadoop workload today -- and I'd venture to say that ETL is the most common nonstreaming workload for Spark. By the way, 100 startups have come out of the woodwork claiming to handle this task.
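The pattern itself hasn't changed, whatever the engine. A minimal sketch of the three stages in plain Python, using an in-memory SQLite database as the stand-in target and an invented CSV payload as the source:

```python
import csv
import io
import sqlite3

# Extract: raw CSV export (stand-in for whatever source system you pull from)
raw = """id,name,amount
1, Alice ,100.5
2,Bob,
3, Carol ,42
"""

def extract(text):
    """Pull rows out of the source format."""
    return list(csv.DictReader(io.StringIO(text)))

def transform(rows):
    """Clean whitespace, drop rows missing an amount, cast types."""
    out = []
    for row in rows:
        if not row["amount"].strip():
            continue
        out.append((int(row["id"]), row["name"].strip(), float(row["amount"])))
    return out

def load(rows, conn):
    """Write the cleaned rows to the target store."""
    conn.execute("CREATE TABLE IF NOT EXISTS payments (id INTEGER, name TEXT, amount REAL)")
    conn.executemany("INSERT INTO payments VALUES (?, ?, ?)", rows)
    conn.commit()

conn = sqlite3.connect(":memory:")
load(transform(extract(raw)), conn)
print(conn.execute("SELECT name, amount FROM payments ORDER BY id").fetchall())
# [('Alice', 100.5), ('Carol', 42.0)]
```

In Spark the same extract-transform-load chain would run over partitioned data with the same shape, which is why it ports so naturally.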
8. Capture sensor data now, figure out what to do with it later
Whether it's the power grid, manufacturing, water pumps, or people driving around, stuff is telling us things, and those things need to be captured. Some people have even figured out what to do with the data, but capturing it in time is the first big step, because many people find it technically hard.
Note that I usually push people to start thinking about analytics early in their big data projects. Why? Because sizing and designing a flow is far easier up front than rethinking the whole arrangement when the data is already in a box. But sometimes you need to ingest and hope for the best.
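"Ingest and hope for the best" really does reduce to one discipline: record each raw reading with a timestamp and defer every schema decision. A toy sketch of that capture-first pattern (sensor names and values here are invented; in production the append-only log would be something durable like Kafka or HDFS, not a Python list):

```python
import json
import time

class SensorCapture:
    """Append-only capture of raw readings; analysis can come later."""

    def __init__(self):
        # In production this would be a durable, distributed log.
        self.records = []

    def ingest(self, sensor_id, value, ts=None):
        # Store the raw reading untouched; no schema commitments yet.
        self.records.append({
            "sensor": sensor_id,
            "value": value,
            "ts": ts if ts is not None else time.time(),
        })

    def dump(self):
        # One JSON object per line: trivially replayable by later analytics jobs.
        return "\n".join(json.dumps(r, sort_keys=True) for r in self.records)

cap = SensorCapture()
cap.ingest("pump-7", 3.2, ts=1000)
cap.ingest("pump-7", 3.9, ts=1001)
print(cap.dump())
```

The design choice worth noting: because nothing is aggregated or typed at ingest time, a flow you size and design later can still reprocess everything from the raw log.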
Over the past year, I've seen a few other project types, but most use cases fit in one of these eight categories. What do you folks see in the wild?