Search plus big data
In many cases, using search alongside Spark or your favorite machine learning library may be the ticket. I've talked about methods for adding search to Hadoop, but there are also methods for adding Spark, Hadoop, or machine learning to search.
After the dust settled on Spark, anyone working with it realized that it wasn't magic beans: There are real issues with working in memory. For data you can index, being able to quickly pull back your working set for analysis is far better than a big, fat I/O pull into memory just to find what you're looking for.
Search and context
But search isn't only a way to solve your "find my working set," memory, or I/O problems. One of the weaknesses of most big data projects is a lack of context. I've talked about this in terms of security, but what about your user experience? While you're streaming every little bit of data you can find about the user, how are you using it to personalize the user experience?
Using the things you know about users (aka signals), you can improve the information you put in front of them. This might mean streaming analytics on the front end of your user interaction and a faceted search on the back end when you show them results or a personalized webpage.
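To make that concrete, here is a minimal sketch of signal-driven faceted search over an in-memory catalog. All data, field names, and the scoring rule are hypothetical; a real deployment would use a search engine rather than Python lists, but the shape of the idea is the same: filter by facets, count facets for the UI, and boost results using user signals.

```python
# Hypothetical faceted search with signal-based personalization.
from collections import Counter

CATALOG = [
    {"id": 1, "title": "Trail runners", "category": "shoes", "brand": "Acme"},
    {"id": 2, "title": "Road runners",  "category": "shoes", "brand": "Zoom"},
    {"id": 3, "title": "Rain jacket",   "category": "outerwear", "brand": "Acme"},
]

def faceted_search(docs, filters, signals):
    """Filter docs by facet selections, then rank by user signals
    (e.g., brands the user has clicked on before)."""
    hits = [d for d in docs
            if all(d.get(field) == value for field, value in filters.items())]
    # Facet counts for the UI: how many hits per brand.
    facets = Counter(d["brand"] for d in hits)
    # Personalize: boost brands the user has interacted with.
    hits.sort(key=lambda d: signals.get(d["brand"], 0), reverse=True)
    return hits, facets

# Signals streamed from the front end: click counts per brand.
hits, facets = faceted_search(CATALOG, {"category": "shoes"}, {"Zoom": 5})
```

The streaming side feeds the `signals` dictionary; the search side consumes it at query time, so personalization stays cheap.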
The search solution
As a data architect, engineer, developer, or scientist, you need more than one or two options in your toolbelt. I get very annoyed at the approach of "let's store a big blob and hope for the best while we pay to sort through it every single time we use it." Some vendors actually seem to espouse that.
Using indexes and search technology, you can compose a better working set. You can also skip implementing machine learning or analytics and simply "pick" the data out of storage via criteria -- and, via signals, even personalize data for users based on your data streams. Search is good. Use it.
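The "pick the data via criteria" idea boils down to an inverted index: build postings once, then intersect them to fetch only the matching records instead of scanning the whole store every time. This toy sketch (hypothetical records and field names) shows the mechanics:

```python
# Toy inverted index: (field, value) -> set of matching record ids.
from collections import defaultdict

records = [
    {"id": 0, "region": "us", "status": "active"},
    {"id": 1, "region": "eu", "status": "active"},
    {"id": 2, "region": "us", "status": "inactive"},
]

# Build the index once, up front.
index = defaultdict(set)
for rec in records:
    for field in ("region", "status"):
        index[(field, rec[field])].add(rec["id"])

def working_set(*criteria):
    """Intersect posting sets to select ids matching all criteria,
    then pull back only those records."""
    ids = set.intersection(*(index[c] for c in criteria))
    return [records[i] for i in sorted(ids)]

subset = working_set(("region", "us"), ("status", "active"))
```

Each query touches only the posting sets and the handful of records that survive the intersection -- the pay-once-at-index-time trade-off that makes search a cheaper path to your working set than repeated full scans.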