Machine-generated log data is the dark matter of the big data cosmos. It is generated at every layer, node, and component within distributed information technology ecosystems, including smartphones and Internet-of-things endpoints. It is collected, processed, analyzed, and used everywhere, but mostly behind the scenes.
Log data is fundamental to many of the least glamorous enterprise applications, such as troubleshooting, debugging, monitoring, security, antifraud, compliance, and e-discovery. However, it can also be a powerful tool for analyzing clickstream, geospatial, social media, and other logged behavioral data relevant to many customer-centric use cases.
Mortals can barely keep up with machine-logged data. Most of it is not designed or intended for direct human analysis. Unless filtered with brutal efficiency, the extreme volumes, velocities, and varieties of log data can quickly overwhelm human cognition. The authors of this recent Accenture article explain it succinctly:
[A]s the volume and variety of log files rises, it becomes increasingly difficult for log management solutions to parse log files, trace potential issues, and actually find errors -- particularly when cross-log correlations come into play. Even in the best-case scenarios, it requires an experienced operator to follow event chains, filter noise, and eventually diagnose the root cause to a complex problem.
Clearly, automation is key to finding insights within log data, especially as it all scales into big data territory. Automation can ensure that data collection, analytical processing, and rule- and event-driven responses to what the data reveals are executed as rapidly as the data flows. Key enablers for scalable log-analysis automation include machine-data integration middleware, business rules management systems, semantic analysis, stream computing platforms, and machine-learning algorithms.
Among these, machine learning is key to automating and scaling the distillation of insights from log data. But machine learning is not a one-size-fits-all approach to log-data analysis. Different machine-learning techniques are suited to different types of log data and to different analytical challenges. When the correlations and other patterns sought through machine learning can be specified a priori, supervised learning is the way to proceed. However, supervised learning requires a human expert to prepare a reference "training data" set from the log in order to refine a machine-learning algorithm's ability to discern the most relevant patterns.
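To make the supervised scenario concrete, here is a minimal sketch in Python. The labeled training lines, the two labels ("error" and "normal"), and the naive Bayes technique are all illustrative assumptions, not anything prescribed by a particular log-management product; the point is simply that an operator-prepared training set is what lets the algorithm learn which token patterns matter.

```python
import math
from collections import Counter, defaultdict

# Hypothetical training set a human expert prepared from the log:
# (log line, label) pairs. The labels and lines are made up for illustration.
TRAINING = [
    ("ERROR disk read failure on node 3", "error"),
    ("ERROR connection timeout to db host", "error"),
    ("WARN retrying failed request", "error"),
    ("INFO user login successful", "normal"),
    ("INFO backup completed ok", "normal"),
    ("INFO heartbeat received from node 7", "normal"),
]

def train(samples):
    """Count token frequencies per label (multinomial naive Bayes)."""
    word_counts = defaultdict(Counter)
    label_counts = Counter()
    for line, label in samples:
        label_counts[label] += 1
        word_counts[label].update(line.lower().split())
    return word_counts, label_counts

def classify(line, word_counts, label_counts):
    """Pick the label maximizing log P(label) + sum of log P(token | label)."""
    vocab = {w for counts in word_counts.values() for w in counts}
    total = sum(label_counts.values())
    best_label, best_score = None, float("-inf")
    for label in label_counts:
        n = sum(word_counts[label].values())
        score = math.log(label_counts[label] / total)
        for tok in line.lower().split():
            # Laplace smoothing so unseen tokens don't zero out the score.
            score += math.log((word_counts[label][tok] + 1) / (n + len(vocab) + 1))
        if score > best_score:
            best_label, best_score = label, score
    return best_label

word_counts, label_counts = train(TRAINING)
print(classify("ERROR disk failure on node 5", word_counts, label_counts))  # error
```

The essential cost is visible in the `TRAINING` list: someone had to label those lines by hand, and the classifier can only recognize the kinds of patterns the labels anticipate.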
But when the log-data patterns cannot be precisely defined in advance, unsupervised and reinforcement learning may be more appropriate. These are the machine-learning-powered, log-data-analysis scenarios most amenable to full automation, because they can pick out and prioritize the patterns most relevant to the task at hand without the need for human-supplied training-data sets. (For links to further details on these machine-learning approaches, see my recent post.)
Multilog correlation is a core log-data analysis use case for unsupervised and reinforcement learning. As log-data sets from diverse sources are combined and grow more heterogeneous, complex, and inscrutable, the most interesting variables and relationships are not at all clear in advance of the analysis. Consequently, the hidden patterns may remain invisible if we merely try to view them through simple queries, pre-existing reports and dashboards, and other standard analytic views. In these cases, machine learning can surface the most noteworthy patterns for further exploration by using various quantitative approaches such as clustering, Markov models, self-organizing maps, and so forth.
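A minimal sketch of the unsupervised idea: grouping log lines by the fixed structure that remains after variable tokens are masked, with no labels supplied. The sample lines and the digits-only masking rule are assumptions for illustration; production log-clustering tools use far more sophisticated parsing and distance measures, but the principle of letting the data reveal its own groupings is the same.

```python
import re
from collections import Counter

# Hypothetical raw log lines from two different subsystems, unlabeled.
LOG_LINES = [
    "2024-01-05 10:01:02 disk read failed on node 3",
    "2024-01-05 10:01:09 disk read failed on node 17",
    "2024-01-05 10:02:44 user 1041 logged in from 10.0.0.5",
    "2024-01-05 10:03:01 user 2236 logged in from 10.0.0.9",
    "2024-01-05 10:03:15 disk read failed on node 3",
]

def template(line):
    """Mask runs of digits so lines that differ only in timestamps,
    IDs, or addresses collapse into one structural template."""
    return re.sub(r"\d+", "<*>", line)

# Each distinct template is a discovered cluster; its count is the
# cluster size. No training data or predefined pattern was needed.
clusters = Counter(template(line) for line in LOG_LINES)
for tpl, count in clusters.most_common():
    print(count, tpl)
```

Here the disk-failure lines and the login lines fall into two clusters purely from their shape; an analyst can then drill into whichever cluster's size or timing looks anomalous.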