1. Who owns the data? This question is critical in organizations where IT is responsible for purchasing and planning infrastructure but doesn't participate in the utilization and management. Compliance and legal staff are also interested in understanding data ownership and custodians.
2. How do you capture this data in real time and share it? There aren't attractive options for doing this with unstructured data across an organization. Some of the current solutions periodically scan file shares to identify new or changed data and copy the contents to big data processing systems such as Hadoop. This approach puts unnecessary load on the file server and results in at least two copies of the original data, ballooning storage costs and management overhead.
3. Which properties of unstructured data should be identified? The answer to this question depends heavily on the company. For example, some compliance requirements cut across many industries, such as the Health Insurance Portability and Accountability Act (HIPAA), the Payment Card Industry Data Security Standard (PCI DSS), and privacy laws that prevent the unencrypted storage of personally identifiable information. Automatically identifying when and where this sensitive data is created and consumed is both challenging and crucial.
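To make the third question more concrete, here is a minimal sketch of what automatic identification of sensitive data could look like. It is an illustration only: the two patterns, the `find_sensitive` and `luhn_valid` names, and the choice of a checksum filter are assumptions made for this example, and real compliance tooling needs far more robust detection.

```python
import re

# Illustrative patterns only (assumptions for this sketch, not a
# compliance-grade detector).
PATTERNS = {
    "us_ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    # 13 to 16 digits, optionally separated by single spaces or hyphens.
    "card_number": re.compile(r"\b\d(?:[ -]?\d){12,15}\b"),
}


def luhn_valid(candidate: str) -> bool:
    """Return True if the digits in `candidate` pass the Luhn checksum."""
    digits = [int(c) for c in candidate if c.isdigit()]
    total = 0
    for i, d in enumerate(reversed(digits)):
        if i % 2 == 1:  # double every second digit, counting from the right
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return total % 10 == 0


def find_sensitive(text: str) -> list[tuple[str, str]]:
    """Return (label, matched_text) pairs for values that look sensitive."""
    findings = []
    for label, pattern in PATTERNS.items():
        for match in pattern.finditer(text):
            value = match.group()
            # Drop digit runs that fail the checksum to cut false positives.
            if label == "card_number" and not luhn_valid(value):
                continue
            findings.append((label, value))
    return findings
```

Calling `find_sensitive` on the text of each new or changed document would yield a list of labeled matches that could then be recorded against that document.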
When questions like these can be answered efficiently, a whole new set of use cases emerges, saving time and making better use of existing investments. For example, you could quickly locate the newest version of the proposal document or identify who in the marketing department is most knowledgeable about Product X. It is often tremendously difficult for companies to meet these basic information needs.
Visualizing what's possible: data intelligence that works
Visualization has long been an important tool for helping users understand and act on complex information. For example, suppose you had a database of the stock prices of all companies, taken at one-second intervals across all of 2008. Looking only at individual rows of data in a table, how long would it take you to determine how the market did overall and how several specific stocks did relative to the market?
Stock charts are one category of visualization. They can rapidly show performance over time, compare the relative performance of multiple securities, and show additional derived information such as moving averages and the relative strength index.
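As a rough sketch of how such derived series come out of raw prices, the functions below compute a simple moving average and a relative strength index from a list of closing prices. The function names are hypothetical, and the RSI here uses plain averages of gains and losses over the window instead of the smoothed averages that charting packages commonly apply, so treat it as an illustration of the idea and not as a reference implementation.

```python
def simple_moving_average(prices: list[float], window: int) -> list[float]:
    """Average of each run of `window` consecutive prices."""
    if window <= 0:
        raise ValueError("window must be positive")
    return [
        sum(prices[i - window + 1 : i + 1]) / window
        for i in range(window - 1, len(prices))
    ]


def relative_strength_index(prices: list[float], period: int = 14) -> float:
    """RSI over the last `period` price changes, using simple averages."""
    changes = [later - earlier for earlier, later in zip(prices, prices[1:])]
    if period <= 0 or len(changes) < period:
        raise ValueError("need at least period + 1 prices")
    recent = changes[-period:]
    avg_gain = sum(c for c in recent if c > 0) / period
    avg_loss = sum(-c for c in recent if c < 0) / period
    if avg_loss == 0:
        # No losing intervals in the window: this sketch reports the maximum.
        return 100.0
    relative_strength = avg_gain / avg_loss
    return 100.0 - 100.0 / (1.0 + relative_strength)
```

Plotting the output of `simple_moving_average` on top of the raw price series is what produces the familiar smoothed line on a stock chart.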
Visualization tools like this are extremely valuable, but they have traditionally operated exclusively on highly structured data, like stock prices and sales records. As organizations create and consume massive amounts of unstructured data, there is tremendous value in extending visualization functionality to include unstructured data.
When attempting to visualize unstructured data, it is crucial for organizations to create and maintain a rich set of metadata: structured data that describes the unstructured data. This metadata can be fed directly into visualization tools, or it can be used to link the unstructured data with existing structured data. Manually extracting and identifying metadata is costly and impractical, so some form of automatic annotation should be part of the solution.
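A minimal sketch of automatic annotation, assuming the unstructured data lives in an ordinary directory tree, might build one structured record per file from information the file system already holds. The `describe_file` and `annotate_tree` names and the choice of fields are assumptions for illustration; a real system would add richer attributes such as owner, content type detected from the data itself, or extracted keywords.

```python
import mimetypes
from datetime import datetime, timezone
from pathlib import Path


def describe_file(path: Path) -> dict:
    """Build a structured metadata record for one file."""
    info = path.stat()
    # Guess the type from the file name only; the content is not inspected.
    mime_type, _ = mimetypes.guess_type(path.name)
    return {
        "path": str(path),
        "extension": path.suffix.lower(),
        "mime_type": mime_type or "unknown",
        "size_bytes": info.st_size,
        "modified_utc": datetime.fromtimestamp(
            info.st_mtime, tz=timezone.utc
        ).isoformat(),
    }


def annotate_tree(root: Path) -> list[dict]:
    """Describe every regular file under `root`, in sorted path order."""
    return [describe_file(p) for p in sorted(root.rglob("*")) if p.is_file()]
```

The resulting list of dictionaries is already structured data, so it could be loaded into a table or passed to a charting tool, for example to show how much data of each type was modified per month.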