The term "unstructured data" is truly an oxymoron. All data has structure, and in fact most data has multiple structures that allow us to inspect, analyse, transform, and derive value from it. The big question we need to ask is not "Is the data structured?", but rather "Does our current understanding of the data's structures support the operations we desire to perform?"
Consider the example of a large set of Web pages. It is possible to have a number of progressively more refined structural understandings of this information, such as:
1. The data is a sequence of 0s and 1s, i.e. binary information.
2. There are files and directories with a few descriptive details - name, size, creation date, etc.
3. File content is "marked up" with HTML tags providing an even richer structural understanding.
4. README files, style sheets, XML schemas and the like may exist in the data set to tell us even more.
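As a rough illustration of the first three levels, the sketch below looks at one and the same file as bits, as a file-system entry, and as HTML markup. The file name, the sample page and the helper `structural_views` are invented for this example, and level 4 (companion files such as schemas) is left out for brevity.

```python
import os
import tempfile
from html.parser import HTMLParser


class TagNames(HTMLParser):
    """Record the name of every opening tag, in document order."""

    def __init__(self):
        super().__init__()
        self.tags = []

    def handle_starttag(self, tag, attrs):
        self.tags.append(tag)


def structural_views(path):
    """Return three progressively richer views of the same file."""
    with open(path, "rb") as f:
        raw = f.read()

    # Level 1: the data is just bits (only the first two bytes are shown).
    bits = "".join(f"{byte:08b}" for byte in raw[:2])

    # Level 2: the file system adds a name, a size, timestamps, etc.
    info = os.stat(path)
    file_view = {"name": os.path.basename(path), "size": info.st_size}

    # Level 3: the HTML markup exposes the tags used in the content.
    parser = TagNames()
    parser.feed(raw.decode("utf-8"))
    parser.close()

    return {"bits": bits, "file": file_view, "tags": parser.tags}


# Hypothetical sample page, written to a temporary directory.
with tempfile.TemporaryDirectory() as folder:
    sample = os.path.join(folder, "index.html")
    with open(sample, "w", encoding="utf-8") as f:
        f.write("<html><head><title>Home</title></head>"
                "<body><h1>Hi</h1></body></html>")
    views = structural_views(sample)

print(views)
```

Each view describes exactly the same bytes; what changes is how much of the structure we have chosen to recognise.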
Even though this data may not be structured in a way as traditional as database records, it is structured. What we do not know, at least not yet, is whether our understanding of this structure supports the operations we want to perform.
This question ultimately comes down to how much of the semantics, the meaning of the information, is represented in the structural understanding that we currently have. In a database we can, in a very standard and well-known way, find a "schema" that tells us where each data element can be found within the structure. There is also robust metadata, descriptive information about the data, which further explains the data elements. This includes human-readable labels; data types; the organisation of data elements into "entities" (e.g. the first name and last name data elements belong to an entity called Student); constraints on the data; relationships between entities (e.g. Student "studies-with" Teacher); and more.
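To make this concrete, here is a minimal sketch using Python's built-in sqlite3 module. The Student and Teacher tables, their columns and the helper functions `describe` and `relationships` are assumptions made up for this illustration; the point is only that a database can be asked, in a standard way, what its data elements are and how they relate.

```python
import sqlite3

# Hypothetical schema: table and column names are invented for illustration.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE teacher (
    id        INTEGER PRIMARY KEY,
    last_name TEXT NOT NULL
);
CREATE TABLE student (
    id           INTEGER PRIMARY KEY,
    first_name   TEXT NOT NULL,
    last_name    TEXT NOT NULL,
    studies_with INTEGER REFERENCES teacher(id)
);
""")


def describe(conn, table):
    """Return (column name, declared type, NOT NULL flag) per column."""
    rows = conn.execute(f"PRAGMA table_info({table})").fetchall()
    # Each row is: cid, name, type, notnull, dflt_value, pk
    return [(name, ctype, bool(notnull))
            for _cid, name, ctype, notnull, _default, _pk in rows]


def relationships(conn, table):
    """Return (local column, referenced table, referenced column)."""
    rows = conn.execute(f"PRAGMA foreign_key_list({table})").fetchall()
    # Each row is: id, seq, table, from, to, on_update, on_delete, match
    return [(row[3], row[2], row[4]) for row in rows]


print(describe(conn, "student"))
print(relationships(conn, "student"))
```

The schema answers "where is the first name, what type is it, and which Teacher does this Student study with?" without our having to inspect a single record.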
In an HTML file, on the other hand, the structure is not always as revealing of the deeper meaning. I can probably figure out that a particular piece of data is a title when it is found within a <title></title> tag-set. I may know that another piece of data should be underlined or emphasised because of how it is tagged, but I would not know with any confidence why. Presumably this information is important, but at this level of structural understanding, we run out of clues as to what we can attribute that importance to. Of course, this was by design. The Hyper Text Markup Language (HTML) was designed to structurally convey the meaning of "how to render the information", typically within a Web browser, as visible or audible Web page experiences. So:
Are HTML pages unstructured? Absolutely not.
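The limits of that structure can be sketched with Python's standard html.parser module. The sample page and the `TagCollector` class below are hypothetical; the sketch assumes well-nested markup with no void elements such as <br>.

```python
from html.parser import HTMLParser


class TagCollector(HTMLParser):
    """Collect the text found directly inside selected tags."""

    def __init__(self, wanted):
        super().__init__()
        self.wanted = set(wanted)
        self.stack = []   # currently open tags, innermost last
        self.found = []   # (tag, text) pairs in document order

    def handle_starttag(self, tag, attrs):
        self.stack.append(tag)

    def handle_endtag(self, tag):
        if self.stack and self.stack[-1] == tag:
            self.stack.pop()

    def handle_data(self, data):
        text = data.strip()
        if text and self.stack and self.stack[-1] in self.wanted:
            self.found.append((self.stack[-1], text))


# Hypothetical page, invented for this example.
page = ("<html><head><title>Course Catalogue</title></head>"
        "<body><p>Enrolment closes <em>Friday</em>.</p></body></html>")

collector = TagCollector({"title", "em", "u"})
collector.feed(page)
collector.close()
print(collector.found)
```

The markup tells us that "Course Catalogue" is a title and that "Friday" is emphasised. Nothing in it says that "Friday" is a deadline, or why it matters.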