Untangling the Definition of Unstructured Data

With the growing number of articles, investments, and activities featuring technologies that support information governance, auto-classification, and textual analytics, it is important to understand the similarities and differences between structured and unstructured data. The following short extract highlights several of them and also shares a reference link to additional considerations and approaches for classifying data.

Extract from an article by Bill Inmon of Forest Rim Technology

There are at least two very different schools of thought about what is and what is not structured data. One school of thought, as stated previously, is that everything not in a standard DBMS is unstructured. Another is that something is unstructured only if there is no rational way to explain its structure. These are two very different interpretations of what is meant by unstructured. Both viewpoints are perfectly rational and valid, yet they conflict with each other. They are also just two viewpoints, and there are undoubtedly others on what constitutes structured and unstructured data.

Based on some recent research, another, less confusing way of classifying data exists. That classification involves looking at the repetition of data occurrences. Repetitive data is data in which each record appears very similar to every other record. The records are similar in terms of size and structure, and in many cases, even their content is the same. Examples of repetitive data, and there are many, include metering data; click-stream data; telephone call records data, such as time of call, the caller’s telephone number, and the call’s length; analog data; and so on.

The converse of repetitive data, nonrepetitive data, is data in which each occurrence is unique in terms of content—that is, each nonrepetitive record is different from the others. Any similarity of record content, size, or structure that may exist among nonrepetitive data is strictly a matter of chance. There are many different forms of nonrepetitive data, and examples include emails, call center conversations, corporate contracts, warranty claims, insurance claims, and so on.

The many distinctions between repetitive and nonrepetitive data are important. But perhaps the most important distinction is the pattern of business value. With repetitive data, the typical situation is that there are many occurrences, of which only a few records are of real business value.
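
The repetitive/nonrepetitive distinction from the extract can be made concrete with a small sketch. It is an illustration only, not a method from the article: the function names, field names, and sample records are hypothetical, and "similar structure" is reduced to the simplifying assumption that two records share the same field names and value types. Record size and content, which the extract also mentions, are ignored here.

```python
from collections import Counter


def structure_signature(record: dict) -> tuple:
    """Return the record's field names and value types, ignoring the values."""
    return tuple(sorted((name, type(value).__name__) for name, value in record.items()))


def repetition_ratio(records: list) -> float:
    """Fraction of records that share the single most common structure."""
    if not records:
        return 0.0
    counts = Counter(structure_signature(r) for r in records)
    return counts.most_common(1)[0][1] / len(records)


# Hypothetical call records: every record has the same fields and types.
call_records = [
    {"time_of_call": "09:01", "caller": "555-0100", "length_s": 62},
    {"time_of_call": "09:03", "caller": "555-0101", "length_s": 15},
    {"time_of_call": "09:07", "caller": "555-0102", "length_s": 240},
]

# Hypothetical nonrepetitive items: each one has its own shape.
documents = [
    {"subject": "Renewal", "body": "Please find the contract attached."},
    {"claim_id": 17, "narrative": "The unit failed after two weeks.", "photos": 3},
    {"transcript": "Caller asked about billing.", "agent": "A12",
     "duration_s": 410, "escalated": False},
]

print(repetition_ratio(call_records))
print(repetition_ratio(documents))
```

With these samples, the call records score 1.0 because all three share one structure, while the mixed documents score one third because no two share a structure. A real classifier would need a richer notion of similarity than this.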

  • Craig Ball

    I challenge relegating e-mail to the realm of unstructured data. In fact, almost all e-mail exists within a DBMS (Exchange, Domino, Outlook, Gmail, etc.) and all e-mail is governed by a regimented structure per the RFCs (header, message body, encoded attachments; each segment, in turn, adherent to an imposed structure to transit the network). The structure of every e-mail message is identical or nearly identical to the structure of every other e-mail message, even if the content is not. The message is the record, and its field structure is consistent and repetitive.

  • ComplexD

    I think that is a reasonable challenge, given the imprecision in defining structured and unstructured data. I have seen the term ‘semi-structured’ used in many cases where data is structured yet lacks a strict data model. Emails are classified by some as ‘semi-structured.’ (Webopedia)
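
The point raised in the comments, that e-mail messages share a regimented structure even when their content differs, can be checked with Python's standard `email` package. The sketch below is only an illustration: the addresses, subjects, bodies, attachment name, and the helper names `build_message` and `describe_structure` are all made up for the example. It builds two messages with unrelated content and compares their structural outline (header names, MIME part types, attachment file names).

```python
from email import message_from_string, policy
from email.message import EmailMessage


def build_message(subject: str, body: str) -> str:
    """Build a sample message with a text body and one binary attachment."""
    msg = EmailMessage()
    msg["From"] = "alice@example.com"   # hypothetical address
    msg["To"] = "bob@example.com"       # hypothetical address
    msg["Subject"] = subject
    msg.set_content(body)
    msg.add_attachment(b"sample bytes", maintype="application",
                       subtype="octet-stream", filename="claim.bin")
    return msg.as_string()


def describe_structure(raw: str) -> dict:
    """Summarize a message's structure while ignoring its content."""
    msg = message_from_string(raw, policy=policy.default)
    return {
        "header_names": sorted(name.lower() for name in msg.keys()),
        "part_types": [part.get_content_type() for part in msg.walk()],
        "attachments": [part.get_filename() for part in msg.walk()
                        if part.get_filename()],
    }


first = describe_structure(
    build_message("Warranty claim", "The unit failed after two weeks."))
second = describe_structure(
    build_message("Lunch?", "Are you free at noon on Thursday?"))

print(first["part_types"])
print(first == second)
```

Under these assumptions the two outlines compare equal even though the subjects and bodies have nothing in common, which is the sense in which the message is a record with consistent fields. The free-text body inside that record is still nonrepetitive, which may be why some sources settle on the label ‘semi-structured.’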