Unstructured data is an oxymoron

Strictly speaking, “unstructured data” is a contradiction in terms. Data must have structure to be comprehensible. By “unstructured data” people usually mean data with a non-tabular structure.

Tabular data is data that comes in tables. Each row corresponds to a subject, and each column corresponds to a kind of measurement. This is the easiest data to work with.

Non-tabular data could mean anything other than tabular data. In practice it often means text, but it could also mean data with a graph structure or some other structure.
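
As a rough illustration of the difference, here is a minimal Python sketch of the same three subjects held first as a table and then as a graph. The subjects, column names, and values are invented for the example, and `column` and `neighbors` are hypothetical helpers, not part of any library.

```python
# Tabular: one row per subject, one column per kind of measurement.
# All names and values below are made up for illustration.
columns = ("subject", "age", "systolic_bp")
rows = [
    ("A", 34, 118),
    ("B", 51, 131),
    ("C", 47, 125),
]

# Graph: what matters is which subjects are connected to which,
# stored here as an adjacency mapping.
edges = {
    "A": {"B", "C"},
    "B": {"A"},
    "C": {"A"},
}


def column(name):
    """Return every value of one measurement, in row order."""
    i = columns.index(name)
    return [row[i] for row in rows]


def neighbors(node):
    """Return the subjects directly connected to `node`, sorted."""
    return sorted(edges[node])


print(column("age"))
print(neighbors("A"))
```

Asking for a column is natural for the table and meaningless for the graph, and asking for neighbors is the reverse. Neither is "unstructured"; they just have different structure.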

More productive discussions

My point here isn’t to quibble over language usage but to offer a constructive suggestion: say what structure data has, not what structure it doesn’t have.

Discussions about “unstructured data” are often unproductive because two people can use the term, with two different ideas of what it means, and think they’re in agreement when they’re not. Maybe an executive and a sales rep shake hands on an agreement that isn’t really an agreement.

Eventually there will have to be a discussion of what structure data actually has rather than what structure it lacks, and to what degree that structure is exploitable. Having that discussion sooner rather than later can save a lot of money.

Free text fields

One form of “unstructured” data is free text fields. These fields are not free of structure. They usually contain prose, written in a particular language, or at most in a small number of languages. That’s a start. There should be more exploitable structure from context. Is the text a pathology report? A Facebook status? A legal opinion?

Clients will ask how to de-identify free text fields. You can’t. If the text is truly free, it could be anything, by definition. But if there’s some known structure, then maybe there’s some practical way to anonymize the data, especially if there’s some tolerance for error.

For example, a program may search for and mask probable names. Such a program would find “Elizabeth” but might fail to find “the queen.” Since there are only a couple of queens [1], this would be a privacy breach. Such software would also have false positives, such as masking the name of the ocean liner Queen Elizabeth 2. [2]
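
To make both failure modes concrete, here is a minimal sketch of such a name masker in Python. It assumes a tiny hard-coded name list, `KNOWN_NAMES`, and a hypothetical function `mask_probable_names`; real de-identification software would use a much larger dictionary or a trained named-entity recognizer, so treat this only as an illustration of the idea.

```python
import re

# Hypothetical name list, for illustration only.
KNOWN_NAMES = {"Elizabeth", "Margrethe"}


def mask_probable_names(text, names=KNOWN_NAMES, mask="[NAME]"):
    """Replace every whole word that appears in `names` with `mask`."""
    if not names:
        return text
    alternatives = "|".join(re.escape(n) for n in sorted(names))
    pattern = re.compile(r"\b(" + alternatives + r")\b")
    return pattern.sub(mask, text)


examples = [
    "Elizabeth attended the ceremony.",         # name is found and masked
    "The patient said he once met the queen.",  # indirect reference is missed
    "They sailed on the Queen Elizabeth 2.",    # ship name is masked by mistake
]
for sentence in examples:
    print(mask_probable_names(sentence))
```

The first sentence is masked as intended. The second passes through unchanged even though it identifies a person, and the third loses part of a ship's name. Whether errors like these are acceptable depends on how much tolerance for error the application has.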

[1] The Wikipedia list of current sovereign monarchs lists only two women, Queen Elizabeth II of the UK and Queen Margrethe II of Denmark.

[2] The ship, also known as QE2, is Queen Elizabeth 2, while the monarch is Queen Elizabeth II.