Why data goes dark and why this matters

Bielefeld
July 3rd, 2018

There is an interesting parallel between the pair of matter and dark matter on the one side and data and dark data on the other. The antonym of "dark matter" is not "light matter" or "bright matter", but simply "matter". The contrast is thus not between "dark" and "light", but between "dark matter" and matter proper. In fact, the modifier "dark" does not characterize a property of the matter itself; it describes an epistemic or cognitive state, more specifically a lacuna in our ability to grasp the nature of matter that we cannot perceive, in contrast to matter as we can perceive it with our senses.

Let's turn to data. Data, understood as accumulated information, has existed as long as mankind. It was crucial for our survival to produce, collect and store data in our brains: data about our environment, about the nearest source of water, about the color of edible fruits; data about the shape of a face and the shape of a fang. Data is the most natural, basic element of survival. Data as we understand it today has moved from the brain to the computer, which acts as a kind of extension of our brain with its limited capacity. Shifting its locus of storage and its means of production, however, has not diminished its importance.

Data accumulated over time and across experiences still forms the basis for informed decision making. When data becomes so voluminous and hard to handle, so varied, heterogeneous and massive that it eludes our cognitive capabilities, it becomes "dark": it escapes our senses and represents a lacuna in our knowledge, because we cannot access it. Thus, much like the modifier "dark" in "dark matter", the "dark" in "dark data" does not characterize a property of the data itself, but describes our inability to make sense of it.

Data thus becomes dark when we lose the ability to handle it. But when does this happen? Let's take a closer look at the nature of data:

The type of data can be spread out on a data continuum, with three areas:

structured               semi-structured             unstructured 

Structured data is found, for instance, in database tables or Excel spreadsheets. This type of data is formatted into a data model with a formal structure, so that its elements can be addressed, organized and accessed in various combinations. This rigid organization makes structured data easily accessible even to rather simple search algorithms; it is therefore easy to evaluate and to exploit.

Semi-structured data can be found in XML or JSON, for instance. This type of data is neither organized in the formal structure of a data model nor completely unstructured. It may contain elements that enforce hierarchies of records or constitute fields within the data, such as tags or markers that separate semantic elements. This kind of structure is also known as a self-describing structure, because the structure emerges from the composition of the data itself, in contrast to data being forced into a rigid, prescribed model (as structured data is).

Unstructured data comprises documents, in particular any kind of text including emails, as well as graphics, sensor data, videos, images, etc. This type of data is organized neither in a data model nor in any other pre-defined manner. Typically it contains a lot of text, but there may also be numbers, dates and other facts. This hodgepodge of different data entails irregularities and ambiguities that make it hard for traditional analysis tools to make sense of it.
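The three areas of the data continuum can be illustrated with a small, hypothetical sketch (all names and values below are invented for illustration):

```python
import json
import xml.etree.ElementTree as ET

# Structured: rows conforming to a fixed schema, as in a database table.
structured = [
    {"id": 1, "name": "Alice", "age": 34},
    {"id": 2, "name": "Bob", "age": 29},
]

# Semi-structured: tags and markers impose a self-describing hierarchy,
# but no rigid, prescribed schema (XML and JSON are typical carriers).
semi_structured_json = json.loads('{"person": {"name": "Alice", "emails": ["a@example.com"]}}')
semi_structured_xml = ET.fromstring("<person><name>Alice</name></person>")

# Unstructured: free text with no pre-defined organization at all.
unstructured = "Alice (34) emailed Bob yesterday about the quarterly report."

# Structured data can be addressed with simple, rigid access patterns ...
names = [row["name"] for row in structured]
# ... semi-structured data needs tag-aware navigation ...
xml_name = semi_structured_xml.findtext("name")
# ... while unstructured text offers no fields to address at all.
```

Note how each step down the continuum requires more interpretive effort from the machine: column lookup, then tag navigation, then full text understanding.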

Now that we have examined data, we can understand how it becomes dark.
Generally, dark data can emerge from any part of the data continuum: structured, semi-structured or unstructured. Most dark data, however, emerges from the unstructured area. This is because unstructured data cannot be processed using the standard data operations, such as filtering, projecting, joining, aggregating or averaging. As a result, the valuable information enclosed in the data goes unprocessed. Modern, still-developing techniques such as data mining, natural language processing and text analysis are needed to find patterns and interpret this type of data. Textual content is particularly challenging, as it requires so-called machine reading: the ability of a machine to make sense of natural language text.
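A minimal sketch (with invented example records) of why those standard operations work on structured data but have no direct counterpart on raw text:

```python
import re

records = [
    {"region": "EU", "revenue": 120},
    {"region": "US", "revenue": 200},
    {"region": "EU", "revenue": 80},
]

# Filtering: keep only the EU rows.
eu = [r for r in records if r["region"] == "EU"]
# Projecting: select a single column.
revenues = [r["revenue"] for r in eu]
# Aggregating and averaging.
total = sum(revenues)
average = total / len(revenues)

# The same information locked in unstructured text defeats these operators:
report = "EU revenue was 120 in spring and 80 in autumn; the US brought in 200."
# There is no "revenue column" to project here. Even pulling out the raw
# numbers already requires pattern matching (a toy stand-in for real text
# analysis), and attaching the right region to each number would require
# actual language understanding.
numbers = [int(n) for n in re.findall(r"\d+", report)]
```

The regular expression recovers the numbers, but not their meaning; that semantic gap is exactly where machine reading techniques come in.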

Data thus becomes dark when:

  • it is left behind by the processes that produced it
  • it is dismissed as valueless, although it would be highly valuable if analyzed properly
  • there is no tool to capture and unlock the information hidden in it
  • its sheer amount exceeds the available analysis capacity
  • the available or feasible methods of analysis can only access structured data sets

Dark matter is matter that is invisible, or dark, to standard astronomical equipment because it does not seem to interact with observable electromagnetic radiation such as light. "Dark data" is similar: it is invisible to the standard tools and instruments for analytical processing, which have been developed mainly for structured data. And there is a further interesting parallel: while dark matter is believed to account for 80% of all the matter in the universe, unstructured data is commonly estimated to amount to 80% of a company's data.

Are you ready to bring light into your dark data? Semalytix can help.