Why data goes dark and why this matters

Bielefeld
July 3rd, 2018

There is an interesting parallel between the pair of matter and dark matter on the one side and data and dark data on the other. The antonym of "dark matter" is not "light matter" or "bright matter", but simply "matter". The contrast is thus not between "dark" and "light", but between "dark matter" and matter proper. In fact, the modifier "dark" does not characterize a property of the matter itself; it describes an epistemic or cognitive state, more specifically a lacuna in our ability to grasp the nature of matter that we cannot perceive, in contrast to matter as we can perceive it with our senses.

Let's turn to data. Data, understood as accumulated information, has existed as long as mankind. It was crucial for our survival to produce, collect and store data in our brains: data about our environment, about the nearest source of water, about the color of edible fruits; data about the shape of a face and the shape of a fang. Data is the most natural, basic element of survival. Data as we understand it today has moved from the brain to the computer, which acts as a kind of extension of our brain with its limited capacity. Shifting its locus of storage and its means of production, however, has not diminished its importance.

Data accumulated over time and across experiences still forms the basis for informed decision making. When data becomes so voluminous and hard to handle, so varied, heterogeneous and massive that it eludes our cognitive capabilities, it becomes "dark": it escapes our senses and represents a lacuna in our knowledge, because we cannot access it. Thus, much like the modifier "dark" in "dark matter", the "dark" in "dark data" does not characterize a property of the data itself, but describes our inability to make sense of it.

Data thus becomes dark when we lose the ability to handle it. But when does this happen? Let's take a closer look at the nature of data:

The type of data can be spread out on a data continuum, with three areas:

structured               semi-structured             unstructured 

Structured data is found, for instance, in database tables or Excel spreadsheets. This type of data is formatted into a data model with a formal structure, so that its elements can be addressed, organized and accessed in various combinations. This rigid organization makes structured data easily accessible even to rather simple search algorithms; it is therefore easy to evaluate and to exploit.

Semi-structured data can be found in XML or JSON, for instance. This type of data is neither organized in the formal structure of a data model nor completely unstructured. It may contain elements that enforce hierarchies of records or constitute fields within the data, such as tags or markers that separate semantic elements. This kind of structure is also known as a self-describing structure, because the structure emerges from the composition of the data itself, in contrast to data being forced into a rigid, prescribed model (as structured data is).

Unstructured data comprises documents, in particular any kind of text including emails, as well as graphics, sensor data, videos, images, etc. This type of data is organized neither in a data model nor in any other pre-defined manner. Typically it contains a lot of text, but there may also be numbers, dates and other facts. This hodgepodge of different data entails irregularities and ambiguities that make it hard for traditional analysis tools to make sense of it.
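The three areas of the data continuum can be illustrated with a small, hypothetical sketch (all names and values below are invented for illustration):

```python
import json
import xml.etree.ElementTree as ET

# Structured: rows conforming to a fixed schema, as in a database table.
structured = [
    {"id": 1, "name": "Alice", "age": 34},
    {"id": 2, "name": "Bob", "age": 29},
]

# Semi-structured: tags and markers impose a self-describing hierarchy,
# but no rigid, prescribed schema (XML and JSON are typical carriers).
semi_structured_json = json.loads('{"person": {"name": "Alice", "emails": ["a@example.com"]}}')
semi_structured_xml = ET.fromstring("<person><name>Alice</name></person>")

# Unstructured: free text with no pre-defined organization at all.
unstructured = "Alice (34) emailed Bob yesterday about the quarterly report."

# Structured data can be addressed with simple, rigid access patterns ...
names = [row["name"] for row in structured]
# ... semi-structured data needs tag-aware navigation ...
xml_name = semi_structured_xml.findtext("name")
# ... while unstructured text offers no fields to address at all.
```

Note how each step down the continuum requires more interpretive effort from the machine: column lookup, then tag navigation, then full text understanding.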

Now that we have examined data, we can understand how it becomes dark.
Generally, dark data can emerge from any part of the data continuum: structured, semi-structured or unstructured. Most dark data, however, emerges from the unstructured area. This is because unstructured data cannot be processed using the standard data operations, such as filtering, projecting, joining, aggregating or averaging. As a result, the valuable information enclosed in the data goes unprocessed. Modern, still-developing techniques such as data mining, natural language processing and text analysis are needed to find patterns and interpret this type of data. Textual content is particularly challenging, as it requires so-called machine reading: the ability of a machine to make sense of natural language text.
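A minimal sketch (with invented example records) of why those standard operations work on structured data but have no direct counterpart on raw text:

```python
import re

records = [
    {"region": "EU", "revenue": 120},
    {"region": "US", "revenue": 200},
    {"region": "EU", "revenue": 80},
]

# Filtering: keep only the EU rows.
eu = [r for r in records if r["region"] == "EU"]
# Projecting: select a single column.
revenues = [r["revenue"] for r in eu]
# Aggregating and averaging.
total = sum(revenues)
average = total / len(revenues)

# The same information locked in unstructured text defeats these operators:
report = "EU revenue was 120 in spring and 80 in autumn; the US brought in 200."
# There is no "revenue column" to project here. Even pulling out the raw
# numbers already requires pattern matching (a toy stand-in for real text
# analysis), and attaching the right region to each number would require
# actual language understanding.
numbers = [int(n) for n in re.findall(r"\d+", report)]
```

The regular expression recovers the numbers, but not their meaning; that semantic gap is exactly where machine reading techniques come in.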

Data thus becomes dark when:

  • it is left behind by the processes that produced it
  • it is dismissed as valueless, although it would be highly valuable if analyzed properly
  • there is no tool to capture and unlock the information hidden in it
  • its sheer amount exceeds the available analysis capacity
  • the available or feasible methods of analysis can only access structured data sets

Dark matter is matter that is invisible, or dark, to standard astronomical equipment because it does not seem to interact with observable electromagnetic radiation such as light. "Dark data" is similar: it is invisible to the standard tools and instruments for analytical processing, which have been developed mainly for structured data. And there is a further interesting parallel: while dark matter is believed to account for 80% of all the matter in the universe, unstructured data is commonly estimated to amount to 80% of a company's data.

Are you ready to bring light into your dark data? Semalytix can help.