Dark data is information that a company gathers but which falls outside its day-to-day operations. It is often described as data that people hoard because they think it may be useful one day. The term ‘dark data’ can also be applied to information stored on a device that is no longer in use or is otherwise inaccessible.

Examples of dark data include server log files that highlight website visitor behaviour, customer call detail records that indicate consumer feedback, and mobile geolocation data that reveals traffic patterns useful for business planning. If harnessed correctly, this abundance of seemingly superfluous information can be used to drive internal revenue streams. However, bringing old data to light isn’t without risk.

Large amounts of dark data are not visible to data administrators, because they exist primarily in personal files whose content is managed directly by individuals rather than by any corporate application. This raises the question of quality. Rehashing old information for research or publication purposes, especially if it has been gleaned from the internet, is potentially problematic: very often the source of the information isn’t known, which makes it impossible to verify.

When dark data is brought to light, its original meaning is frequently lost in translation, which may cause compliance issues if it is later published. And what is to stop private or confidential information being copied into spreadsheets unwittingly? That raises the compliance issue once again. The underlying problem is that dark data is very often created using logic that was only ever designed to be understood by its creator, so a simple misunderstanding a long way down the line could be very damaging.

It’s gradually being recognised that dark data is a potential source of great peril as well as value, so much so that financial regulators are reportedly becoming more aware of, and concerned about, the risk inherent in spreadsheet models. Sooner rather than later, regulators, along with disciplines such as Enterprise Information Management (EIM), will have no choice but to confront the prevalence of sleeping data which is best left to lie.