Only store data you need

By Jason McFall - July 14, 2016


I was very lucky to spend part of my university studies in Hamburg, and I have a fondness for charming German compound nouns. A new favourite is Datensparsamkeit, which might be translated as ‘data frugality’. In practical terms it means thinking carefully about the data you are capturing, and not gratuitously storing personal data that you don’t need, just because it might be useful one day. English speakers have settled on ‘data minimisation’, which works but doesn’t quite capture the idea that Datensparsamkeit is an attitude as well as a practice.

As a very simple example, Facebook strips metadata from the photographs users post, because it could identify the time and location of the photograph. A data scientist performing a marketing analysis might be interested in a customer’s age, gender, city and purchase history, but they certainly don’t need to store the customer’s name, date of birth, precise address or payment details in their analytics datastore.
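To make the marketing example concrete, here is a minimal sketch of an allow-list applied before a record reaches the analytics datastore. The field names and the `minimise` helper are hypothetical, chosen only for illustration; the point is that the age is derived and the date of birth is then discarded, along with everything else not on the list.

```python
from datetime import date

# Hypothetical field names: the only attributes the analysis needs as-is.
ANALYTICS_FIELDS = ("gender", "city", "purchase_history")


def age_in_years(date_of_birth: date, today: date) -> int:
    """Whole years between date_of_birth and today."""
    had_birthday = (today.month, today.day) >= (date_of_birth.month, date_of_birth.day)
    return today.year - date_of_birth.year - (0 if had_birthday else 1)


def minimise(record: dict, today: date) -> dict:
    """Return a copy holding only what the analysis needs.

    Name, date of birth, address and payment details are never copied;
    only the derived age is kept.
    """
    slim = {field: record[field] for field in ANALYTICS_FIELDS}
    slim["age"] = age_in_years(record["date_of_birth"], today)
    return slim
```

Using an allow-list rather than a block-list means that a new personal field added upstream is dropped by default instead of leaking into the analytics store.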

The Datensparsamkeit principle is in clear conflict with one of the early attractions of big data technologies such as Hadoop - creating a data lake with as much detailed data as possible, because you can’t anticipate every analysis you might want to perform or predict what interesting patterns a smart data scientist might find in the data. But exactly the same argument applies to privacy risk: you can’t anticipate what private information someone might find in that data either.

One problem with highly granular transactional or behavioural data is that it’s hard to anticipate what private information it contains that might one day be discovered. Consider the location measurements captured by your phone provider as your mobile phone connects to cell towers. These can be used to build up trajectories of how you move around: where you are at what time of day, where you live, work and spend your free time, and who you associate with. This data can be extremely sensitive, which is why responsible telecom providers strip it of identifiers and keep it only for the limited period needed to ensure service quality.
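The two safeguards mentioned for location data, stripping identifiers and limiting retention, can be sketched in a few lines. Everything here is an assumption for illustration: the field names, the 30-day window and the `strip_and_expire` helper do not come from any real provider's system. Note also that removing direct identifiers is a first step and not, by itself, anonymisation.

```python
from datetime import datetime, timedelta

# Assumed values, for illustration only.
RETENTION = timedelta(days=30)
IDENTIFIER_FIELDS = {"subscriber_id", "phone_number"}


def strip_and_expire(events: list, now: datetime) -> list:
    """Drop events older than the retention window and remove
    direct identifiers from the ones that remain."""
    cutoff = now - RETENTION
    return [
        {key: value for key, value in event.items() if key not in IDENTIFIER_FIELDS}
        for event in events
        if event["timestamp"] >= cutoff
    ]
```

In practice a job like this would run on a schedule, so that expired measurements are actually deleted and not merely filtered out at query time.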

Start with frugality

The good news is that the highly granular transaction or interaction data that might contain undiscovered valuable information is typically high volume and often high velocity, rapidly becoming stale. So it’s better to start from a principle of frugality - if you later identify a justified need for some granular data which you weren’t storing, once you connect to the firehose that data will accumulate rapidly anyway.

Among our customers we see use of big data technologies maturing beyond the early unstructured data lake phase to be more disciplined, with supporting technologies and practices from the relational database and data warehouse world being carried over to big data. Data lineage, data quality, master data management, encryption and security are all becoming essential for Hadoop and big data technologies. And stronger regulation and the financial and reputational risk of a data breach make it impossible to justify storing personal data ‘just in case’.

A corollary to not storing personal data you don’t need is not sharing data the recipient doesn’t need. Don’t do the lazy thing and just give out a data extract - think carefully about what data is needed, and what it is to be used for, and share only what is required. And of course anonymise it before you share it - we’ll talk about how to do that over the course of the next few weeks.
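As a rough sketch of sharing only what is required, the extract below is written from an explicit list of fields agreed with the recipient, using Python’s standard `csv` module. The `write_extract` helper and the column names are hypothetical, and this covers only the column selection, not the anonymisation step.

```python
import csv
import io


def write_extract(rows, needed_fields, out):
    """Write only the agreed columns; any other keys in the rows are ignored."""
    writer = csv.DictWriter(out, fieldnames=needed_fields, extrasaction="ignore")
    writer.writeheader()
    writer.writerows(rows)


# Example: the recipient only needs city-level spend.
rows = [{"name": "Erika Mustermann", "city": "Hamburg", "total_spend": 120}]
buffer = io.StringIO()
write_extract(rows, ["city", "total_spend"], buffer)
```

Making the field list an explicit argument forces the conversation about what the recipient needs to happen before the extract is produced.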

Post 1 in the series introduced by this post: Privacy Engineering: 6 key lessons for data practitioners