Privacy Engineering: 6 key lessons for data practitioners

By Jason McFall - July 05, 2016

At the recent Strata + Hadoop World Conference in London, I gave a presentation about Protecting Individual Privacy in a Data-driven World. If you’re interested but couldn’t attend, you can watch a video here. Otherwise, read on for my 6 key lessons for data practitioners.

During my presentation, I talked about the great benefits that can come from sharing and analysing rich and sometimes sensitive data about people, but pointed out the clear risks to privacy if data is shared without sufficient care. I’m optimistic that we can use technology to address these risks, and I talked about practical measures including:

  • tokenisation and masking
  • statistical generalisation and blurring of data (such as k-anonymity)
  • differential privacy, which can be applied at the point of data collection or data analysis
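To make that last technique concrete, here is a minimal sketch of differential privacy applied at the point of analysis: a count query answered with Laplace noise calibrated to the query's sensitivity and a privacy budget epsilon. The dataset, field names, and epsilon value are illustrative assumptions, not a production recipe.

```python
import math
import random

def laplace_noise(scale):
    """Draw one sample from Laplace(0, scale) via inverse-CDF sampling."""
    u = random.random() - 0.5
    return -scale * math.copysign(1.0, u) * math.log(1 - 2 * abs(u))

def dp_count(records, predicate, epsilon):
    """Differentially private count: true count plus Laplace(1/epsilon) noise.

    A counting query has sensitivity 1, because adding or removing one
    person changes the count by at most 1."""
    true_count = sum(1 for r in records if predicate(r))
    return true_count + laplace_noise(1.0 / epsilon)

# Hypothetical example: count patients over 60 with a budget of epsilon = 0.5.
patients = [{"age": a} for a in (34, 67, 71, 45, 62, 58)]
noisy = dp_count(patients, lambda r: r["age"] > 60, epsilon=0.5)
```

Smaller epsilon means more noise and stronger privacy; the analyst sees only the noisy answer, never the underlying records.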

I concluded the presentation with six clear and practical lessons for data practitioners who want to ensure they do the right thing and protect privacy. There was much interest in this, so I wanted to share those lessons here and continue the discussion.

My 6 key lessons for data practitioners:

1) Only store data you need

2) Always remove primary identifiers
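As a minimal sketch of what this can look like in practice: replace primary identifiers with consistent, non-reversible tokens using keyed hashing (HMAC), so joins across tables still work but the raw identifier never leaves your system. The field names and key handling here are illustrative assumptions.

```python
import hmac
import hashlib

# Hypothetical key; in practice load it from a key vault, never hard-code it.
SECRET_KEY = b"store-me-in-a-key-vault"

def tokenise(identifier):
    """Map a primary identifier to a consistent, non-reversible token."""
    return hmac.new(SECRET_KEY, identifier.encode(), hashlib.sha256).hexdigest()[:16]

def strip_identifiers(record):
    """Drop direct identifiers and tokenise the primary key."""
    safe = {k: v for k, v in record.items() if k not in {"name", "email", "phone"}}
    safe["user_id"] = tokenise(record["user_id"])
    return safe

record = {"user_id": "u-1001", "name": "Alice", "email": "a@example.com", "purchases": 7}
safe_record = strip_identifiers(record)
```

Because the hash is keyed, an attacker who obtains the tokens cannot simply hash a list of known identifiers to reverse them, as they could with a plain unsalted hash.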

3) Aggregate and statistically anonymise data before sharing - you can’t foresee all future datasets and potential linkage risks, nor the future power of machine learning applied to this data
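Statistical generalisation, such as the k-anonymity approach mentioned earlier, can be sketched like this: blur quasi-identifiers (bucket ages into bands, truncate postcodes) and check that every combination of blurred values occurs at least k times. The fields, band width, and k value are illustrative assumptions.

```python
from collections import Counter

def generalise(record):
    """Blur quasi-identifiers: 10-year age bands, postcode prefix only."""
    band_start = (record["age"] // 10) * 10
    return {
        "age_band": f"{band_start}-{band_start + 9}",
        "postcode": record["postcode"][:3],  # keep a coarse prefix only
    }

def is_k_anonymous(records, k):
    """True if every quasi-identifier combination appears at least k times."""
    groups = Counter(tuple(sorted(generalise(r).items())) for r in records)
    return all(count >= k for count in groups.values())

people = [
    {"age": 34, "postcode": "SW1A 1AA"},
    {"age": 36, "postcode": "SW1B 2BB"},
    {"age": 31, "postcode": "SW1C 3CC"},
]
```

If the check fails, you generalise further (wider bands, shorter prefixes) or suppress the outlying records until each individual hides in a group of at least k.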

4) If data is too complex to anonymise, extract the features of interest, then anonymise and share only those that are required
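An illustrative sketch of that idea: rather than attempting to anonymise a rich transaction log, derive only the per-user features the analysis actually needs and share those instead. The record shape and feature names are assumptions for the example.

```python
from collections import defaultdict

def extract_features(transactions):
    """Reduce a raw transaction log (too rich to safely anonymise) to the
    handful of per-user features needed downstream."""
    features = defaultdict(lambda: {"n_purchases": 0, "total_spend": 0.0})
    for t in transactions:
        f = features[t["user_id"]]
        f["n_purchases"] += 1
        f["total_spend"] += t["amount"]
    return dict(features)

transactions = [
    {"user_id": "t1", "amount": 10.0},
    {"user_id": "t1", "amount": 5.0},
    {"user_id": "t2", "amount": 3.0},
]
features = extract_features(transactions)
```

The extracted features should then themselves be anonymised (tokenised keys, generalised values) before sharing; extraction alone only shrinks the attack surface.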

5) Even better, don’t share the data itself; instead, allow secure queries against it
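One simple form a secure query interface can take (an illustrative sketch, not a full design) is an object that holds the records internally, answers only aggregate queries, and suppresses results computed over too few individuals. The threshold value here is an assumption.

```python
MIN_GROUP_SIZE = 5  # illustrative suppression threshold

class SecureQueryInterface:
    """Answer aggregate queries without ever releasing the raw records."""

    def __init__(self, records):
        self._records = records  # never returned to the caller

    def count(self, predicate):
        """Return a count, or None when too few individuals match."""
        n = sum(1 for r in self._records if predicate(r))
        return n if n >= MIN_GROUP_SIZE else None

db = SecureQueryInterface([{"age": a} for a in range(20, 40)])
```

A real deployment would go further, for example by adding differential-privacy noise to the returned counts and tracking each analyst's cumulative privacy budget.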

6) Be open and clear about how you protect and use private data

I will explore each of these lessons in more detail, discussing real-world examples and giving some insight into what could go wrong and what to do instead.