Mar 01, 2018
It’s the age of big data. With datasets growing ever more complex, and ever more information being made publicly available, protecting sensitive information has never been harder. The good news: the methods to protect data are becoming more sophisticated, too. Here’s the first in a series of posts looking at different privacy challenges – and the appropriate techniques to address them.
In 1997, the Massachusetts state Group Insurance Commission (GIC) released hospital visit data which included each patient’s 5-digit zip code, gender, and date of birth. Governor Weld of Massachusetts reassured the public that the data was adequately ‘anonymised’, as primary identifiers (such as name and address) had been deleted. He was quickly proven wrong. Latanya Sweeney, an MIT graduate student, managed to find Governor Weld’s personal health records by combining the GIC data with an electoral roll database she bought for $20. She went on to show that around 87 percent of the US population can be uniquely identified with just their 5-digit zip code, gender, and date of birth (which are all publicly available).
This goes to show that traditional notions of what constitutes an anonymised dataset can quickly break down. More than 20 years later, in the age of big data, attackers have access to more datasets and more powerful tools for linking and analysing them. Let’s look at one of the most common techniques that businesses use to protect sensitive information: data masking.
Most organisations, public or private, that process personal or other sensitive information do some kind of data masking, i.e. they protect the primary identifiers in the data. Take a hypothetical patient, Alice, and a simple medical diagnosis dataset as an example:
The organisation doesn’t want to violate Alice’s privacy by letting their data analysts see her medical diagnosis, but they want to use her data to learn how profession influences disease. To keep Alice safe, they change her record to:
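The post shows the before-and-after records as tables. As a minimal sketch of the same masking step (the field names and values below are hypothetical, chosen to match the attributes the post mentions):

```python
# Hypothetical record for Alice; values are illustrative, not from the post.
alice = {
    "name": "Alice",                      # primary identifier
    "date_of_birth": "1983-05-14",        # quasi-identifier
    "postcode": "NW10 3AB",               # quasi-identifier
    "gender": "F",                        # quasi-identifier
    "profession": "Mechanical engineer",  # quasi-identifier
    "diagnosis": "Asthma",                # sensitive attribute (made up)
}

PRIMARY_IDENTIFIERS = {"name"}

def mask(record):
    """Basic data masking: drop primary identifiers, keep everything else."""
    return {k: v for k, v in record.items() if k not in PRIMARY_IDENTIFIERS}

masked = mask(alice)
```

Note that `masked` still contains the date of birth, postcode, gender, and profession untouched – which is exactly the problem discussed next.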
If no other data about Alice existed, this record would now be anonymised. In reality, there are still four quasi-identifiers that link Alice to her sensitive disease attribute – date of birth, postcode, gender, and profession. This means that she, and this entire ‘masked’ dataset, are vulnerable to a linkage attack, in which an attacker can use other available datasets to identify individuals – as was the case in Massachusetts.
Clearly, Alice and the dataset aren’t safe yet. One way organisations try to make the data safer is by introducing a rule to generalise specific attributes like date of birth and postcode. This turns the data into a range, e.g. a year instead of a date of birth, a more general postcode instead of a specific one:
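A rule-based generalisation step of this kind might look like the sketch below. The record and the specific rules (keep only the birth year; keep only the outward part of the UK postcode) are illustrative assumptions:

```python
# Rule-based generalisation: coarsen each quasi-identifier with a fixed,
# attribute-level rule. The record and rules here are illustrative.
record = {
    "date_of_birth": "1983-05-14",
    "postcode": "NW10 3AB",
    "gender": "F",
    "profession": "Mechanical engineer",
    "diagnosis": "Asthma",
}

def generalise(rec):
    out = dict(rec)
    out["date_of_birth"] = rec["date_of_birth"][:4]  # full date -> birth year
    out["postcode"] = rec["postcode"].split()[0]     # full postcode -> outward code
    return out

generalised = generalise(record)
```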
This is still a relatively haphazard approach to privacy and data utility. Because the rules are fixed per attribute, they can over-generalise common records (destroying utility for no privacy gain) while still leaving unusual individuals, like Alice, unique and re-identifiable.
K-anonymity, invented by Latanya Sweeney, is a type of generalisation that’s designed to make your dataset as valuable as possible for a configured level of privacy. It works by making sure that individuals are ‘hiding in a crowd’ of at least size k. So for k-anonymity to be achieved, there need to be at least k individuals in the dataset who share the same set of attributes that might become identifying for each individual, thereby thwarting linkage attacks. This makes the generalisation strategy targeted, so that if Alice is indeed the only female mechanical engineer in our dataset who was born in 1983 and lives in NW10, we will either generalise her profession to something like ‘engineer’ if that leads to a crowd of k, or we will suppress her record. Consider the following simplified dataset:
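The property itself is easy to state in code. Below is a minimal check, using an illustrative stand-in for the simplified dataset the post describes (postcode and gender as quasi-identifiers, disease as the sensitive value):

```python
from collections import Counter

# Illustrative stand-in dataset: (postcode, gender) are quasi-identifiers,
# disease is the sensitive attribute. Values are made up.
rows = [
    {"postcode": "NW10", "gender": "F", "disease": "Asthma"},
    {"postcode": "NW10", "gender": "M", "disease": "Flu"},
    {"postcode": "NW11", "gender": "F", "disease": "Diabetes"},
    {"postcode": "NW11", "gender": "M", "disease": "Flu"},
    {"postcode": "NW10", "gender": "F", "disease": "Flu"},
    {"postcode": "NW11", "gender": "M", "disease": "Asthma"},
]

QUASI_IDENTIFIERS = ("postcode", "gender")

def is_k_anonymous(rows, k, quasi=QUASI_IDENTIFIERS):
    """True iff every combination of quasi-identifier values occurs in at
    least k rows, i.e. every individual hides in a crowd of size k."""
    counts = Counter(tuple(r[q] for q in quasi) for r in rows)
    return all(c >= k for c in counts.values())
```

In this toy data, two (postcode, gender) groups contain only one person each, so the dataset is 1-anonymous but not 2-anonymous.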
To create a safe version of the data with a privacy level of k=3, we need to generalise the quasi-identifiers to create ‘crowds’ of three. We will generalise by treating postcode and gender as our quasi-identifiers, and disease as our sensitive value.
You can see that in the resulting table every combination of postcode and gender appears at least three times, so no single record can be picked out by those attributes alone.
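One way to reach a target k mechanically is to generalise the quasi-identifiers step by step, then suppress any records still left in a crowd smaller than k. A toy sketch of that loop – the data and the generalisation "ladder" are hypothetical, and real tools search for the generalisation that preserves the most utility:

```python
from collections import Counter

# Toy sketch: generalise (postcode, gender) level by level until surviving
# records hide in crowds of at least k; suppress the stragglers.
rows = [
    {"postcode": "NW10 3AB", "gender": "F", "disease": "Asthma"},
    {"postcode": "NW10 4CD", "gender": "M", "disease": "Flu"},
    {"postcode": "NW10 5EF", "gender": "F", "disease": "Flu"},
    {"postcode": "NW10 6GH", "gender": "M", "disease": "Asthma"},
    {"postcode": "NW11 1JK", "gender": "F", "disease": "Diabetes"},
    {"postcode": "NW10 7LM", "gender": "M", "disease": "Flu"},
]

def generalise_postcode(pc, level):
    # level 0: full postcode; 1: outward code ("NW10"); 2: area ("NW")
    if level == 0:
        return pc
    if level == 1:
        return pc.split()[0]
    return pc.split()[0].rstrip("0123456789")

def generalise_gender(g, level):
    return g if level == 0 else "*"  # level 1 collapses gender entirely

def k_anonymise(rows, k, max_level=2):
    for level in range(max_level + 1):
        gen = [
            {**r,
             "postcode": generalise_postcode(r["postcode"], level),
             "gender": generalise_gender(r["gender"], min(level, 1))}
            for r in rows
        ]
        counts = Counter((r["postcode"], r["gender"]) for r in gen)
        # Suppress records whose quasi-identifier group is smaller than k.
        kept = [r for r in gen if counts[(r["postcode"], r["gender"])] >= k]
        if kept:  # accept the first level that leaves usable data
            return kept
    return []

safe = k_anonymise(rows, k=3)
```

Here the full postcodes are all unique, so level 0 yields nothing; at level 1 the five NW10 records form a crowd of five, and the lone NW11 record is suppressed.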
So, have we solved privacy with the silver bullet of k-anonymity? Not quite. What happens if every individual in your dataset is so distinct that their data can’t be reasonably generalised? Think about the location data that is collected from your phone. No one in the world follows a similar path to you, and it would only take a few known points (e.g. where you live, where you work, where I saw you at a coffee shop at 3 p.m. yesterday) to find your complete geospatial history in a dataset. Generalising location data tends to destroy utility quickly (e.g. generalising to 8-hour, 1 km square tiles). The same is true of credit card transactions. Protecting high-dimensional, sparse data requires a different approach. In the next post in the series, we’ll talk about one of those approaches: differential privacy.
Stay tuned – or subscribe to our blog to make sure you don’t miss content like this.
Our team of data security and privacy experts are here to answer your questions and discuss how modern data provisioning can fuel business growth.