It’s the age of big data. With ever more complex datasets around, and ever more information being made publicly available, protecting sensitive information has never been harder. The good news: the methods to protect data are becoming more sophisticated, too. Here’s the first in a series of posts looking at different privacy challenges – and the appropriate techniques to address them.

In 1997, the Massachusetts state Group Insurance Commission (GIC) released hospital visit data which included each patient’s 5-digit zip code, gender, and date of birth. Governor Weld of Massachusetts reassured the public that the data was adequately ‘anonymised’, as primary identifiers (such as name and address) had been deleted. He was quickly proven wrong. Latanya Sweeney, an MIT graduate student, managed to find Governor Weld’s personal health records by combining the GIC data with an electoral roll database she bought for $20. She went on to show that around 87 percent of the US population can be uniquely identified with just their 5-digit zip code, gender, and date of birth (which are all publicly available).

Linkage attacks leverage quasi-identifiers that are common to different datasets

This goes to show that traditional notions of what constitutes an anonymised dataset can quickly break down. More than 20 years later, in the age of big data, attackers have access to more datasets and more powerful tools for linking and analysing them. Let’s look at one of the most common techniques that businesses use to protect sensitive information: data masking.

The problem with basic masking

Most organisations, public or private, that process personal or other sensitive information do some kind of data masking, i.e. they protect the primary identifiers in the data. Take a hypothetical patient, Alice, and a simple medical diagnosis dataset as an example:

Name | Date of Birth | Postcode | Gender | Profession | Disease
Alice Smith | 13/10/1983 | NW10 8FN | Female | Mechanical Engineer | Respiratory

The organisation doesn’t want to violate Alice’s privacy by letting their data analysts see her medical diagnosis, but they want to use her data to learn how profession influences disease. To keep Alice safe, they change her record to:

Name | Date of Birth | Postcode | Gender | Profession | Disease
TI-7492834 | 13/10/1983 | NW10 8FN | Female | Mechanical Engineer | Respiratory

If no other data about Alice existed, this record would now be anonymised. In reality, there are still four quasi-identifiers that link Alice to her sensitive disease attribute – date of birth, postcode, gender, and profession. This means that she, and this entire ‘masked’ dataset, are vulnerable to a linkage attack, in which an attacker can use other available datasets to identify individuals – as was the case in Massachusetts.
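A masking step like this can be sketched in a few lines. This is a minimal illustration, not any particular product’s implementation: it assumes a salted-hash token scheme (real systems typically use a secure tokenisation vault), and the salt value is obviously a placeholder.

```python
import hashlib

def mask_record(record, salt="not-a-real-secret"):
    """Replace the primary identifier (name) with a pseudonymous token."""
    digest = hashlib.sha256((salt + record["Name"]).encode()).hexdigest()
    masked = dict(record)
    masked["Name"] = "TI-" + digest[:7]
    return masked

alice = {
    "Name": "Alice Smith",
    "Date of Birth": "13/10/1983",
    "Postcode": "NW10 8FN",
    "Gender": "Female",
    "Profession": "Mechanical Engineer",
    "Disease": "Respiratory",
}
masked = mask_record(alice)
# The primary identifier is gone, but every quasi-identifier survives intact.
print(masked["Name"].startswith("TI-"), masked["Postcode"])  # True NW10 8FN
```

Note what the final line demonstrates: the name is unrecoverable, but date of birth, postcode, gender, and profession pass through untouched.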

Clearly, Alice and the dataset aren’t safe yet. One way organisations try to make the data safer is by introducing a rule to generalise specific attributes like date of birth and postcode. This replaces precise values with broader ones, e.g. a year of birth instead of a full date, an outward postcode instead of a full one:

Name | Date of Birth | Postcode | Gender | Profession | Disease
TI-7492834 | 1983 | NW10 | Female | Mechanical Engineer | Respiratory
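Rule-based generalisation of this kind is easy to sketch. The two rules below (keep only the birth year; truncate the postcode to its outward code) are illustrative assumptions, not a standard set:

```python
def generalise(record):
    """Apply fixed generalisation rules to a record's quasi-identifiers."""
    out = dict(record)
    out["Date of Birth"] = record["Date of Birth"].split("/")[-1]  # 13/10/1983 -> 1983
    out["Postcode"] = record["Postcode"].split(" ")[0]             # NW10 8FN -> NW10
    return out

row = {"Date of Birth": "13/10/1983", "Postcode": "NW10 8FN",
       "Gender": "Female", "Profession": "Mechanical Engineer"}
print(generalise(row))
```

Notice that the rules are static: they apply the same coarsening to every record regardless of how identifiable that record actually is, which is exactly the weakness discussed next.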

This is still a relatively haphazard approach to privacy and data utility. Here’s why:

  • From a privacy perspective, this approach is difficult to scale. Every quasi-identifying attribute across an organisation’s datasets would need a rule that accounts for possible linkage to external datasets (both public and leaked, both current and future). And even in this generalised state, with no generalisation rule for profession, Alice may still be the only female mechanical engineer born in 1983 living in NW10 – and therefore identifiable.
  • From a utility perspective, it removes information from the dataset in a rigid, ‘weakest link’ manner: the rule applied to every record is (or should be) the rule needed to keep the most vulnerable record safe. To make the dataset low-risk in a way that optimises both utility and privacy, we need to turn to a more sophisticated solution.

The need for k-anonymity

K-anonymity, invented by Latanya Sweeney, is a type of generalisation designed to make your dataset as valuable as possible for a configured level of privacy. It works by making sure that individuals are ‘hiding in a crowd’ of at least size k: for k-anonymity to hold, every combination of potentially identifying attributes must be shared by at least k individuals in the dataset, thereby thwarting linkage attacks. This makes the generalisation strategy targeted. If Alice is indeed the only female mechanical engineer in our dataset who was born in 1983 and lives in NW10, we will either generalise her profession to something like ‘engineer’ if that yields a crowd of k, or we will suppress her record. Consider the following simplified dataset:

Name | Postcode | Gender | Disease
Sebastian Cushman | SW1 4ZE | Male | Respiratory
Reece Glasser | SW1 2HY | Male | No Illness
Tilly Gelberman | NW10 8FN | Female | Cancer
Alice Smith | NW10 8FN | Female | Respiratory
Elise Leavitt | NW10 8FN | Female | Cardiovascular
Morgan Gwinn | E17 9QY | Male | Respiratory
George Knight | E17 3SF | Male | Liver
Sienna Davidsen | E17 5WD | Female | Cancer

To create a safe version of the data with a privacy level of k=3, we need to generalise the quasi-identifiers to create ‘crowds’ of three. We will generalise by treating postcode and gender as our quasi-identifiers, and disease as our sensitive value.

Name | Postcode | Gender | Disease
* | SW1 * | Male | Respiratory
* | SW1 * | Male | No Illness
* | NW10 8FN | Female | Cancer
* | NW10 8FN | Female | Respiratory
* | NW10 8FN | Female | Cardiovascular
* | E17 * | * | Respiratory
* | E17 * | * | Liver
* | E17 * | * | Cancer

You can see:

  1. We generalised as little as possible. Since we had three people who were female living in NW10 8FN, we were able to maintain the full utility of the postcode and gender values, while still providing the k=3 privacy guarantee.
  2. We generalised as much as necessary. For the individuals living in the E17 * postcodes, we completely suppressed gender from our output because there weren’t enough people of each gender to safely reveal it.
  3. k-anonymity produces more accurate results as your data grows. With more records, it becomes easier to group individuals by similar quasi-identifiers, so the values need less and less distortion – meaning you can run analytics on a reduced-risk dataset with ever greater accuracy.
    (For a deep-dive, here’s my colleague Will’s introduction to k-anonymity)
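The property itself is simple to verify. Here is a minimal sketch of a k-anonymity check over a chosen set of quasi-identifiers; the rows below are the NW10 and E17 groups from the generalised table (a subset, for brevity), and the function name and structure are assumptions for illustration:

```python
from collections import Counter

def is_k_anonymous(rows, quasi_identifiers, k):
    """True if every quasi-identifier combination appears at least k times."""
    groups = Counter(tuple(r[q] for q in quasi_identifiers) for r in rows)
    return all(count >= k for count in groups.values())

generalised = [
    {"Postcode": "NW10 8FN", "Gender": "Female", "Disease": "Cancer"},
    {"Postcode": "NW10 8FN", "Gender": "Female", "Disease": "Respiratory"},
    {"Postcode": "NW10 8FN", "Gender": "Female", "Disease": "Cardiovascular"},
    {"Postcode": "E17 *", "Gender": "*", "Disease": "Respiratory"},
    {"Postcode": "E17 *", "Gender": "*", "Disease": "Liver"},
    {"Postcode": "E17 *", "Gender": "*", "Disease": "Cancer"},
]
print(is_k_anonymous(generalised, ["Postcode", "Gender"], k=3))  # True
```

If any group falls below k, the remedy is exactly what the example above showed: generalise the quasi-identifiers further, or suppress the offending records.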

The complexity of big data

So, have we solved privacy with the silver bullet of k-anonymity? Not quite. What happens if every individual in your dataset is so distinct that their data can’t reasonably be generalised? Think about the location data collected from your phone. No one in the world follows a path quite like yours, and it would only take a few known points (where you live, where you work, where I saw you at a coffee shop at 3 p.m. yesterday) to pick out your complete geospatial history in a dataset. Generalising location data tends to destroy utility quickly (e.g. coarsening to 8-hour windows and 1 km square tiles). The same is true of credit card transactions. Protecting high-dimensional, sparse data like this requires a different approach. In the next post in the series, we’ll talk about one of those approaches: differential privacy.

Stay tuned – or subscribe to our blog to make sure you don’t miss content like this. Simply click here, then enter your email in the little box on the right.