Why You Can’t Solve Big Data Privacy With Small Data Tools

March 1, 2018

It’s the age of big data. With ever more complex datasets around, and ever more information being made publicly available, protecting sensitive information has never been this hard. The good news: the methods to protect data are becoming more sophisticated, too. Here’s the first in a series of posts looking at different privacy challenges – and the appropriate techniques to address them.

In 1997, the Massachusetts state Group Insurance Commission (GIC) released hospital visit data which included each patient’s 5-digit zip code, gender, and date of birth. Governor Weld of Massachusetts reassured the public that the data was adequately ‘anonymised’, as primary identifiers (such as name and address) had been deleted. He was quickly proven wrong. Latanya Sweeney, an MIT graduate student, managed to find Governor Weld’s personal health records by combining the GIC data with an electoral roll database she bought for $20. She went on to show that around 87 percent of the US population can be uniquely identified with just their 5-digit zip code, gender, and date of birth (which are all publicly available).

[Figure: Linkage attacks leverage quasi-identifiers that are common to different datasets]

This goes to show that traditional notions of what constitutes an anonymised dataset can quickly break down. More than 20 years later, in the age of big data, attackers have access to more datasets and more powerful tools for linking and analysing them. Let’s look at one of the most common techniques that businesses use to protect sensitive information: data masking.

The problem with basic masking

Most organisations, public or private, that process personal or other sensitive information do some kind of data masking, i.e. they protect the primary identifiers in the data. Take a hypothetical patient, Alice, and a simple medical diagnosis dataset as an example:

Name | Date of Birth | Postcode | Gender | Profession | Disease
Alice Smith | 13/10/1983 | NW10 8FN | Female | Mechanical Engineer | Respiratory

The organisation doesn’t want to violate Alice’s privacy by letting its data analysts see her medical diagnosis, but it does want to use her data to learn how profession influences disease. To keep Alice safe, it masks her name with a token, changing her record to:

Name | Date of Birth | Postcode | Gender | Profession | Disease
TI-7492834 | 13/10/1983 | NW10 8FN | Female | Mechanical Engineer | Respiratory


If no other data about Alice existed, this record would now be anonymised. In reality, there are still four quasi-identifiers that link Alice to her sensitive disease attribute – date of birth, postcode, gender, and profession. This means that she, and this entire ‘masked’ dataset, are vulnerable to a linkage attack, in which an attacker can use other available datasets to identify individuals – as was the case in Massachusetts.
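
To make the linkage attack concrete, here is a minimal Python sketch. It assumes the attacker holds a second dataset that still carries names alongside date of birth, postcode and gender; the `public_register` list, the entry for "Bob Jones" and the `link` helper are all invented for illustration and do not describe any real dataset or tool.

```python
# Masked release: the name is replaced by a token, quasi-identifiers are untouched.
masked_release = [
    {"token": "TI-7492834", "dob": "13/10/1983", "postcode": "NW10 8FN",
     "gender": "Female", "profession": "Mechanical Engineer",
     "disease": "Respiratory"},
]

# Hypothetical external dataset (e.g. a public register) that still has names.
# "Bob Jones" is a made-up entry for illustration only.
public_register = [
    {"name": "Alice Smith", "dob": "13/10/1983", "postcode": "NW10 8FN",
     "gender": "Female"},
    {"name": "Bob Jones", "dob": "02/03/1975", "postcode": "SW1 2HY",
     "gender": "Male"},
]

# Attributes that appear in both datasets.
QUASI_IDENTIFIERS = ("dob", "postcode", "gender")


def link(release, register, keys=QUASI_IDENTIFIERS):
    """Return (name, disease) pairs for records with exactly one register match."""
    index = {}
    for person in register:
        key = tuple(person[k] for k in keys)
        index.setdefault(key, []).append(person["name"])

    matches = []
    for record in release:
        names = index.get(tuple(record[k] for k in keys), [])
        if len(names) == 1:  # a unique match re-identifies the record
            matches.append((names[0], record["disease"]))
    return matches


reidentified = link(masked_release, public_register)
print(reidentified)  # [('Alice Smith', 'Respiratory')]
```

The token never enters the join: the three shared attributes are enough to put Alice's name back next to her diagnosis.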

Clearly, Alice and the dataset aren’t safe yet. One way organisations try to make the data safer is by introducing rules that generalise specific attributes such as date of birth and postcode. Each precise value is replaced with a coarser one, e.g. a year instead of a full date of birth, or a postcode district instead of a full postcode:

Name | Date of Birth | Postcode | Gender | Profession | Disease
TI-7492834 | 1983 | NW10 | Female | Mechanical Engineer | Respiratory
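
Fixed generalisation rules of this kind might look like the following Python sketch. The helper names are hypothetical, and the code assumes DD/MM/YYYY dates and UK postcodes written with a space before the final three characters.

```python
def generalise_dob(dob):
    """Keep only the year of a DD/MM/YYYY date of birth."""
    return dob.split("/")[-1]


def generalise_postcode(postcode):
    """Keep only the part of a UK postcode before the space."""
    return postcode.split()[0]


record = {
    "name": "TI-7492834",
    "dob": "13/10/1983",
    "postcode": "NW10 8FN",
    "gender": "Female",
    "profession": "Mechanical Engineer",
    "disease": "Respiratory",
}

# The same rules are applied to every record, however rare or common it is.
generalised = dict(
    record,
    dob=generalise_dob(record["dob"]),
    postcode=generalise_postcode(record["postcode"]),
)
print(generalised["dob"], generalised["postcode"])  # 1983 NW10
```

Gender and profession pass through unchanged because no rule was written for them.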


This is still a relatively haphazard approach to privacy and data utility. Here’s why:

  • From a privacy perspective, this approach is difficult to scale. Every quasi-identifying attribute across an organisation’s datasets would need a rule that takes into account its possible linkability to external datasets (both public and leaked, both current and future). Also, even in this generalised state there is no rule for profession, so Alice may still be the only female mechanical engineer born in 1983 and living in NW10, and therefore identifiable.
  • From a utility perspective, it removes information from the dataset in a rigid, ‘weakest link’ way: the rule applied to every record is (or should be) the rule needed to keep the most vulnerable record safe. To start making the dataset low-risk in a manner that optimises both utility and privacy, we need to turn to a more sophisticated solution.

The need for k-anonymity

K-anonymity, introduced by Latanya Sweeney and Pierangela Samarati, is a type of generalisation that’s designed to make your dataset as valuable as possible for a configured level of privacy. It works by making sure that individuals are ‘hiding in a crowd’ of at least size k. For k-anonymity to be achieved, every combination of quasi-identifier values that appears in the dataset must be shared by at least k individuals, which thwarts linkage attacks. This makes the generalisation strategy targeted: if Alice is indeed the only female mechanical engineer in our dataset who was born in 1983 and lives in NW10, we will either generalise her profession to something like ‘engineer’, if that leads to a crowd of k, or suppress her record. Consider the following simplified dataset:

Name | Postcode | Gender | Disease
Sebastian Cushman | SW1 4ZE | Male | Respiratory
Reece Glasser | SW1 2HY | Male | No Illness
Tilly Gelberman | NW10 8FN | Female | Cancer
Alice Smith | NW10 8FN | Female | Respiratory
Elise Leavitt | NW10 8FN | Female | Cardiovascular
Morgan Gwinn | E17 9QY | Male | Respiratory
George Knight | E17 3SF | Male | Liver
Sienna Davidsen | E17 5WD | Female | Cancer


To create a safer version of the data with a privacy level of k=3, we need to generalise the quasi-identifiers to create ‘crowds’ of at least three. We will treat postcode and gender as our quasi-identifiers and disease as our sensitive value. Names are direct identifiers, so they are suppressed entirely.

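Name | Postcode | Gender | Disease
* | SW1 * | Male | Respiratory
* | SW1 * | Male | No Illness
* | NW10 8FN | Female | Cancer
* | NW10 8FN | Female | Respiratory
* | NW10 8FN | Female | Cardiovascular
* | E17 * | * | Respiratory
* | E17 * | * | Liver
* | E17 * | * | Cancer

Note that the two SW1 records form a crowd of only two, so a strict k=3 release would also have to suppress them or merge them into a larger group.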

You can see:

  1. We generalised as little as possible. Since there were three women living in NW10 8FN, we were able to maintain the full utility of the postcode and gender values while still providing the k=3 privacy guarantee for that group.
  2. We generalised as much as necessary. For the individuals living in the E17 * postcodes, we completely suppressed gender from our output because there weren’t enough people of each gender to safely reveal it.
  3. k-anonymity distorts less as your data grows. With more records, it becomes easier to group individuals by similar quasi-identifiers, so values need less generalisation. This means you can run analytics on a reduced-risk dataset with increasing accuracy.
    (For a deep-dive, here’s my colleague Will’s introduction to k-anonymity)
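
Whether a generalised table actually meets a chosen k can be checked mechanically by counting the size of each crowd. The Python sketch below does this for the generalised example table, with postcode and gender as quasi-identifiers; the function names are hypothetical, and it illustrates only the check and a naive suppression step, not a full anonymisation algorithm.

```python
from collections import Counter

# (postcode, gender, disease) rows from the generalised example table.
generalised = [
    ("SW1 *", "Male", "Respiratory"),
    ("SW1 *", "Male", "No Illness"),
    ("NW10 8FN", "Female", "Cancer"),
    ("NW10 8FN", "Female", "Respiratory"),
    ("NW10 8FN", "Female", "Cardiovascular"),
    ("E17 *", "*", "Respiratory"),
    ("E17 *", "*", "Liver"),
    ("E17 *", "*", "Cancer"),
]


def crowd_sizes(rows):
    """Count how many rows share each (postcode, gender) combination."""
    return Counter((postcode, gender) for postcode, gender, _ in rows)


def violations(rows, k):
    """Return the quasi-identifier combinations shared by fewer than k rows."""
    return {qi: n for qi, n in crowd_sizes(rows).items() if n < k}


def suppress(rows, k):
    """Drop every row whose crowd is smaller than k."""
    sizes = crowd_sizes(rows)
    return [row for row in rows if sizes[(row[0], row[1])] >= k]


print(violations(generalised, 3))     # {('SW1 *', 'Male'): 2}
print(len(suppress(generalised, 3)))  # 6
```

On this table the check flags the SW1 crowd, which contains only two records. Suppressing those two rows leaves six records in two crowds of three, which does satisfy k=3.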

The complexity of big data

So, have we solved privacy with the silver bullet of k-anonymity? Not quite. What happens if every individual in your dataset is so distinct that their data can’t be reasonably generalised? Think about the location data that is collected from your phone. No one in the world follows quite the same path as you, and it would only take a few known points (e.g. where you live, where you work, where I saw you at a coffee shop at 3 p.m. yesterday) to find your complete geospatial history in a dataset. Generalising location data enough to hide individuals (e.g. to 8-hour, 1 km square tiles) tends to destroy its utility. The same is true of credit card transactions. High-dimensional, sparse data like this requires a different approach. In the next post in the series, we’ll talk about one of those approaches: differential privacy.
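
As a toy illustration of why a few known points are enough, the Python sketch below uses three made-up location traces, each a set of (tile, hour) visits. The user IDs, tile names and the `candidates` helper are all invented; real location data is far larger and sparser, which makes the effect stronger.

```python
# Made-up pseudonymised traces: each is a set of (tile, hour-of-day) visits.
traces = {
    "user_a": {("home_tile", 8), ("office_tile", 9), ("cafe_tile", 15)},
    "user_b": {("home_tile", 8), ("office_tile", 9), ("gym_tile", 18)},
    "user_c": {("home_tile", 8), ("school_tile", 9), ("cafe_tile", 15)},
}


def candidates(all_traces, known_points):
    """Return the IDs of traces that contain every point the attacker knows."""
    known = set(known_points)
    return sorted(uid for uid, points in all_traces.items() if known <= points)


# Each extra known point shrinks the set of matching traces.
one = candidates(traces, [("home_tile", 8)])
two = candidates(traces, [("home_tile", 8), ("office_tile", 9)])
three = candidates(traces, [("home_tile", 8), ("office_tile", 9), ("cafe_tile", 15)])

print(one)    # ['user_a', 'user_b', 'user_c']
print(two)    # ['user_a', 'user_b']
print(three)  # ['user_a']
```

Once a single trace remains, every other point in it is exposed, even though no name ever appeared in the data.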

