Privacy attacks aim to reverse data privacy protection mechanisms such as generalization, aggregation or perturbation, to expose sensitive information about individuals in a dataset. Techniques used can target aggregate data like summary counts, histograms or average statistics. Mathematical breakthroughs, easy access to more powerful compute platforms, and widespread availability of large and varied public data sets have made these techniques a growing threat to the confidentiality of aggregate data releases.

One common type of vulnerability is called a differencing attack. It uses background knowledge about an individual person to learn sensitive information about that person by taking into account multiple statistics in which the target’s data was included.

In the simplest case, a differencing attack requires only two data points. Consider an example of aggregate statistics in a retail use case, based around a fictional Loyalty Card data product. The data product contains the total amount spent by all customers on a given day, and the total amount spent by the subgroup of customers using a loyalty card. If exactly one customer makes a purchase without a loyalty card, subtracting the second statistic from the first reveals this customer’s precise total amount spent, even though only aggregate values were released.
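The arithmetic behind this example can be sketched in a few lines of Python. The figures and variable names below are invented for illustration and do not come from any real data product; amounts are held in cents to keep the subtraction exact.

```python
# Hypothetical published statistics for one day, in cents (illustrative values only).
total_spend_all_customers = 1_048_250   # 10,482.50 spent by all customers
total_spend_loyalty_card = 1_044_525    # 10,445.25 spent by loyalty card users

# Background knowledge assumed by the attacker: exactly one customer
# made a purchase without a loyalty card that day.
customers_without_card = 1


def differencing_attack(total_all, total_subgroup):
    """Return the combined spend of everyone outside the subgroup."""
    return total_all - total_subgroup


leaked_cents = differencing_attack(total_spend_all_customers, total_spend_loyalty_card)

if customers_without_card == 1:
    # The "aggregate" difference now describes a single person.
    print(f"Spend of the one customer without a card: {leaked_cents / 100:.2f}")
```

Neither statistic is sensitive on its own; the disclosure arises only from their combination with the attacker's background knowledge.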

This simple example of a differencing attack seems straightforward enough, but real-world differencing attacks can become much more complex. They can combine many more data points in arbitrary arithmetic combinations to drill down into the aggregate data until an individual’s information can be singled out. While finding and carrying out such complex computation is an easy task for algorithms programmed to do so, detecting these patterns through human review is much more difficult.
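To illustrate why such searches are easy to automate, here is a minimal brute-force sketch. It assumes the attacker knows who belongs to each published group; the names, groups and totals are hypothetical. It tries every way of adding or subtracting the released totals and reports any combination that leaves exactly one person. No two statistics in this example isolate anyone, but a combination of three does.

```python
from itertools import product

# Hypothetical customers and released totals (illustrative values only).
# The attacker is assumed to know the membership of each group.
people = ["ana", "ben", "cho", "dev", "eli"]
released = [
    ({"ana", "ben", "cho", "dev", "eli"}, 250),  # all customers
    ({"ana", "ben"}, 90),                        # loyalty card holders
    ({"cho", "dev"}, 120),                       # online shoppers
]


def find_singling_out(people, released):
    """Try every combination of coefficients in {-1, 0, 1} and return
    a dict of person -> leaked value for combinations that isolate one person."""
    leaks = {}
    for coeffs in product((-1, 0, 1), repeat=len(released)):
        # Net number of times each person is counted by this combination.
        weights = [
            sum(c for c, (members, _) in zip(coeffs, released) if person in members)
            for person in people
        ]
        # Exactly one person counted once, everyone else cancelled out.
        if sorted(weights) == [0] * (len(people) - 1) + [1]:
            target = people[weights.index(1)]
            leaks[target] = sum(c * value for c, (_, value) in zip(coeffs, released))
    return leaks


leaks = find_singling_out(people, released)
print(leaks)
```

This exhaustive search grows exponentially with the number of statistics and is only meant to show the idea; a practical attack would more likely rely on linear algebra or constraint solvers.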

Reconstruction attacks pose an even greater risk to the privacy of aggregated data. These attacks exploit the basic connection between row-level and aggregate releases from a dataset to turn a collection of statistics into an increasingly accurate record for each individual.

Every aggregate data point released gives partial information about the values in the underlying row-level data because it reduces the universe of possible records that could have generated the statistic. When a collection of statistics slices the data in multiple dimensions, this further narrows down the universe of datasets that are consistent with the published information. A sufficiently large number of aggregate tables will therefore inevitably result in a highly accurate reconstruction of each individual record.
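The narrowing effect can be made concrete with a toy enumeration. The sketch below assumes a hypothetical population of three people whose ages are known to lie between 18 and 40, and a handful of invented published statistics. It lists every possible set of ages and discards those that contradict each statistic in turn.

```python
from itertools import combinations_with_replacement
from statistics import median

# Hypothetical setting: 3 people, ages assumed to lie between 18 and 40.
COUNT = 3
AGE_RANGE = range(18, 41)

# Invented published statistics, each expressed as a check on a candidate dataset.
published = [
    ("mean age is 30", lambda ages: sum(ages) == 30 * COUNT),
    ("median age is 30", lambda ages: median(ages) == 30),
    ("2 people are over 25, with mean age 33",
     lambda ages: len([a for a in ages if a > 25]) == 2
     and sum(a for a in ages if a > 25) == 33 * 2),
]

# Every possible (unordered) set of ages before anything is published.
candidates = list(combinations_with_replacement(AGE_RANGE, COUNT))
counts = [len(candidates)]

for label, is_consistent in published:
    # Keep only the datasets that could have produced this statistic.
    candidates = [ages for ages in candidates if is_consistent(ages)]
    counts.append(len(candidates))
    print(f"after '{label}': {len(candidates)} consistent datasets")

print("remaining candidates:", candidates)
```

With these particular numbers a single candidate survives, so the three ages are recovered exactly. Larger populations need constraint solvers rather than enumeration, but the principle is the same.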

Simson Garfinkel, Senior Scientist at the US Census Bureau’s team for disclosure avoidance, gave a simple example of a reconstruction attack in his recent keynote at PETS 2019. As he explained, publishing the frequency count, mean age and median age of a population, broken down by only a few demographics, allows anyone with access to the statistics and a personal computer to accurately reconstruct the personal data of the survey population.

The US Census Bureau has recognised how real this risk is, and in 2018 announced that it would modernise its disclosure avoidance system to protect the 2020 US Census. It carried out internal experiments on the 2010 Census and found that the eight billion statistics published about the 308 million confidential records (25 data points per person) allowed accurate reconstruction of confidential records for 46% of the US population.

This demonstrates that database reconstruction is no longer a merely theoretical danger, and that every modern statistical disclosure control system should be adapted to protect against it.