A powerful way to derive new value from data is to create data products, which share valuable insights drawn from the data without exposing the underlying raw records. For example, a bank might analyse many customers' transactions so that its business customers can benchmark themselves against others; mobile phone location data has been used to create COVID-19 mobility reports; and professional social network data has been used to generate insights into talent markets. Furthermore, many organisations, such as healthcare providers, government bodies and statistics agencies, are required to publish reports or statistics drawn from sensitive data.
When creating data products and reports, it's vital to ensure that privacy is protected, and simply restricting what is shared to higher-level statistics is not enough to guarantee it. Indeed, it's a common fallacy that privacy is protected by only sharing statistics aggregated over multiple records. This mistaken belief has led to the development of flawed or brittle ad hoc approaches, such as requiring that tables of statistics always have a minimum cell count, with all smaller values suppressed. Unfortunately, when many different aggregate queries are answered, their results can be combined to reconstruct individual data records.
Imagine you want to compute the average salary at the large company where you work. Average salary is an aggregate query over many individuals, so you might expect that it couldn't reveal information about any individual's salary. However, when a new person joins the company, that person's salary can easily be calculated by comparing today's average salary with yesterday's!
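To make this concrete, here is a minimal sketch of the attack in Python; the headcount and salary figures are made-up illustrative numbers:

```python
# A minimal sketch of the averaging attack described above.
# The headcount and salaries are made-up illustrative numbers.
n_before = 999            # employees yesterday
avg_before = 52_000.00    # published average salary yesterday
avg_after = 52_048.00     # published average salary after one new joiner

# Recover the total payroll from each published average.
total_before = avg_before * n_before
total_after = avg_after * (n_before + 1)

# The difference between the totals is exactly the new joiner's salary.
new_salary = total_after - total_before
print(f"New joiner's salary: {new_salary:,.2f}")  # 100,000.00
```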
Differential privacy is an emerging technique that addresses this problem. It is the focus of great optimism and growing attention in the academic community, and is now seeing its first industrial applications. Differential privacy is attractive because it brings a very clear definition of privacy, and its mathematics leads to very clear privacy guarantees. However, it is only suited to a subset of data analysis use cases, and the technique can be hard for data users to manage and use effectively.
The US Census is a flagship adopter of differential privacy. The Census Bureau compiles billions of statistics from the data it collects. The Bureau has demonstrated that the techniques previously used to protect this wealth of intersecting statistics, in the 2010 Census, were insufficient and still allowed the underlying raw records to be reconstructed. As the Census Bureau has a legal duty of confidentiality, it is adopting differential privacy to protect the 2020 Census data.
Differential privacy works by adding a small amount of random noise to the aggregate values calculated from the data. In essence, the noise is roughly the same size as any one individual's contribution to the result, so it thwarts any attempt to reconstruct an individual's value. If the dataset is large, the noise introduces only a small deviation from the true answer, so the differentially private value remains useful for analysis and business decisions.
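To illustrate, here is a minimal sketch of the Laplace mechanism, the classic way of adding such calibrated noise; the population and the privacy parameter epsilon below are illustrative assumptions:

```python
# A minimal sketch of the Laplace mechanism for a count query.
# Removing or adding one person changes a count by at most 1, so the
# sensitivity is 1 and the noise scale is 1 / epsilon.
import numpy as np

rng = np.random.default_rng()

def dp_count(records, epsilon):
    true_count = len(records)
    noise = rng.laplace(loc=0.0, scale=1.0 / epsilon)
    return true_count + noise

customers = range(50_000)                # a large population (assumed)
print(dp_count(customers, epsilon=0.1))  # close to 50,000, off by ~tens
```

Because the count is around 50,000 and the noise is typically in the tens, the relative error is a fraction of a percent – the large population is what makes the noisy answer so accurate.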
Differential privacy is a very promising technique, and is well suited to analysis over a large population of data – typically at least tens of thousands of individuals. Some quantities, such as counts, can be computed very accurately, while others require more noise, and hence a greater loss of accuracy, to provide good protection: for example, to compute an average salary safely, the scale of the noise must be set by the largest salary in the dataset. The maximum influence any one individual can have on a result is known as the query's sensitivity, and understanding and calculating it is vital when designing efficient differentially private systems.
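To show how sensitivity drives the noise, here is a sketch of a differentially private mean in which salaries are clipped to an assumed upper bound, so that no single person (however highly paid) can shift the result by more than a known amount; the bound, the epsilon and the data are all made-up assumptions:

```python
# A minimal sketch of a differentially private mean with clipping.
# Clipping to an assumed bound caps any one person's influence on the
# result, which caps the sensitivity and hence the noise needed.
import numpy as np

rng = np.random.default_rng()

def dp_mean(values, lower, upper, epsilon):
    clipped = np.clip(values, lower, upper)
    # Replacing one clipped value can shift the sum by at most
    # (upper - lower), so the mean's sensitivity is that range over n.
    sensitivity = (upper - lower) / len(clipped)
    return clipped.mean() + rng.laplace(scale=sensitivity / epsilon)

salaries = rng.normal(52_000, 8_000, size=50_000)   # made-up salary data
print(dp_mean(salaries, lower=0.0, upper=200_000.0, epsilon=0.5))
```

Note the trade-off: a looser bound protects highly paid outliers at the cost of more noise, while too tight a bound biases the clipped mean.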
Differential privacy is a strong, mathematical definition of privacy in the context of statistical and machine learning analysis. It is used to enable the collection, analysis, and sharing of a broad range of statistical estimates, such as averages, contingency tables, and synthetic data, based on personal data while protecting the privacy of the individuals in the data.
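For readers who want the formal statement, the standard definition is as follows: a randomised algorithm M is ε-differentially private if, for every pair of datasets D and D′ that differ in a single individual's record, and for every set of possible outputs S,

Pr[M(D) ∈ S] ≤ e^ε · Pr[M(D′) ∈ S].

The privacy parameter ε controls the strength of the guarantee: the smaller it is, the less the published results can depend on any one person's data.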
Do contact us if you'd like our help to create and share insights from data safely, or if you're exploring differential privacy and need help in putting it into production safely.