Privitar Labs

Sharing data insights using differential privacy

A powerful way to derive new value from data is to create data products, which draw valuable insights from data without sharing the underlying raw data itself. For example, a bank might analyse many customers’ transactions to allow its business customers to benchmark themselves against others; mobile phone location data has been used to create COVID-19 mobility reports; and professional social network data has been used to create insights into talent markets. Furthermore, many organisations, such as healthcare providers and government and statistics organisations, have a requirement to publish reports or statistics drawn from sensitive data.

It’s a widely-held fallacy that aggregating data ensures privacy.
It does not.

When creating data products and reports, it’s vital to ensure that privacy is protected, and the simple restriction that only higher-level statistics are shared is not enough to guarantee privacy. Indeed, it’s a common fallacy that privacy is protected by only sharing statistics aggregated over multiple records. This mistaken belief has led to the development of flawed or brittle ad hoc approaches, such as requiring that tables of statistics always have a minimum cell count and that all small values are suppressed. Unfortunately, when many different aggregate queries are performed, their combination can allow individual data records to be reconstructed.

Imagine you want to compute the average salary at the large company where you work. Average salary is an aggregate query over many individuals, and so you would expect it couldn’t reveal information about any individual’s salary. However, when a new person joins the company, that person’s salary can easily be calculated by looking at the change between today’s and yesterday’s average salary!
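To see how, consider a minimal worked sketch in Python (the figures are invented for illustration): knowing the two published averages and the two headcounts is enough to recover the newcomer’s exact salary.

    # Differencing attack: two "safe" averages reveal one person's salary.
    # All figures are invented for illustration.
    salaries_yesterday = [52_000, 61_000, 48_000, 75_000, 58_000]
    new_joiner_salary = 90_000  # the value an attacker "shouldn't" be able to learn
    salaries_today = salaries_yesterday + [new_joiner_salary]

    avg_yesterday = sum(salaries_yesterday) / len(salaries_yesterday)
    avg_today = sum(salaries_today) / len(salaries_today)

    # Recover the newcomer's salary from the two published aggregates:
    recovered = avg_today * len(salaries_today) - avg_yesterday * len(salaries_yesterday)
    print(recovered)  # 90000.0

Suppressing small cells doesn’t help here: both averages are computed over the whole company.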

A quantifiable definition of privacy

Differential privacy is an emerging technique that addresses this problem. It is the focus of great optimism and growing attention in the academic community, and is now seeing its first industrial applications. Differential privacy is attractive because it provides a precise, quantifiable definition of privacy, and its mathematics leads to clear, provable privacy guarantees. However, it is only suited to a subset of data analysis use cases, and the technique can be hard for data users to manage and use effectively.

The US Census Bureau is a flagship adopter of differential privacy. The Bureau compiles billions of statistics from the data it collects. It has demonstrated that the techniques previously used to protect this wealth of intersecting statistics in the 2010 Census were insufficient, and allowed the underlying raw records to be reconstructed. As the Census Bureau has a legal duty of confidentiality, it is adopting differential privacy to protect the 2020 Census data.

Differential privacy works by adding a small amount of random noise to the aggregate values calculated from the data. In essence, the noise is roughly the same size as any one individual’s contribution to the result, and so it thwarts any attempt to reconstruct an individual’s value. If the dataset is large, the noise introduces only a small error relative to the true answer, so the differentially private value remains useful for analysis and business decisions.
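As an illustration (a minimal sketch, not any particular product’s implementation), here is the standard Laplace mechanism applied to a counting query in Python; the dataset, predicate and epsilon value are made up for the example.

    import numpy as np

    def dp_count(data, predicate, epsilon):
        """Differentially private count via the Laplace mechanism.

        Adding or removing one individual changes a count by at most 1
        (its sensitivity), so noise of scale 1/epsilon suffices.
        """
        true_count = sum(1 for record in data if predicate(record))
        noise = np.random.laplace(loc=0.0, scale=1.0 / epsilon)
        return true_count + noise

    # Illustrative use: count customers over 40 in a made-up dataset of ages.
    ages = np.random.randint(18, 90, size=100_000)
    print(dp_count(ages, lambda a: a > 40, epsilon=0.1))
    # With 100,000 records, noise with scale 10 barely perturbs the answer.

Note that the noise scale depends only on the query and epsilon, not on the size of the dataset, which is why large populations lose so little accuracy.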

Understanding the sensitive side of data

Differential privacy is a very promising technique, and is well suited to analysis over a large population of data – typically at least tens of thousands of individuals. Some quantities, such as counts, can be computed very accurately, while others require more noise, and hence greater accuracy loss, to provide good protection: for example, to compute average salary safely, the scale of the noise is set by the largest salary that could appear in the dataset. This quantity – how much one individual can change the result – is known as the query’s sensitivity, and understanding and calculating it is vital when designing efficient differentially private systems.
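One common way to bound that sensitivity is to clip each contribution to an agreed range before adding noise, so that the clipping bounds, rather than the largest actual salary, set the noise scale. A hedged sketch along those lines, assuming the record count is public and the bounds are chosen in advance:

    import numpy as np

    def dp_average(values, lower, upper, epsilon):
        """Differentially private mean of values clipped to [lower, upper].

        Clipping bounds the sensitivity: replacing one person's value changes
        the sum by at most (upper - lower), so Laplace noise of that scale
        divided by epsilon protects the sum. The count n is treated as public.
        """
        values = np.clip(values, lower, upper)
        noisy_sum = values.sum() + np.random.laplace(scale=(upper - lower) / epsilon)
        return noisy_sum / len(values)

    # Illustrative: 50,000 made-up salaries, clipped to a 0-500,000 range.
    salaries = np.random.lognormal(mean=11, sigma=0.5, size=50_000)
    print(dp_average(salaries, 0, 500_000, epsilon=0.5))

The choice of bounds is itself a design decision: too wide and the noise grows; too narrow and clipping biases the result.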

Differential privacy can be challenging to adopt

Calibrating the noise level:
A principled approach must be taken to setting the noise level, so that enough noise is added to protect privacy but accuracy is not lost unnecessarily.
Restricting the total information leakage:
There must be a restriction on how many queries can be made, as every query reveals some information.
Dealing with rapid change:
There aren’t yet good principled mechanisms to deal with data which is changing rapidly over time.

At Privitar, we’re working to solve these technical challenges so that differential privacy can be applied correctly in practical solutions to real-world use cases. In particular, we’ve done a lot of work on understanding privacy risks in aggregate data releases, and on how to choose the mathematical parameters of differential privacy algorithms effectively.
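The second challenge above, restricting total leakage, is usually handled by metering a privacy budget: every query spends some epsilon, and composition results bound the cumulative loss. A toy sketch using the simplest (sequential) accounting, purely for illustration:

    class PrivacyBudget:
        """Toy budget accountant using basic sequential composition:
        the epsilons of successive queries simply add up."""

        def __init__(self, total_epsilon):
            self.total_epsilon = total_epsilon
            self.spent = 0.0

        def charge(self, epsilon):
            # Refuse the query if answering it would exceed the overall budget.
            if self.spent + epsilon > self.total_epsilon:
                raise RuntimeError("privacy budget exhausted: query refused")
            self.spent += epsilon

    budget = PrivacyBudget(total_epsilon=1.0)
    budget.charge(0.4)    # first query is answered
    budget.charge(0.4)    # second query is answered
    # budget.charge(0.4)  # a third query of the same cost would be refused

Real systems use tighter composition theorems and more sophisticated accountants, but the principle is the same: every answer leaks a little, so leakage has to be metered.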

Differential privacy opens up new data opportunities

Creating safe data products from sensitive raw data
Monetising data by sharing insights or interactive analytics
Enabling well-defined analytic operations on large datasets

Additional Resources

Differential Privacy: A Primer for a Non-technical Audience

Differential privacy is a strong, mathematical definition of privacy in the context of statistical and machine learning analysis. It is used to enable the collection, analysis, and sharing of a broad range of statistical estimates, such as averages, contingency tables, and synthetic data, based on personal data while protecting the privacy of the individuals in the data.

Team up with Privitar Labs

Do contact us if you’d like our help to create and share insights from data safely, or if you’re exploring differential privacy and need help in putting it into production safely.