A powerful way to derive new value from data is to create data products, which draw valuable insights from data without sharing the underlying raw data itself. For example, a bank might analyse many customers’ transactions to allow their business customers to benchmark themselves against others; mobile phone location data has been used to create COVID-19 mobility reports; and professional social network data has been used to create insights into talent markets. Furthermore, many organisations, such as healthcare providers and government statistics bodies, have a requirement to publish reports or statistics drawn from sensitive data.
When creating data products and reports, it’s vital to ensure that privacy is protected, and simply restricting sharing to higher-level statistics is not enough to guarantee it. Indeed, it’s a common fallacy that privacy is protected by only sharing statistics aggregated over multiple records. This mistaken belief has led to the development of flawed or brittle ad-hoc approaches, such as requiring that tables of statistics always have a minimum cell count and that all small values are suppressed. Unfortunately, when many different aggregate queries are performed, their combination can allow the reconstruction of individual data records.
Imagine you want to compute the average salary at the large company where you work. Average salary is an aggregate query over many individuals, and so you would expect it couldn’t reveal information about any individual’s salary. However, when a new person joins the company, that person’s salary can easily be calculated by looking at the change between today’s and yesterday’s average salary!
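This “new joiner” differencing attack can be sketched in a few lines of Python. The salary figures here are entirely hypothetical; the point is that two published averages, combined with the known headcounts, pin down one individual’s exact salary:

```python
# Hypothetical salaries before a new person joins:
salaries_yesterday = [52_000, 61_000, 48_000, 75_000]
new_joiner_salary = 90_000
salaries_today = salaries_yesterday + [new_joiner_salary]

# The only values an outside observer sees are the two averages
# (and the headcounts, which are typically public):
avg_yesterday = sum(salaries_yesterday) / len(salaries_yesterday)
avg_today = sum(salaries_today) / len(salaries_today)

# Reconstruct the newcomer's salary from the aggregates alone:
recovered = (avg_today * len(salaries_today)
             - avg_yesterday * len(salaries_yesterday))
print(recovered)  # exactly the new joiner's salary
```

The same arithmetic works for any pair of overlapping aggregates that differ by one record, which is why minimum-cell-count rules alone are brittle.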
Differential privacy is a very promising technique, and is well suited to analysis over a large population of data – typically at least tens of thousands of individuals. Some quantities, such as counts, can be computed very accurately, while others require more noise and hence greater accuracy loss to provide good protection (for example, to compute average salary safely, the scale of the noise is set by the largest salary the dataset could contain). Understanding and calculating this sensitivity is vital when designing efficient differentially private systems.
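The role of sensitivity can be illustrated with a minimal sketch of the Laplace mechanism, one standard way to achieve differential privacy. All the figures (salaries, the `salary_cap` bound, the choice of `epsilon`) are illustrative assumptions, not a production recipe:

```python
import math
import random

def laplace_noise(scale: float) -> float:
    """Sample Laplace(0, scale) noise via inverse-transform sampling."""
    u = random.random() - 0.5
    return -scale * math.copysign(1.0, u) * math.log(1.0 - 2.0 * abs(u))

def dp_release(true_value: float, sensitivity: float, epsilon: float) -> float:
    """Release true_value with Laplace noise scaled to sensitivity / epsilon."""
    return true_value + laplace_noise(sensitivity / epsilon)

salaries = [52_000, 61_000, 48_000, 75_000, 90_000]  # hypothetical data
salary_cap = 100_000  # assumed public upper bound on any one salary

# A count has sensitivity 1: adding or removing one person
# changes it by at most 1, so little noise is needed.
noisy_count = dp_release(len(salaries), sensitivity=1.0, epsilon=0.5)

# A (capped) mean's sensitivity is driven by the largest possible
# salary, so it needs proportionally more noise for the same epsilon.
noisy_mean = dp_release(sum(salaries) / len(salaries),
                        sensitivity=salary_cap / len(salaries),
                        epsilon=0.5)
```

Note how the count and the mean use the same mechanism but very different noise scales: this is the sensitivity calculation the paragraph above refers to.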
Differential privacy is a strong, mathematical definition of privacy in the context of statistical and machine learning analysis. It is used to enable the collection, analysis, and sharing of a broad range of statistical estimates, such as averages, contingency tables, and synthetic data, based on personal data while protecting the privacy of the individuals in the data.
Do contact us if you’d like our help to create and share insights from data safely, or if you’re exploring differential privacy and need help putting it into production.