Differential privacy: touted as the solution for accessing all your sensitive data with absolute privacy assurance. The privacy defence that enables you to extract unlimited insights and train AI models while your customers’ hospital visits or pharmacy purchases are never revealed. The reason your compliance team gets out of the way, your data analysis projects start accelerating, and new customers begin lining up with their chequebooks.
But does differential privacy really deliver on these promises?
Unfortunately not. Like any technology innovation, differential privacy (DP) is susceptible to hype: fact and fiction get mixed together in all the excitement. The adoption of DP by technologists at Apple, Google, and the US Census Bureau has only amplified the buzz. In reality, DP is still in the early stages of real-world use, and practitioners are still figuring out when and how to apply it.
What are the questions that cut to the heart of any DP discussion, allowing you to distinguish real, immediate value from overhype? This article presents the big three.
1. What are you going to use it for?
Differential privacy is not a magic wand for all analytics use cases. Firstly, it applies only to aggregate statistics or machine learning models: DP does not permit releasing information about individual entries directly. Secondly, there is no general-purpose DP solution that enables all types of aggregate analysis on all types of data. Such all-purpose tools are impossible because of an inescapable mathematical result, which can be summarised as “if you release too many sufficiently accurate statistics, you reveal all your source data.”
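To make the “too many statistics” point concrete, here is a minimal sketch of a differencing attack on exact (non-private) counts. The dataset, the names in it, and the helper functions are all hypothetical and exist only for illustration; this is a toy case, not the general reconstruction result.

```python
# Hypothetical toy dataset: (name, has_condition). Not real data.
records = [
    ("alice", True),
    ("bob", False),
    ("carol", True),
    ("dave", False),
]


def count_with_condition(rows):
    """Exact aggregate: how many people in `rows` have the condition."""
    return sum(1 for _, has_condition in rows if has_condition)


def differencing_attack(rows, target):
    """Infer the target's value from two exact, innocent-looking counts."""
    everyone = count_with_condition(rows)
    everyone_but_target = count_with_condition(
        [row for row in rows if row[0] != target]
    )
    # The two counts differ by exactly 1 only if the target has the condition.
    return everyone - everyone_but_target == 1


if __name__ == "__main__":
    for name, _ in records:
        print(name, differencing_attack(records, name))
```

Neither count names an individual, yet together they reveal one person's record exactly. DP defends against this by adding noise to each answer and limiting how many answers are released.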
But DP has been applied to some well-defined, well-scoped analytics use cases, and here it has successfully unlocked new value. These use cases have some common traits: large datasets and tolerance for approximate, not exact, statistics. And efficient DP algorithms for new use cases are being developed all the time by the research community, so DP’s real-world applicability is always growing.
If you have a use case like this and need strong privacy protection, adopting DP makes sense. Which leads to the next question:
2. How will you set epsilon?
Differential privacy is actually short for “epsilon-differential privacy”. That parameter, epsilon, controls the strength of the protection that DP provides. Somewhat confusingly, epsilon is inversely related to the amount of privacy: the lower the epsilon, the more privacy. An epsilon of 0.01 is very private, while an epsilon of 100 offers very little privacy.
Why can’t we just set epsilon to 0? Because privacy doesn’t come for free. The lower epsilon goes, the more the accuracy of the data is damaged. There’s a privacy-utility trade-off, and maximising privacy will give complete gibberish data—that is, no utility.
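The trade-off can be seen in a small sketch of the Laplace mechanism, one standard way to make a count differentially private. The true count of 1000, the function names, and the trial count are assumptions chosen for illustration; a production system should use a vetted DP library rather than this code.

```python
import random


def laplace_noise(scale, rng):
    """Sample Laplace(0, scale) noise.

    The difference of two independent exponentials with mean `scale`
    follows a Laplace distribution with that scale.
    """
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def noisy_count(true_count, epsilon, rng):
    """Release a count with Laplace noise calibrated to epsilon."""
    # Adding or removing one person changes a count by at most 1.
    sensitivity = 1.0
    # Noise scale is sensitivity / epsilon, so it is undefined at epsilon = 0.
    return true_count + laplace_noise(sensitivity / epsilon, rng)


def mean_abs_error(epsilon, trials=10_000, seed=0):
    """Average absolute error of the noisy count over many releases."""
    rng = random.Random(seed)
    true_count = 1000  # hypothetical value, for illustration only
    total = sum(
        abs(noisy_count(true_count, epsilon, rng) - true_count)
        for _ in range(trials)
    )
    return total / trials


if __name__ == "__main__":
    for eps in (0.01, 1.0, 100.0):
        print(f"epsilon={eps}: mean absolute error = {mean_abs_error(eps):.3f}")
```

The expected absolute error equals the noise scale, 1/epsilon for a count, so it is around 100 at epsilon 0.01 and around 0.01 at epsilon 100. As epsilon approaches 0 the noise grows without bound, which is why an epsilon of 0 leaves no usable signal.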
A better course of action is to determine what epsilon will reduce risk to an acceptable level, and select that. However, context matters – in some cases, you may want to nudge the dials more towards privacy, while in others more towards utility. It depends on what you’re protecting, who you’re showing the data to, and what other controls are in place. You want an appropriate epsilon for each context. And of course, your context may change over time.
These epsilon settings also need to be justified. If it becomes public that your customers are protected with epsilon of 5.4, will customers be relieved? Terrified? Will regulators be angry? More likely everyone will be unsure how to feel. You need a principled approach to setting epsilon and a way to justify why the epsilon you’ve chosen is safe in practical terms.
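One common way to put epsilon in practical terms is to ask how much an attacker's belief about one person could shift after seeing the output. Under pure epsilon-DP, the odds of any hypothesis about an individual can grow by a factor of at most e^epsilon. The sketch below computes that bound; the function names and the 1% prior are hypothetical, and this is one interpretation among several, not a complete justification method.

```python
import math


def max_likelihood_ratio(epsilon):
    """Largest factor by which an output's probability can change when
    one person's data changes, under pure epsilon-DP."""
    return math.exp(epsilon)


def posterior_upper_bound(prior, epsilon):
    """Upper bound on an attacker's belief after seeing a DP output,
    given their prior belief (0 < prior < 1)."""
    prior_odds = prior / (1.0 - prior)
    # Odds can be multiplied by at most e^epsilon.
    posterior_odds = math.exp(epsilon) * prior_odds
    return posterior_odds / (1.0 + posterior_odds)


if __name__ == "__main__":
    prior = 0.01  # hypothetical: attacker starts out 1% sure
    for eps in (0.01, 1.0, 5.4):
        print(
            f"epsilon={eps}: ratio <= {max_likelihood_ratio(eps):.1f}, "
            f"posterior <= {posterior_upper_bound(prior, eps):.3f}"
        )
```

This is a worst-case bound, so real attackers may learn far less, but it gives customers and regulators a number they can reason about.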
3. Is there a good privacy-utility trade-off?
As mentioned, privacy doesn’t come for free. Adopting DP can yield the best defence against many privacy attacks, but it involves limiting the statistics you release and adding noise to them. The loss in utility needs to be worth the gain in privacy, and whether it is depends on context.
Exactly how much accuracy loss are you suffering, and does this matter? For instance, if you are calculating how much to bill your customers, any noise at all is probably unacceptable, because most people want accurate bills. This would be a bad place to use DP.
If you’re okay on the utility front, then consider: exactly how much privacy are you gaining, and does that matter? For instance, if you are showing statistics to a room full of IT staff who have access to the raw data anyway through another channel, you have not gained any real privacy by using DP on your statistics. When considering DP, ask how many people it would really prevent from getting sensitive information.
In some situations, differential privacy yields a needed increase in privacy with no real detriment to utility compared to alternative approaches. The Google RAPPOR project is a detailed, real-world example of this (see reference 1). It’s in these situations that differential privacy is most valuable.
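The “flip of a coin” idea behind RAPPOR can be sketched with classic randomized response. This is a simplified illustration of the underlying technique, not RAPPOR's actual encoding, and the function names and simulated population are assumptions made for the example.

```python
import math
import random


def randomized_response(truth, rng):
    """Report a yes/no answer with plausible deniability."""
    # First coin: heads means answer truthfully.
    if rng.random() < 0.5:
        return truth
    # Tails: a second coin decides the answer, regardless of the truth.
    return rng.random() < 0.5


def estimate_true_rate(reports):
    """Recover the population rate from noisy reports.

    P(report yes) = 0.5 * p + 0.25, so p = 2 * observed - 0.5.
    """
    observed = sum(reports) / len(reports)
    return 2.0 * observed - 0.5


# P(yes | truth yes) = 0.75 and P(yes | truth no) = 0.25, a ratio of 3,
# so each individual report satisfies epsilon = ln(3).
EPSILON = math.log(3)


if __name__ == "__main__":
    rng = random.Random(42)
    # Hypothetical population: 30% have the sensitive attribute.
    truths = [i < 3_000 for i in range(10_000)]
    reports = [randomized_response(t, rng) for t in truths]
    print(f"estimated rate: {estimate_true_rate(reports):.3f}")
    print(f"epsilon per report: {EPSILON:.3f}")
```

No single report can be trusted, yet with enough respondents the aggregate estimate lands close to the true rate. That combination of large datasets and tolerance for approximate answers is where DP costs little utility.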
For more information on Privitar’s differential privacy offerings, contact us.
1 “Learning Statistics with Privacy, aided by the Flip of a Coin”: https://ai.googleblog.com/2014/10/learning-statistics-with-privacy-aided.html