The New York Times and the MIT Technology Review recently reported on a new study that highlights the significant risk of re-identifying individuals in sensitive personal data that has been anonymized.

In the study, published in Nature Communications, researchers from the U.K. and Belgium describe a model that estimates how easily individuals can be re-identified from an anonymized data set. The study even includes a tool that returns a re-identification score when you enter a ZIP code, gender, and date of birth.

In Massachusetts, the study claims, those three attributes alone are enough to correctly locate a person in an anonymized database 79.4% of the time on average. Given 15 demographic attributes of someone living in the U.S., there is a 99.98% chance you could find that person in any anonymized database, according to the study. In the study, 'anonymized' means that direct identifiers have been removed and the data has been subsampled, keeping only a random selection of the original rows.

The study’s troubling findings add to the accumulating evidence that the privacy protection efforts common today are not nearly as comprehensive or effective as they need to be to protect sensitive personal data. To avoid the loss of customer trust and the brand erosion that follow a data-privacy breach, companies should re-examine their privacy techniques to ensure they appropriately mitigate the risk of re-identification.

Using privacy techniques appropriately

There is no silver bullet for protecting sensitive personal data, and it’s unlikely that there will ever be a 100% data privacy guarantee. Anonymization techniques are just some of the many data-privacy protection tools organizations should use to provide the degree of data privacy required for the safe use of sensitive data.

In fact, the definition of what constitutes ‘anonymization’ varies markedly. In the U.K., for example, qualifying data as anonymized involves looking not only at the data transformation techniques applied, but also at the environmental controls governing how the data is used.

Unlike the U.K., the U.S. has no national data protection law, and the term ‘anonymous’ has no legal meaning. Consequently, organizations can and do claim their data is ‘anonymized’ simply because it sounds good.

Anonymization techniques such as pseudonymization, which replaces direct identifiers with pseudonyms, and generalization, which reduces the specificity of values, are essential protections for sensitive data. They make it substantially more difficult to identify people in the data. However, they are not enough on their own.
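As a rough illustration only, the Python sketch below shows what pseudonymization and generalization can look like on a single record. The field names, the keyed-hash pseudonym, and the generalization rules (date of birth to birth year, ZIP code to its first three digits) are hypothetical choices, not a description of any particular product’s method.

```python
import hmac
import hashlib

# Hypothetical secret key for keyed-hash pseudonyms; in practice this would be
# managed securely and never hard-coded.
SECRET_KEY = b"replace-with-a-managed-secret"

def pseudonymize(value: str) -> str:
    """Replace a direct identifier with a keyed-hash pseudonym."""
    return hmac.new(SECRET_KEY, value.encode(), hashlib.sha256).hexdigest()

def generalize(record: dict) -> dict:
    """Reduce the specificity of quasi-identifiers while keeping analytic values."""
    return {
        "id": pseudonymize(record["email"]),         # direct identifier -> pseudonym
        "birth_year": record["date_of_birth"][:4],   # '1984-06-02' -> '1984'
        "zip3": record["zip_code"][:3],              # '02139' -> '021'
        "diagnosis": record["diagnosis"],            # analytic value kept as-is
    }

print(generalize({
    "email": "alice@example.com",
    "date_of_birth": "1984-06-02",
    "zip_code": "02139",
    "diagnosis": "asthma",
}))
```

Even transformed this way, the remaining quasi-identifiers can still single people out, which is exactly the risk the study quantifies; hence the further protections discussed next.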

Further protections can be applied, including environmental controls, workflows that bring analysis code to the data, and more advanced data-level controls such as differential privacy. The right tools depend on the data, the risks, the analysis goals, and the context of the situation.

The environment in which data is shared can affect privacy risk. For instance, sharing data with internal data scientists, who are contractually bound not to attempt to re-identify the data, and whose actions are monitored for suspicious activity, reduces the risk of privacy attacks being attempted.

In other cases, bringing analysis code or queries to the data rather than sharing the data itself also reduces risk. This can be done even when the data is distributed across many data sources through techniques such as federated learning. 
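One minimal way to picture bringing the query to the data is the sketch below: each data holder runs an agreed aggregate locally and returns only summary statistics, never row-level records. This is a simplified aggregate-query example, not full federated learning, and the sites and values are hypothetical.

```python
def local_aggregate(records):
    """Run at the data holder: return (sum, count) for the agreed query.
    Raw records never leave the site."""
    ages = [r["age"] for r in records]
    return sum(ages), len(ages)

# Hypothetical distributed data sources
site_a = [{"age": 34}, {"age": 71}, {"age": 25}]
site_b = [{"age": 68}, {"age": 80}, {"age": 43}]

# The coordinator sees only the aggregates, not individual rows.
partials = [local_aggregate(site) for site in (site_a, site_b)]
total, count = map(sum, zip(*partials))
print("mean age across sites:", total / count)
```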

In cases where there is a clear set of valuable statistics to share, re-identification risk can be controlled by releasing only those statistics using differentially private techniques. Differential privacy provides a robust mathematical guarantee that a data release poses very little additional privacy risk to any individual.
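As a minimal sketch of the idea, assuming a simple count query and a hypothetical epsilon and data set, the Laplace mechanism below adds calibrated noise before the statistic is released. This is one standard differentially private technique, not a description of any specific product’s implementation.

```python
import numpy as np

def dp_count(values, predicate, epsilon: float) -> float:
    """Release a count with the Laplace mechanism.

    A count has sensitivity 1 (adding or removing one person changes it by at
    most 1), so noise drawn from Laplace(1/epsilon) yields an
    epsilon-differentially-private release.
    """
    true_count = sum(1 for v in values if predicate(v))
    noise = np.random.laplace(loc=0.0, scale=1.0 / epsilon)
    return true_count + noise

# Hypothetical example: how many people in the data set are over 65?
ages = [34, 71, 68, 25, 80, 43, 67]
print(dp_count(ages, lambda age: age > 65, epsilon=0.5))
```

Smaller values of epsilon mean more noise and stronger privacy; the analyst sees only the noisy count, so no single person’s presence or absence can be confidently inferred from the release.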

Note: Dr. Yves-Alexandre de Montjoye, one of the authors of the study, is an academic adviser to Privitar.

Jason du Preez is Privitar’s CEO.