Organisations today are entrusted with personal data that they use to serve customers and improve decision making, but much of the value in that data still goes untapped. This data could be invaluable to third-party researchers and analysts in answering questions ranging from town planning to fighting cancer, so organisations often want to share it while protecting the privacy of individuals. It is equally important to preserve the utility of the data, so that analyses based on it remain accurate.
Data owners want a way to transform a dataset containing highly sensitive information into a privacy-preserving, low-risk set of records that can be shared with anyone from researchers to corporate partners. Increasingly, however, companies have released datasets that they believed to be anonymised, only for a significant fraction of the records to be re-identified afterwards. It is vital to understand how anonymisation techniques work, what their strengths and limitations are, and where they can be safely applied.
This introduction looks at k-anonymity, a privacy model commonly applied to protect data subjects’ privacy in data sharing scenarios, and the guarantees that k-anonymity can provide when used to anonymise data. In many privacy-preserving systems, the end goal is anonymity for the data subjects. Taken at face value, anonymity just means being nameless, but a closer look quickly makes it clear that removing names from a dataset is not sufficient to achieve anonymisation. Anonymised data can be re-identified by linking it with another dataset. The data may include pieces of information that are not themselves unique identifiers but can become identifying when combined with other datasets. These are known as quasi-identifiers.
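The linkage described above can be sketched in a few lines of Python. This is a minimal illustration, not a real attack tool: the two datasets, the field names, and the `link_records` helper are all hypothetical and invented for the example. It shows how a release with names removed can still be joined to a public register on the quasi-identifiers alone.

```python
# Hypothetical data: every record and field name below is made up for illustration.
QUASI_IDENTIFIERS = ("zip", "gender", "dob")

# A release with names removed but quasi-identifiers left intact.
released = [
    {"zip": "02138", "gender": "F", "dob": "1950-03-14", "disease": "heart disease"},
    {"zip": "02139", "gender": "M", "dob": "1962-11-02", "disease": "asthma"},
    {"zip": "02141", "gender": "F", "dob": "1988-06-25", "disease": "diabetes"},
]

# A separate, public dataset that carries names alongside the same attributes.
public_register = [
    {"name": "Alice Example", "zip": "02138", "gender": "F", "dob": "1950-03-14"},
    {"name": "Bob Example", "zip": "02139", "gender": "M", "dob": "1962-11-02"},
    {"name": "Carol Example", "zip": "02142", "gender": "F", "dob": "1988-06-25"},
]


def link_records(released, public, keys):
    """Return (name, disease) pairs where the quasi-identifiers match exactly one person."""
    # Index the public dataset by its quasi-identifier values.
    by_key = {}
    for person in public:
        by_key.setdefault(tuple(person[k] for k in keys), []).append(person["name"])

    matches = []
    for record in released:
        candidates = by_key.get(tuple(record[k] for k in keys), [])
        # Exactly one candidate means the record is linked to a single named person.
        if len(candidates) == 1:
            matches.append((candidates[0], record["disease"]))
    return matches


reidentified = link_records(released, public_register, QUASI_IDENTIFIERS)
print(reidentified)
```

In this toy data two of the three released records are linked to a name. The third is not, only because the public register happens to hold a different zip code for that person.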
For example, around 87 percent of the US population can be uniquely identified with just their 5-digit zip code, gender, and date of birth taken together. Even when only a small fraction of individuals are uniquely identifiable, re-identification can still be a severe privacy breach for those affected. It is never possible to know the full set of additional information that is available, and therefore which attributes could become identifying.
K-anonymity is a key concept that was introduced to address the risk of re-identification of anonymised data through linkage to other datasets. The k-anonymity privacy model was first proposed in 1998 by Pierangela Samarati and Latanya Sweeney in their paper ‘Protecting privacy when disclosing information: k-anonymity and its enforcement through generalization and suppression’. For k-anonymity to be achieved, every combination of potentially identifying attributes that appears in the dataset must be shared by at least k individuals. K-anonymity might be described as a ‘hiding in the crowd’ guarantee: if each individual is part of a group of at least k records with the same quasi-identifier values, then any record in that group could correspond to that individual.
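The definition above translates directly into a check: group the records by their quasi-identifier values and take the size of the smallest group. The sketch below assumes records are plain dictionaries. The helper names `smallest_group_size` and `is_k_anonymous`, and the sample records, are hypothetical and exist only for this example.

```python
from collections import Counter


def smallest_group_size(records, quasi_identifiers):
    """Size of the smallest group of records sharing the same quasi-identifier values."""
    groups = Counter(tuple(r[q] for q in quasi_identifiers) for r in records)
    # Convention chosen for this sketch: an empty dataset has a smallest group of 0.
    return min(groups.values()) if groups else 0


def is_k_anonymous(records, quasi_identifiers, k):
    """True if every quasi-identifier combination occurs at least k times."""
    return smallest_group_size(records, quasi_identifiers) >= k


# Hypothetical, already-generalised records.
records = [
    {"postcode": "N1", "age": "20-29", "gender": "F", "disease": "flu"},
    {"postcode": "N1", "age": "20-29", "gender": "F", "disease": "asthma"},
    {"postcode": "N1", "age": "20-29", "gender": "F", "disease": "flu"},
    {"postcode": "E2", "age": "40-49", "gender": "M", "disease": "diabetes"},
    {"postcode": "E2", "age": "40-49", "gender": "M", "disease": "flu"},
]

qis = ("postcode", "age", "gender")
print(smallest_group_size(records, qis))
print(is_k_anonymous(records, qis, 2))
print(is_k_anonymous(records, qis, 3))
```

The sensitive attribute is deliberately left out of the grouping: k-anonymity constrains only the quasi-identifiers, so the dataset is judged by its smallest group, here the two E2 records.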
Name, Postcode, Age, and Gender are attributes that could all be used to help narrow down the record to an individual; these are considered quasi-identifiers as they could be found in other data sources. Disease is the sensitive attribute that we wish to study and which we assume the individual has an interest in keeping private.
This second table shows the data anonymised to achieve k-anonymity with k = 3. This was achieved by generalising some quasi-identifier attributes and redacting others. In this small example the data has been distorted quite significantly, but the larger the dataset, the less distortion is required to reach the desired level of k.
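Generalisation and redaction can be illustrated with a short Python sketch. The records below are hypothetical and are not the rows of the tables discussed here, and the particular rules (redact the name, keep only the first part of the postcode, bucket age into ten-year bands) are assumptions chosen for the example rather than a recommended scheme.

```python
from collections import Counter

# Hypothetical raw records, invented for illustration.
raw = [
    {"name": "Ann", "postcode": "N1 9GU", "age": 23, "gender": "F", "disease": "flu"},
    {"name": "Beth", "postcode": "N1 7AB", "age": 27, "gender": "F", "disease": "asthma"},
    {"name": "Cara", "postcode": "N1 4QT", "age": 29, "gender": "F", "disease": "flu"},
    {"name": "Dan", "postcode": "E2 8HD", "age": 41, "gender": "M", "disease": "diabetes"},
    {"name": "Ed", "postcode": "E2 6LP", "age": 45, "gender": "M", "disease": "flu"},
    {"name": "Finn", "postcode": "E2 0RX", "age": 48, "gender": "M", "disease": "asthma"},
]

QUASI_IDENTIFIERS = ("name", "postcode", "age", "gender")


def generalise(record):
    """Redact the name, coarsen the postcode, and bucket the age into a ten-year band."""
    decade = record["age"] // 10 * 10
    return {
        "name": "*",                                 # redacted
        "postcode": record["postcode"].split()[0],   # keep only the first part
        "age": f"{decade}-{decade + 9}",             # e.g. 23 becomes "20-29"
        "gender": record["gender"],                  # left unchanged
        "disease": record["disease"],                # sensitive attribute, left unchanged
    }


def smallest_group_size(records, quasi_identifiers):
    """Size of the smallest group sharing the same quasi-identifier values."""
    groups = Counter(tuple(r[q] for q in quasi_identifiers) for r in records)
    return min(groups.values()) if groups else 0


anonymised = [generalise(r) for r in raw]
print(smallest_group_size(raw, QUASI_IDENTIFIERS))
print(smallest_group_size(anonymised, QUASI_IDENTIFIERS))
```

Before the transformation every record is unique on its quasi-identifiers. Afterwards the six records fall into two groups of three, at the cost of losing exact ages and full postcodes.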
While k-anonymity can provide some useful guarantees, the technique comes with the following conditions:
K-anonymisation is still a powerful tool when applied appropriately and with the right safeguards in place, such as access control and contractual safeguards. It forms an important part of the arsenal of privacy enhancing technologies, alongside alternative techniques such as differentially private algorithms. As big data becomes the norm rather than the exception, we see increasing dimensionality of data, as well as more and more public datasets that can be used to aid re-identification efforts.
With Privitar, it is now easier to use powerful privacy-preserving techniques such as k-anonymisation. These de-identification tools help you understand what privacy guarantees you have in place and apply consistent privacy protections across datasets, while solving the challenge of maximising utility.