By Kish Galappatti, Data Privacy Engineer at Privitar
Broadly, data de-identification is a set of privacy-preserving techniques that enable your organization to adjust what is available to data scientists. These techniques, such as data generalization, allow you to manage risk and tune what’s available for analysis based on the analysis, model, or use. Understanding the different techniques will help you decide which ones are appropriate for your use case.
Data generalization allows you to replace a data value with a less precise one using a few different techniques, which preserves data utility and protects against some types of attacks that could lead to re-identification of individuals or reveal private information unintentionally.
Data generalization, also known as blurring, transforms one value into a more imprecise one. This can be done in various ways, including binning (where values within a range are all converted to that range), or providing a less specific value. For instance, a date of birth could be blurred to become a month of birth. A specific value, such as £14, could be expressed as a range, such as £10-£20.
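The two examples above can be sketched in a few lines of Python. This is a minimal illustration, not Privitar's implementation: the function names, the fixed bin width of 10, and the convention that a value on a boundary falls into the upper bin are all assumptions made for the example.

```python
from datetime import date


def generalize_date_to_month(d: date) -> str:
    """Blur a full date to its year and month (hypothetical helper)."""
    return d.strftime("%Y-%m")


def bin_amount(amount: float, width: int = 10) -> str:
    """Replace an amount with the range it falls into (hypothetical helper).

    Assumption: bins are half-open, so 20 falls into £20-£30, not £10-£20.
    """
    lower = int(amount // width) * width
    return f"£{lower}-£{lower + width}"


print(generalize_date_to_month(date(1985, 3, 14)))  # 1985-03
print(bin_amount(14))                               # £10-£20
```

Wider bins or coarser dates (year instead of month) give more protection at the cost of precision, which is the trade-off the rest of this article discusses.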
There are two main forms of generalization: automated and declarative.
There are two main types of identifiers: direct and quasi identifiers. A direct identifier, absent any other information, can identify an individual in a dataset and allow data about that individual to be linked. However, direct identifiers may or may not be unique. For example, in the table below, customer ID, email address, and credit card number are all unique and therefore enable you to single out an individual.
The size of the dataset matters as well. For example, in a small dataset, names may be unique, but multiple individuals may share the same name in large datasets. Names are nevertheless considered a direct identifier, even though they’re not always unique, because they often allow for identification.
Quasi identifiers don’t enable you to identify an individual in a dataset on their own, but they can be used to identify individuals when combined. So quasi identifiers have two important properties:
In this table, only one person in the dataset is male and lives in Chesapeake, so that combination of quasi identifiers is enough to identify him, even when you remove additional information.
Any individual’s name, gender, address, and ZIP code are likely to be available from other sources, such as voter registration lists. So, these pieces of data can help identify individuals. Deciding which values are direct or quasi identifiers can be challenging because it requires that you understand what data is available (or may become available in the future, which can be tricky to determine).
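One way to spot this kind of risk is to count how many records share each combination of quasi identifiers. The sketch below uses invented records (they are not the rows of the table referenced above, and the second city is made up), and `unique_combinations` is a hypothetical helper name.

```python
from collections import Counter

# Invented records for illustration only.
records = [
    {"gender": "M", "city": "Chesapeake"},
    {"gender": "F", "city": "Chesapeake"},
    {"gender": "F", "city": "Chesapeake"},
    {"gender": "M", "city": "Norfolk"},
    {"gender": "M", "city": "Norfolk"},
]


def unique_combinations(rows, quasi_identifiers):
    """Return the quasi-identifier combinations that match exactly one record."""
    counts = Counter(tuple(row[q] for q in quasi_identifiers) for row in rows)
    return [combo for combo, n in counts.items() if n == 1]


# Neither column singles anyone out alone, but the combination does.
print(unique_combinations(records, ["gender"]))          # []
print(unique_combinations(records, ["city"]))            # []
print(unique_combinations(records, ["gender", "city"]))  # [('M', 'Chesapeake')]
```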
Understanding direct and quasi identifiers gives us a baseline to talk about pseudonymous data. Pseudonymous data is data that isn’t directly identifying but can be used in conjunction with other data to identify an individual. Therefore, removing direct identifiers can (in most cases) render data pseudonymous.
Masking is effective at obscuring direct identifiers, but used alone it may be insufficient to protect against the risk of re-identification. Individuals might still be identified through unique combinations of other information known about them. For example, while most individuals have a unique combination of date of birth (DoB), zip code, and gender, far fewer are unique if you clip the zip code to include just the first few digits, generalize the DoB to the month or year of birth, and redact the gender. Using multiple masking techniques, including generalization, can produce a k-anonymous output dataset. k-anonymity is a property of the dataset, where every record is indistinguishable from at least k-1 others. Take a simple dataset of name, DoB, zip code, and gender:
Using redaction and generalization, you can turn it into a k-anonymous dataset of k=2 as follows:
Or it could be further generalized and redacted to achieve k=4:
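As a rough illustration of the k-anonymity property described above, the sketch below measures k as the size of the smallest group of records that share the same quasi-identifier values, before and after generalization. The records are invented for the example (they are not the article's tables), and `k_of` and `generalize` are hypothetical helpers, not part of any Privitar API.

```python
from collections import Counter


def k_of(rows, quasi_identifiers):
    """Return k: the size of the smallest group sharing the same quasi-identifier values."""
    counts = Counter(tuple(row[q] for q in quasi_identifiers) for row in rows)
    return min(counts.values())


def generalize(row):
    """Redact the name, keep only the birth year, and clip the zip code to three digits."""
    return {
        "name": "*",
        "dob": row["dob"][:4],
        "zip": row["zip"][:3] + "**",
        "gender": row["gender"],
    }


# Invented records for illustration only.
raw = [
    {"name": "Alice", "dob": "1985-03-14", "zip": "23320", "gender": "F"},
    {"name": "Beth", "dob": "1985-11-02", "zip": "23322", "gender": "F"},
    {"name": "Carl", "dob": "1990-06-21", "zip": "23451", "gender": "M"},
    {"name": "Dan", "dob": "1990-01-30", "zip": "23456", "gender": "M"},
]
quasi_identifiers = ["dob", "zip", "gender"]

print(k_of(raw, quasi_identifiers))                           # 1
print(k_of([generalize(r) for r in raw], quasi_identifiers))  # 2
```

Reaching a higher k on the same records would require coarser generalization, for example dropping the birth year to a decade or redacting gender, which is the kind of utility trade-off the next paragraph describes.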
In more complex datasets, you can prioritize values when you need to be more precise using more advanced generalization algorithms. For example, if you are working with data to perform a gender pay gap analysis, you need to retain gender and generalize other details into fine-grained ranges.
The diagram above shows how Privitar’s unique automatic generalization capability can create clusters or bins of data by blurring indirect identifiers.
Data generalization helps you to take personal data and abstract it, such that you take away the personally identifying attributes. This enables you to analyze the data you’re gathering without compromising the privacy of the individuals in your dataset. It’s important to note that there are different ways to generalize data, and you want to use the method that makes the most sense for your use case. Sometimes the most appropriate course is to apply masking to direct identifiers, while in other cases you want to retain signal in the analytics of data. No single approach is a silver bullet for maintaining privacy, which is why you need to understand different techniques, such as tokenization, redaction, and pseudonymization, and apply them as appropriate to maintain the greatest data utility without unduly compromising privacy.
Want to learn more about how pseudonymization and other forms of de-identification can help you keep your data safe and usable? Check out Privitar’s Complete Guide to Data De-Identification.