Understanding Data Generalization & Advanced De-identification Techniques

September 9, 2020

By Kish Galappatti, Data Privacy Engineer at Privitar 

Broadly, de-identification is a set of privacy-preserving techniques that let your organization control what data is available to data scientists. These techniques allow you to manage risk and tune the level of detail exposed to suit a particular analysis, model, or use. Understanding the different techniques will help you decide which ones fit your use case.

What is data generalization?

Data generalization, also known as blurring, replaces a data value with a less precise one. This preserves much of the data’s utility while protecting against some types of attacks that could re-identify individuals or unintentionally reveal private information.

Generalization can be done in various ways, including binning (where every value within a range is replaced by that range) or substituting a less specific value. For instance, a date of birth could be blurred to a month of birth, and a specific value such as £14 could be expressed as a range such as £10-£20.
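The two examples above can be sketched in a few lines of Python. This is a minimal illustration rather than any product's implementation; the function names `blur_date_to_month`, `bin_value`, and `format_range` are hypothetical, and the bin width of 10 is simply the one implied by the £10-£20 example.

```python
from datetime import date
from typing import Tuple


def blur_date_to_month(d: date) -> str:
    """Generalize a full date to year and month only."""
    return f"{d.year:04d}-{d.month:02d}"


def bin_value(value: float, width: int) -> Tuple[int, int]:
    """Return the (low, high) bounds of the fixed-width bin containing value.

    Bins are half-open, [low, high), so 20 falls into 20-30, not 10-20.
    """
    low = int(value // width) * width
    return (low, low + width)


def format_range(bounds: Tuple[int, int], symbol: str = "£") -> str:
    """Render a bin as a human-readable currency range."""
    low, high = bounds
    return f"{symbol}{low}-{symbol}{high}"


if __name__ == "__main__":
    print(blur_date_to_month(date(1985, 3, 17)))  # 1985-03
    print(format_range(bin_value(14, 10)))        # £10-£20
```

Treating bins as half-open is a design choice made here so that every value belongs to exactly one bin; the article does not say how boundary values should be handled.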

There are two main forms of generalization, automated and declarative:

  • Automated generalization blurs values until the dataset reaches a specified value of k, where k is the minimum number of records that must share the same combination of values (see the discussion of k-anonymity below). This option can offer the best tradeoff between privacy and accuracy, because an algorithm applies the minimum amount of distortion required to achieve the stated k. There are several ways to reach any given k, so you can specify which values matter most for your use case, and those values are blurred least.
  • Declarative generalization lets you specify the bin sizes up front. For example, you might always round dates to whole months. Sometimes this method results in discarding outliers, which can distort the data and introduce bias. Declarative generalization also doesn’t necessarily result in k-anonymity. Even so, it’s good practice to apply it as a default so that the recipient of the de-identified data sees only the level of detail they require.
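The difference between the two forms can be illustrated with a short Python sketch. The declarative function takes a bin width chosen by the caller. The automated function is a deliberately naive stand-in: it doubles a single global bin width until every bin holds at least k values, which is far cruder than the minimum-distortion algorithms described above. Both function names are hypothetical, and the code assumes non-negative integer values.

```python
from collections import Counter
from typing import List, Tuple

Bin = Tuple[int, int]


def declarative_bins(values: List[int], width: int) -> List[Bin]:
    """Declarative: the caller fixes the bin width up front."""
    return [((v // width) * width, (v // width) * width + width) for v in values]


def automated_bins(values: List[int], k: int, start_width: int = 1) -> Tuple[int, List[Bin]]:
    """Automated (naive): double the width until every bin holds >= k values."""
    if k < 1 or len(values) < k:
        raise ValueError("k must be between 1 and the number of values")
    if any(v < 0 for v in values):
        raise ValueError("this sketch assumes non-negative values")
    width = start_width
    while True:
        bins = declarative_bins(values, width)
        if min(Counter(bins).values()) >= k:
            return width, bins
        width *= 2  # blur further and try again


if __name__ == "__main__":
    ages = [21, 22, 23, 24, 35, 36, 38, 39]  # hypothetical data
    width, bins = automated_bins(ages, k=2)
    print(width)           # 16
    print(sorted(set(bins)))  # [(16, 32), (32, 48)]
```

Note that `declarative_bins(ages, 10)` would return bins immediately but gives no guarantee about how many values land in each one, which is the point made in the second bullet.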

Understanding identifiers

There are two main types of identifiers: direct and quasi. A direct identifier can, absent any other information, identify an individual in a dataset and allow data about that individual to be linked. Direct identifiers may or may not be unique, however. For example, in the table below, customer ID, email address, and credit card number are all unique and therefore enable you to single out an individual.

Direct Identifiers

The size of the dataset matters as well. For example, names may be unique in a small dataset, but multiple individuals may share the same name in a large one. Names are nevertheless considered a direct identifier, because they often allow for identification.

Quasi identifiers don’t enable you to identify an individual in a dataset on their own, but they can identify individuals when combined. Quasi identifiers therefore have two important properties:

  1. Their combination can be unique in a dataset.
  2. Quasi identifiers are likely to be present in other available datasets (or become so in the future), which allows datasets to be linked.

In this table, only one person in the dataset is male and lives in Chesapeake. That combination of quasi identifiers is therefore enough to identify him, even when other information is removed.
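One way to look for risky combinations like this is to count how many records share each combination of quasi identifiers and flag those that occur only once. The sketch below is a minimal illustration; the rows are invented for the example (they are not the article's table), and `unique_combinations` is a hypothetical helper name.

```python
from collections import Counter
from typing import Dict, List, Sequence, Tuple

# Hypothetical rows, invented for illustration.
records = [
    {"gender": "F", "city": "Chesapeake"},
    {"gender": "F", "city": "Chesapeake"},
    {"gender": "M", "city": "Chesapeake"},
    {"gender": "M", "city": "Norfolk"},
    {"gender": "M", "city": "Norfolk"},
]


def unique_combinations(
    rows: List[Dict[str, str]], quasi_identifiers: Sequence[str]
) -> List[Tuple[str, ...]]:
    """Return every quasi-identifier combination that matches exactly one row."""
    counts = Counter(tuple(row[q] for q in quasi_identifiers) for row in rows)
    return [combo for combo, n in counts.items() if n == 1]


if __name__ == "__main__":
    print(unique_combinations(records, ["gender", "city"]))
    # [('M', 'Chesapeake')]
```

Neither gender nor city is unique on its own in these rows; only the combination singles someone out.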

Quasi Identifiers

An individual’s name, gender, address, and ZIP code are likely to be available from other sources, such as voter registration lists, so these pieces of data can help identify individuals. Deciding which values are direct or quasi identifiers can be challenging, because it requires that you understand what data is available now or may become available in the future, which can be tricky to determine.
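The linking risk can be sketched as a simple join. The example below assumes two invented datasets: a "de-identified" table with names removed and a public list that still carries names. It reports a match only when exactly one public record shares the quasi identifiers. All data and the `link` function are hypothetical.

```python
from typing import Dict, List, Sequence, Tuple

Row = Dict[str, str]

# Hypothetical de-identified data: names removed, quasi identifiers kept.
deidentified = [
    {"gender": "M", "zip": "23320", "purchase": "camping stove"},
    {"gender": "F", "zip": "23320", "purchase": "running shoes"},
]

# Hypothetical public list, such as a voter registration extract.
public = [
    {"name": "John Doe", "gender": "M", "zip": "23320"},
    {"name": "Jane Roe", "gender": "F", "zip": "23320"},
    {"name": "Mary Major", "gender": "F", "zip": "23320"},
]


def link(private_rows: List[Row], public_rows: List[Row], keys: Sequence[str]) -> List[Tuple[str, Row]]:
    """Re-identify private rows whose quasi identifiers match one public row."""
    index: Dict[Tuple[str, ...], List[str]] = {}
    for row in public_rows:
        index.setdefault(tuple(row[k] for k in keys), []).append(row["name"])
    matches = []
    for row in private_rows:
        names = index.get(tuple(row[k] for k in keys), [])
        if len(names) == 1:  # an unambiguous match is a re-identification
            matches.append((names[0], row))
    return matches


if __name__ == "__main__":
    for name, row in link(deidentified, public, ["gender", "zip"]):
        print(name, "->", row["purchase"])  # John Doe -> camping stove
```

The second private row is not re-identified here because two public records share its quasi identifiers, which is the intuition behind k-anonymity.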

In 2006, Netflix published a dataset containing the film ratings of roughly 500,000 subscribers. Netflix believed that the data was anonymous, but researchers from the University of Texas at Austin were able to link the data with publicly available ratings from the Internet Movie Database (IMDb) to re-identify Netflix subscribers. This is an example of failing to correctly identify and protect quasi identifiers.

Understanding direct and quasi identifiers gives us a baseline to talk about pseudonymous data. Pseudonymous data is data that isn’t directly identifying but can be used in conjunction with other data to identify an individual. Therefore, removing direct identifiers can (in most cases) render data pseudonymous.
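As a rough sketch of what removing direct identifiers can look like in practice, the example below drops the direct-identifier fields from a record and substitutes a keyed hash so that rows belonging to the same person can still be joined. The use of HMAC-SHA256 is an assumption made for illustration, not a method described in this article, and the field names, `SECRET_KEY`, and `pseudonymize` are all hypothetical.

```python
import hashlib
import hmac
from typing import Dict

# Hypothetical key; in practice it would come from a secrets manager.
SECRET_KEY = b"replace-with-a-managed-secret"

# Hypothetical field names treated as direct identifiers.
DIRECT_IDENTIFIERS = {"customer_id", "email", "card_number"}


def token(value: str, key: bytes = SECRET_KEY) -> str:
    """Derive a consistent, non-reversible token from a value."""
    return hmac.new(key, value.encode("utf-8"), hashlib.sha256).hexdigest()[:16]


def pseudonymize(record: Dict[str, str], link_field: str = "customer_id") -> Dict[str, str]:
    """Drop direct identifiers and add a token derived from link_field."""
    out = {k: v for k, v in record.items() if k not in DIRECT_IDENTIFIERS}
    out["token"] = token(record[link_field])
    return out


if __name__ == "__main__":
    row = {"customer_id": "C-1001", "email": "a@example.com", "zip": "23320", "gender": "F"}
    print(pseudonymize(row))
```

The output still contains ZIP code and gender, so it is pseudonymous rather than anonymous: the quasi identifiers remain available for linking.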

Masking identifiers and more

Masking is effective at obscuring direct identifiers but, used alone, may be insufficient to protect against the risk of re-identification. Individuals might still be identified through unique combinations of other information known about them. For example, while most individuals have a unique combination of date of birth (DoB), ZIP code, and gender, there are fewer unique individuals if the ZIP code is clipped to just the first few digits, the DoB is generalized to the month or year of birth, and the gender is redacted.

Using multiple masking techniques, including generalization, can produce a k-anonymous output dataset. k-anonymity is a property of a dataset in which every record is indistinguishable from at least k-1 others. Take a simple dataset of name, DoB, ZIP code, and gender:

Simple Dataset - Name, DoB, ZIP, Gender

Using redaction and generalization, you can turn it into a k-anonymous dataset of k=2 as follows:

k-Anonymous Dataset k=2

Or it could be further generalized and redacted to achieve k=4: 

k-Anonymous Dataset k=4
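The progression from k=1 to k=2 to k=4 can be reproduced on a small invented dataset. The sketch below is illustrative only: the four rows are hypothetical (they are not the article's tables), and `k_of` and `generalize` are hypothetical helper names. The value of k is computed as the size of the smallest group of records sharing the same quasi-identifier values.

```python
from collections import Counter
from typing import Dict, List, Sequence

Row = Dict[str, str]
QUASI = ["dob", "zip", "gender"]

# Hypothetical rows, invented for illustration.
raw = [
    {"name": "Alice", "dob": "1985-03-17", "zip": "23320", "gender": "F"},
    {"name": "Beth", "dob": "1985-11-02", "zip": "23322", "gender": "F"},
    {"name": "Carl", "dob": "1990-06-30", "zip": "23451", "gender": "M"},
    {"name": "Dan", "dob": "1990-01-12", "zip": "23456", "gender": "M"},
]


def k_of(rows: List[Row], quasi_identifiers: Sequence[str] = QUASI) -> int:
    """Return k: the size of the smallest group sharing all quasi identifiers."""
    counts = Counter(tuple(r[q] for q in quasi_identifiers) for r in rows)
    return min(counts.values())


def generalize(row: Row, zip_digits: int, dob_chars: int, keep_gender: bool) -> Row:
    """Redact the name, clip the ZIP code, truncate the DoB, optionally redact gender."""
    zip_code = row["zip"]
    return {
        "dob": row["dob"][:dob_chars] or "*",  # 4 chars keeps the year, 0 redacts
        "zip": zip_code[:zip_digits] + "*" * (len(zip_code) - zip_digits),
        "gender": row["gender"] if keep_gender else "*",
    }


if __name__ == "__main__":
    level1 = [generalize(r, zip_digits=3, dob_chars=4, keep_gender=True) for r in raw]
    level2 = [generalize(r, zip_digits=2, dob_chars=0, keep_gender=False) for r in raw]
    print(k_of(raw), k_of(level1), k_of(level2))  # 1 2 4
```

Each step up in k costs detail: the k=4 version of these rows retains only the first two ZIP digits, which shows the privacy and utility tradeoff in miniature.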

In more complex datasets, more advanced generalization algorithms let you prioritize the values where you need the most precision. For example, if you are working with data to perform a gender pay gap analysis, you need to retain gender and generalize other details into fine-grained ranges.

Privitar's Data Generalization Screenshot

The diagram above shows how Privitar’s unique automatic generalization capability can create clusters or bins of data by blurring indirect identifiers.

Why would we want to use data generalization?

Data generalization lets you abstract personal data so that its personally identifying attributes are removed. This enables you to analyze the data you’re gathering without compromising the privacy of the individuals in your dataset. There are different ways to generalize data, and you want to use the method that makes the most sense for your use case. Sometimes the most appropriate course is to apply masking to direct identifiers, while in other cases you want to retain analytical signal in the data. No single approach is a silver bullet for maintaining privacy, which is why you need to understand different techniques, such as tokenization, redaction, and pseudonymization, and apply them as appropriate to maintain the greatest data utility without unduly compromising privacy.

Want to learn more about how pseudonymization and other forms of de-identification can help you keep your data safe and usable? Check out Privitar’s Complete Guide to Data De-Identification.
