by Shih Huei Tan, Solution Architect at Privitar

What is Pseudonymization?
Pseudonymization is one of the techniques used to de-identify data. Sensitive data fields are replaced with pseudonyms to hide the identity of the individuals. Consistent pseudonymization applies the same pseudonym to the same individual throughout the dataset. This is very useful in longitudinal studies, or for other purposes where it is necessary to link data collected at different times relating to the same data subject (for example, a customer). Pseudonyms can also preserve the format of the original data, which can be useful in some circumstances.
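To make the idea of consistency concrete, here is a minimal sketch in Python. It assumes a keyed hash (HMAC-SHA256) is used to derive the pseudonym; that is one common way to get the same output for the same input, not necessarily how any particular product does it. The key, the function name and the sample identifiers are all hypothetical.

```python
import hashlib
import hmac

# Hypothetical secret key for illustration only; a real key would be
# generated randomly and stored securely.
SECRET_KEY = b"example-key-for-illustration-only"


def consistent_pseudonym(value: str, length: int = 8) -> str:
    """Return the same pseudonym every time the same value is passed in."""
    digest = hmac.new(SECRET_KEY, value.encode("utf-8"), hashlib.sha256).hexdigest()
    # Truncating keeps the pseudonym short; shorter pseudonyms make it
    # more likely that two different inputs collide.
    return digest[:length].upper()


# The same identifier always maps to the same pseudonym, so records
# collected at different times can still be linked.
first = consistent_pseudonym("customer-001")
second = consistent_pseudonym("customer-001")
other = consistent_pseudonym("customer-002")
print(first == second)  # the two pseudonyms for customer-001 match
```

Because the pseudonym depends on the key, anyone without the key cannot simply recompute it from a guessed identifier.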

Why Pseudonymization Matters
Data is a valuable resource to many organizations and essential to many data-driven initiatives, ranging from improving customer service and driving more effective marketing campaigns to enhancing healthcare delivery and organizational excellence.

Often, data that is used for these purposes may contain personally identifiable information, or primary identifiers, of customers (e.g. names, email addresses, phone numbers, social security numbers, passport numbers). These are attributes that can directly identify a person due to the nature of the information. There may also be secondary identifiers within the data that, in isolation, may not reveal the identity of a person but, when coupled with other data points, can lead to re-identification (e.g. birthdays, addresses, salary, age, job title and gender). For example, if you have an employee dataset which contains a person with a job title of Chief Executive Officer, that person’s identity will be quite obvious from that information alone, without even looking at the primary identifiers.

So how do organizations use sensitive data while ensuring that sufficient safeguards are in place to protect privacy and remain compliant with data protection regulations?

Putting Pseudonymization Into Practice
Let us take the example of a bank that wants to analyze customer spending patterns over the month of June to identify its high-value customers. To do this, it needs to use the customer transaction dataset. The dataset below contains personally identifiable information such as names, account IDs and email IDs. The analysts working with this data do not need to see these sensitive customer details to perform their tasks, and sharing that information with them can expose the bank to unnecessary risk and compliance issues. This is where pseudonymization comes in.

S/NO  Name   Account ID  Email ID           Transaction Value  Transaction Date
1     John   AC4481245   john@gmail.com     59.45              05/06/20
2     Jenny  AC1114455   jenny@hotmail.com  12.50              07/06/20
3     Tom    AC1214445   tom@emal.com       9.50               11/06/20
4     John   AC4481245   john@gmail.com     52.50              13/06/20
5     Brian  AC4545553   brian@outlook.com  18.50              15/06/20
6     John   AC4481245   john@gmail.com     34.50              18/06/20

 
De-identifying Data Through Pseudonymization
Below is an example of the same dataset after it has been de-identified. Customer names have been pseudonymized to a string of 7 random characters so that the original names are no longer visible. The account ID and email fields have been pseudonymized consistently, so John (records 1, 4 and 6) is assigned the same account ID and email pseudonyms in every one of his records. This allows the analysts to calculate the total value of transactions made by each customer, because the data can be grouped and summarized by account or email ID. The format-preserving pseudonymized email addresses also make it easy to recognize that the column contains customer emails without having to refer to the column headings.

S/NO  Name     Account ID  Email ID             Transaction Value  Transaction Date
1     DFJFSDF  X321343T    idrshdy@gmail.com    59.45              05/06/20
2     LKGJSHF  C125100C    jfhstey@hotmail.com  12.50              07/06/20
3     LGKKGJD  F454587T    kfjdhsh@emal.com     9.50               11/06/20
4     FKDHWDD  X321343T    idrshdy@gmail.com    52.50              13/06/20
5     FKSJFJD  F454587T    ofhstfj@outlook.com  18.50              15/06/20
6     HSYGJEX  X321343T    idrshdy@gmail.com    34.50              18/06/20

 
Based on the scenario outlined above, we can see how personally identifiable information within the customer dataset has been de-identified through a process of pseudonymization. We have the option of applying it randomly or consistently, as well as making the pseudonyms retain the original format, as in the case of the email addresses.
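The three options in this scenario (random, consistent and format-preserving pseudonyms) can be sketched in a few lines of Python. This is an illustration under stated assumptions rather than a description of any product: the key, the helper names and the choice of HMAC-SHA256 for the consistent pseudonyms are all hypothetical, and the pseudonyms it produces will differ from those in the table above.

```python
import hashlib
import hmac
import secrets
import string
from collections import defaultdict

# Hypothetical secret key for illustration only.
SECRET_KEY = b"example-key-for-illustration-only"


def keyed_letters(value: str, length: int) -> str:
    """Consistent pseudonym: the same value always yields the same letters."""
    digest = hmac.new(SECRET_KEY, value.encode("utf-8"), hashlib.sha256).digest()
    # Map each byte to a letter (slightly biased, which is fine for a sketch).
    return "".join(string.ascii_uppercase[b % 26] for b in digest[:length])


def random_name(length: int = 7) -> str:
    """Random pseudonym: a fresh value on every call, so records cannot be linked."""
    return "".join(secrets.choice(string.ascii_uppercase) for _ in range(length))


def pseudonymize_email(email: str) -> str:
    """Format-preserving and consistent: keeps the user@domain shape and the domain."""
    _local, _, domain = email.partition("@")
    return keyed_letters(email, 7).lower() + "@" + domain


# Name, account ID, email ID and transaction value from the example dataset.
transactions = [
    ("John", "AC4481245", "john@gmail.com", 59.45),
    ("Jenny", "AC1114455", "jenny@hotmail.com", 12.50),
    ("Tom", "AC1214445", "tom@emal.com", 9.50),
    ("John", "AC4481245", "john@gmail.com", 52.50),
    ("Brian", "AC4545553", "brian@outlook.com", 18.50),
    ("John", "AC4481245", "john@gmail.com", 34.50),
]

deidentified = [
    (random_name(), keyed_letters(account, 8), pseudonymize_email(email), value)
    for _name, account, email, value in transactions
]

# Analysts can still total spending per customer using only the pseudonyms.
totals = defaultdict(float)
for _name, account, _email, value in deidentified:
    totals[account] += value

for account, total in totals.items():
    print(account, round(total, 2))
```

The per-customer totals match what the analysts would have computed on the original data, even though they never see a real name, account ID or email address.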

Pseudonymization protects the privacy of the individuals within the dataset by obfuscating the identifiers, while ensuring that the information retains its utility and enabling data analysts to extract the insights they need for analytical use cases.

Editor’s Note: Privitar is launching a new series focused on demystifying some foundational, but often misunderstood, elements of data privacy. This week, we’ll explore pseudonymization. Each week, we’ll dig into a new topic, defining key terminology, explaining why it is important, and showing how you can implement it as part of your data privacy efforts. We’ll also provide some real-life examples to demonstrate the concept in action, and help readers think about use cases that they can put into practice.

Want to learn more about how pseudonymization and other forms of de-identification can help you keep your data safe and usable? Check out Privitar’s Complete Guide to Data De-Identification.