Pseudonymization 101

June 22, 2020

by Shih Huei Tan, Solution Architect at Privitar

Editor’s Note: Privitar is launching a new series focused on demystifying some foundational, but often misunderstood, elements of data privacy. This week, we’ll explore pseudonymization. Each week, we’ll dig into a new topic, defining key terminology, explaining why it is important, and showing how you can implement it as part of your data privacy efforts. We’ll also provide some real-life examples to demonstrate the concept in action and help readers think about use cases that they can put into practice.

What is Pseudonymization?
Pseudonymization is one of the techniques used to de-identify data. Sensitive data fields are replaced with pseudonyms to hide the identity of the individuals. With consistent pseudonymization, the same original value receives the same pseudonym everywhere it appears in the dataset. This is very useful in longitudinal studies, or for other purposes where it is necessary to link data collected at different times relating to the same data subject (the customer in this situation). Pseudonyms can also preserve the structure of the original data, which is useful when, for example, a field must still look like an email address.
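As a minimal sketch of consistent pseudonymization, the example below derives a pseudonym from a keyed hash (HMAC-SHA256), so that the same input always yields the same token. This is one common approach and an assumption made for illustration, not necessarily how any particular product works; the `pseudonymize` function and `SECRET_KEY` value are hypothetical.

```python
import hashlib
import hmac

# Hypothetical secret key for illustration; a real key would be generated
# randomly and stored securely, separate from the data.
SECRET_KEY = b"example-key-not-for-production"


def pseudonymize(value: str, key: bytes = SECRET_KEY, length: int = 12) -> str:
    """Return a consistent pseudonym: the same value and key give the same token."""
    digest = hmac.new(key, value.encode("utf-8"), hashlib.sha256).hexdigest()
    return digest[:length].upper()


first = pseudonymize("AC4481245")
second = pseudonymize("AC4481245")
other = pseudonymize("AC1114455")

print(first == second)  # True: repeated values map to the same pseudonym
print(first == other)   # different inputs get different tokens, barring a rare collision
```

Because the token depends on the key, someone without the key cannot simply hash a list of known account IDs and compare the results.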

Why Pseudonymization Matters
Data is a valuable resource to many organizations and essential to many data-driven initiatives, from improving customer service and driving more effective marketing campaigns to enhancing healthcare delivery and organizational excellence.

Often, data that is used for these purposes may contain personally identifiable information, or primary identifiers, of customers (e.g. names, email addresses, phone numbers, social security numbers, passport numbers). These are attributes that can directly identify a person due to the nature of the information. There may also be secondary identifiers within the data (e.g. birthdays, addresses, salary, age, job title and gender). Used in isolation, these may not reveal the identity of a person, but when they are coupled with other data points, re-identification can happen. For example, if you have an employee dataset which contains a person with a job title of Chief Executive Officer, that person’s identity will be quite obvious from that information alone, without even looking at the primary identifiers.
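The risk from secondary identifiers can be illustrated with a short sketch that looks for combinations of values that occur only once in a dataset. The employee records and the `unique_combinations` helper below are hypothetical and exist only to show the idea.

```python
from collections import Counter

# Hypothetical employee records with the primary identifiers already removed.
employees = [
    {"job_title": "Analyst", "gender": "F", "age": 29},
    {"job_title": "Analyst", "gender": "F", "age": 29},
    {"job_title": "Analyst", "gender": "M", "age": 41},
    {"job_title": "Analyst", "gender": "M", "age": 41},
    {"job_title": "Chief Executive Officer", "gender": "F", "age": 52},
]


def unique_combinations(records, fields):
    """Return the combinations of field values that occur exactly once."""
    counts = Counter(tuple(record[f] for f in fields) for record in records)
    return [combo for combo, n in counts.items() if n == 1]


risky = unique_combinations(employees, ["job_title", "gender", "age"])
print(risky)  # [('Chief Executive Officer', 'F', 52)]
```

A record that is the only one with its combination of values stands out even though no name or email address is present.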

So how do organizations use sensitive data while ensuring that sufficient safeguards are in place to protect privacy and that they remain compliant with data protection regulations?

Putting Pseudonymization Into Practice
Let us take the example of a bank that wants to analyze customer spending patterns over the month of June to determine its high value customers. In order to do this, the bank will need to use the customer transaction dataset. Looking at the dataset below, you will notice that it contains personally identifiable information such as names, account IDs and email IDs. The analysts working with this data do not need to view these sensitive customer details in order to perform their tasks, and sharing that information can expose the bank to unnecessary risks and compliance issues. This is where pseudonymization comes in.

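| S/NO | Name | Account ID | Email ID | Transaction Value | Transaction Date |
|------|------|------------|----------|-------------------|------------------|
| 1 | John | AC4481245 | john@gmail.com | 59.45 | 05/06/20 |
| 2 | Jenny | AC1114455 | jenny@hotmail.com | 12.50 | 07/06/20 |
| 3 | Tom | AC1214445 | tom@emal.com | 9.50 | 11/06/20 |
| 4 | John | AC4481245 | john@gmail.com | 52.50 | 13/06/20 |
| 5 | Brian | AC4545553 | brian@outlook.com | 18.50 | 15/06/20 |
| 6 | John | AC4481245 | john@gmail.com | 34.50 | 18/06/20 |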


De-identifying Data Through Pseudonymization
Below is an example of the same dataset after it has been de-identified. Customer names have been pseudonymized to a string of 7 random characters so that the original names are no longer visible. The account ID and email ID fields have been pseudonymized consistently, so John (in records 1, 4 and 6) has the same values assigned to every occurrence of his record. This allows the analysts to find the total value of transactions made by each customer, because the data can be grouped and summarised by account or email ID. The format-preserving pseudonymized email addresses also make it very easy to recognize that the column contains customer emails without having to refer to the column headings.

| S/NO | Name | Account ID | Email ID | Transaction Value | Transaction Date |
|------|------|------------|----------|-------------------|------------------|
| 1 | DFJFSDF | X321343T | idrshdy@gmail.com | 59.45 | 05/06/20 |
| 2 | LKGJSHF | C125100C | jfhstey@hotmail.com | 12.50 | 07/06/20 |
| 3 | LGKKGJD | F454587T | kfjdhsh@emal.com | 9.50 | 11/06/20 |
| 4 | FKDHWDD | X321343T | idrshdy@gmail.com | 52.50 | 13/06/20 |
| 5 | FKSJFJD | F454587T | ofhstfj@outlook.com | 18.50 | 15/06/20 |
| 6 | HSYGJEX | X321343T | idrshdy@gmail.com | 34.50 | 18/06/20 |
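To illustrate the analysis this enables, here is a minimal sketch that groups the de-identified records above by their pseudonymized Email ID and totals the transaction values. The choice of Python and the variable names are assumptions for illustration; the values are copied from the de-identified dataset.

```python
from collections import defaultdict
from decimal import Decimal

# De-identified records: (pseudonymized Email ID, transaction value).
rows = [
    ("idrshdy@gmail.com", "59.45"),
    ("jfhstey@hotmail.com", "12.50"),
    ("kfjdhsh@emal.com", "9.50"),
    ("idrshdy@gmail.com", "52.50"),
    ("ofhstfj@outlook.com", "18.50"),
    ("idrshdy@gmail.com", "34.50"),
]

# Sum the transaction values per pseudonym; Decimal avoids float rounding.
totals = defaultdict(Decimal)
for email_id, value in rows:
    totals[email_id] += Decimal(value)

# Rank customers by total spend without ever seeing a real identity.
ranked = sorted(totals.items(), key=lambda item: item[1], reverse=True)
for email_id, total in ranked:
    print(email_id, total)
# First line printed: idrshdy@gmail.com 146.45
```

The analyst can identify the highest-value customer by pseudonym alone, which is all the analysis requires.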

Based on the scenario outlined above, we can see how personally identifiable information within the customer dataset has been de-identified through a process of pseudonymization. We have the option of applying it randomly or consistently, as well as making the pseudonyms retain the original format, as in the case of the email addresses.
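The three options mentioned here (random, consistent and format-preserving) can be sketched as follows. This is an illustrative outline under stated assumptions, not a description of a specific product: the function names and `SECRET_KEY` are hypothetical, and the consistent variant uses a keyed hash as one possible technique.

```python
import hashlib
import hmac
import secrets
import string

SECRET_KEY = b"example-key-not-for-production"  # hypothetical key for illustration


def random_pseudonym(length: int = 7) -> str:
    """Random mode: a fresh string of uppercase letters on every call."""
    return "".join(secrets.choice(string.ascii_uppercase) for _ in range(length))


def consistent_letters(value: str, length: int = 7) -> str:
    """Consistent mode: derive lowercase letters from a keyed hash of the value."""
    digest = hmac.new(SECRET_KEY, value.encode("utf-8"), hashlib.sha256).digest()
    # Map each byte to a letter; the small modulo bias is acceptable for a sketch.
    return "".join(string.ascii_lowercase[b % 26] for b in digest[:length])


def pseudonymize_email(email: str) -> str:
    """Format-preserving mode: replace the local part, keep the '@domain' shape."""
    _, _, domain = email.partition("@")
    return f"{consistent_letters(email.lower())}@{domain}"


print(random_pseudonym())                    # differs from call to call
print(pseudonymize_email("john@gmail.com"))  # same output every time, ends in @gmail.com
```

Random pseudonyms suit fields that never need to be linked, such as the names in the example, while the consistent, format-preserving variant suits fields that analysts group on.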

Pseudonymization protects the privacy of the individuals within the dataset by obfuscating the identifiers, while ensuring that the information retains its utility and that data analysts can still extract the insights they need for analytical use cases.

Want to learn more about how pseudonymization and other forms of de-identification can help you keep your data safe and usable? Check out Privitar’s Complete Guide to Data De-Identification.
