Jun 22, 2020
by Shih Huei Tan, Solution Architect at Privitar
What is Pseudonymization?
One of the techniques used to de-identify data is pseudonymization. With pseudonymization, sensitive data fields are replaced with pseudonyms to hide the identity of individuals. Consistent pseudonymization applies the same pseudonym to the same individual throughout the dataset. This is very useful in longitudinal studies, or for other purposes where it is necessary to link data collected at different times relating to the same data subject (the customer, in the example later in this post). Pseudonyms can also preserve the format of the original data, which is useful in some circumstances, for example when a field still needs to be recognizable as an email address.
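To make the idea of consistency concrete, here is a minimal sketch in Python. It assumes keyed hashing (HMAC-SHA256) as the way to derive pseudonyms, which is one common approach and not necessarily the method any particular product uses. The key, the function name `consistent_pseudonym`, and the sample identifiers are all hypothetical.

```python
import hashlib
import hmac

# Hypothetical secret key for illustration only; a real key would be
# generated randomly and stored securely, separate from the data.
SECRET_KEY = b"example-key-for-illustration-only"


def consistent_pseudonym(value: str, key: bytes = SECRET_KEY, length: int = 12) -> str:
    """Derive a pseudonym from a value using a keyed hash.

    The same value and key always produce the same pseudonym, so records
    belonging to one individual can still be linked after de-identification.
    """
    digest = hmac.new(key, value.encode("utf-8"), hashlib.sha256).hexdigest()
    return digest[:length]


# Made-up identifiers: the same input maps to the same pseudonym every time,
# while a different input maps to a different pseudonym.
first = consistent_pseudonym("customer-001")
second = consistent_pseudonym("customer-001")
other = consistent_pseudonym("customer-002")

print(first == second)  # the two pseudonyms for customer-001 match
print(first == other)   # customer-002 gets a different pseudonym
```

Because the pseudonym depends on a secret key, someone without the key cannot simply hash a list of known identifiers to reverse the mapping, which is why a keyed hash is assumed here rather than a plain hash.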
Why Pseudonymization Matters
Data is a valuable resource for many organizations and essential to data-driven initiatives, ranging from improving customer service and driving more effective marketing campaigns to enhancing healthcare delivery and achieving organizational excellence.
Often, data that is used for these purposes contains personally identifiable information, or primary identifiers, of customers (e.g. names, email addresses, phone numbers, social security numbers, passport numbers). These are attributes that can directly identify a person due to the nature of the information. There may also be secondary identifiers within the data (e.g. birthdays, addresses, salary, age, job title and gender). Used in isolation, these may not reveal the identity of a person, but when combined with other data points they can make re-identification possible. For example, if an employee dataset contains a person with the job title of Chief Executive Officer, that person’s identity will be quite obvious from that information alone, without even looking at the primary identifiers.
So how do organizations use sensitive data while ensuring that sufficient safeguards are in place to protect privacy and remain compliant with data protection regulations?
Putting Pseudonymization Into Practice
Let us take the example of a bank that wants to analyze customer spending patterns over the month of June to determine its high-value customers. To do this, the bank will need to use the customer transaction dataset. Looking at the dataset below, you will notice that it contains personally identifiable information such as names, account IDs and email IDs. The analysts working with this data do not need to view these sensitive customer details in order to perform their tasks, and sharing that information with them exposes the bank to unnecessary risks and compliance issues. This is where pseudonymization comes in.
De-identifying Data Through Pseudonymization
Below is an example of the same dataset after it has been de-identified. Customer names have been pseudonymized to a string of 7 random characters so that the original names are no longer visible. The account ID and email fields have been pseudonymized consistently, so John (in records 1, 4 and 6) has the same values assigned in every one of his records. This allows the analysts to find the total transactions made by each customer, because the data can be grouped and summarized by account ID or email ID. The format-preserving pseudonymized email addresses also make it easy to recognize that the column contains customer emails without having to refer to the column headings.
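The scenario above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the bank's or Privitar's actual implementation: the records are made up, the key and the helper names (`keyed_hex`, `random_name`, `pseudonymize_email`, `pseudonymize`) are hypothetical, and the email handling only keeps the familiar `local@domain.tld` shape rather than applying formal format-preserving encryption.

```python
import hashlib
import hmac
import secrets
import string
from collections import defaultdict

KEY = b"illustration-only-key"  # hypothetical key; manage real keys securely


def keyed_hex(value, n):
    """Consistent pseudonym: first n hex characters of a keyed hash."""
    return hmac.new(KEY, value.encode("utf-8"), hashlib.sha256).hexdigest()[:n]


def random_name(length=7):
    """Random pseudonym: a fresh string of random characters on every call."""
    alphabet = string.ascii_uppercase + string.digits
    return "".join(secrets.choice(alphabet) for _ in range(length))


def pseudonymize_email(email):
    """Consistent pseudonym that still looks like an email address."""
    local, _, domain = email.partition("@")
    return f"{keyed_hex(local, 8)}@{keyed_hex(domain, 6)}.com"


def pseudonymize(record):
    return {
        "name": random_name(),                               # random
        "account_id": keyed_hex(record["account_id"], 10),   # consistent
        "email": pseudonymize_email(record["email"]),        # consistent, email-shaped
        "amount": record["amount"],                          # left as-is for analysis
    }


# Made-up transactions for illustration; not the dataset from this post.
records = [
    {"name": "John", "account_id": "ACC-1001", "email": "john@example.com", "amount": 120},
    {"name": "Mary", "account_id": "ACC-1002", "email": "mary@example.com", "amount": 75},
    {"name": "John", "account_id": "ACC-1001", "email": "john@example.com", "amount": 30},
]

deidentified = [pseudonymize(r) for r in records]

# Analysts can still total spending per customer using the pseudonymized account ID.
totals = defaultdict(int)
for row in deidentified:
    totals[row["account_id"]] += row["amount"]

for account, total in totals.items():
    print(account, total)
```

Both of John's rows end up under a single pseudonymized account ID, so his spending can be totalled without his name, real account ID, or real email address ever appearing in the output.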
Based on the scenario outlined above, we can see how personally identifiable information within the customer dataset has been de-identified through a process of pseudonymization. We have the option of applying it randomly or consistently, as well as making the pseudonyms retain the original format, as in the case of the email addresses.
Pseudonymization allows the privacy of the individuals within the dataset to be protected by obfuscating the identifiers, but also ensures that the information retains its utility, and enables the data analysts to extract the necessary insights for analytical use cases.
Editor’s Note: Privitar is launching a new series focused on demystifying some foundational, but often misunderstood, elements of data privacy. This week, we’ll explore pseudonymization. Each week, we’ll dig into a new topic, defining key terminology, explaining why it is important, and showing how you can implement it as part of your data privacy efforts. We’ll also provide some real-life examples to demonstrate the concept in action, and help readers think about use cases that they can put into practice.
Want to learn more about how pseudonymization and other forms of de-identification can help you keep your data safe and usable? Check out Privitar’s Complete Guide to Data De-Identification.