By David Millman, Principal Sales Engineer at Privitar

The rise of data-driven insights has led many enterprises to leverage their data to improve the customer experience, strengthen their brand image, accelerate innovation and decrease time to market. With these insights come more regulations to protect customer privacy and, with them, an increased focus on data privacy. Ensuring that data is properly protected is key to enabling organizations to make the most of their data. But what’s the best way to do this?

Tokenization is typically the main data masking approach for obfuscating primary identifiers within a dataset, which leads many people to believe it is the only technique required. The danger in this thinking is that data owners and executives believe the data is safe when it is in fact still at risk.

To obtain data privacy, a combination of the following techniques needs to be applied. These techniques go beyond tokenization to truly hide an individual and minimize compliance risk. The challenge, and therefore the goal, is not just to create anonymized data, but to also retain its value while allowing more stakeholders internally and externally to benefit from it.

To see why these techniques matter, it helps to look at the demands that data use places on the organization and how to address them.

Starting Point

Let’s start with the simplest case, as shown in Figure 2, where most privacy tools seem to meet the minimum requirements. Here, people from two different parts of an organization, HR and Operations, each have access to a specific dataset.

Because the dataset Access Control Lists (ACLs) map directly to those in the organization’s Active Directory, it is easy to restrict access appropriately. With tokenization as the data masking technique, it also appears possible to limit what an individual can learn from any particular record.

Figure 2 The simple situation most privacy solutions are built for

Unfortunately, the risk is still there, despite outward appearances. Let’s look at the issues in more detail:

  • Trusted employees can become the risk
  • Individuals can still be identified
  • Records can still be linked to other data

Trusted Employees Can Become the Risk

Innovation is all about allowing the data to be used in new and interesting ways, and this ultimately requires more people to have more access. In many organizations, data is being made available to employees, partners and other third parties to scale up innovation and discover new insights. With increased exposure of the data, however, comes increased risk. In the healthcare industry, for example, 60% of data breaches are due to inside actors.

Data does not always leave the firewall through a malicious attack. For example, an employee or consultant could download a subset of data to their laptop so they can work at home. Or an email arrives asking for a file and, without checking, someone attaches a copy of the data and sends it to a recipient from whom it quickly ends up on the internet.

Figure 3 The biggest threat is the trusted employee


Individuals Can Still Be Identified

Most tokenization schemes concentrate on key identifiers, such as first name, last name, address, social security number and email address. These schemes do not take into account quasi-identifiers (i.e., attributes that, when combined, can be used to identify an individual). A compound key built from quasi-identifiers such as transaction or purchase amount, store ID or other components can return a single record. Two techniques need to be applied to eliminate this risk of identification:

  • K-anonymity: often used for healthcare and public datasets, it ensures that no query on a given set of data fields can return fewer than k records.
  • Perturbation: adds a small amount of noise, such as adding or subtracting a small value up to a threshold in a column of data, to de-identify the individual transaction while keeping the overall data statistically correct.

Tokenization and general obfuscation techniques can be applied to individual records, as they don’t rely on knowledge about any other records. Calculating k-anonymity and perturbing data, by contrast, depend on the other records in the dataset and so require more processing power to achieve the desired result while remaining statistically accurate.
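
The two techniques above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not how any particular privacy product implements them: the records, the field names (store_id, zip_code, amount) and the helper functions k_anonymity and perturb are all hypothetical. It shows that a dataset whose direct identifiers are already tokenized can still have k = 1 on its quasi-identifiers, and how bounded zero-mean noise can be applied to a numeric column.

```python
import random
from collections import Counter

# Hypothetical tokenized transactions: direct identifiers are already replaced,
# but store_id, zip_code and amount remain as quasi-identifiers.
records = [
    {"token": "a91f", "store_id": 12, "zip_code": "02139", "amount": 40.00},
    {"token": "c07e", "store_id": 12, "zip_code": "02139", "amount": 55.10},
    {"token": "5d2b", "store_id": 12, "zip_code": "02139", "amount": 19.99},
    {"token": "e3aa", "store_id": 47, "zip_code": "94105", "amount": 310.25},
]

def k_anonymity(rows, quasi_identifiers):
    """Return k: the size of the smallest group of rows that share
    the same combination of quasi-identifier values."""
    groups = Counter(tuple(row[q] for q in quasi_identifiers) for row in rows)
    return min(groups.values())

def perturb(values, max_noise, seed=None):
    """Add uniform noise in [-max_noise, +max_noise] to each value.
    The noise has zero mean, so aggregates stay approximately correct."""
    rng = random.Random(seed)
    return [v + rng.uniform(-max_noise, max_noise) for v in values]

# The last record is the only one for store 47 / zip 94105, so k is 1:
# that combination of fields returns a single, identifiable record.
k = k_anonymity(records, ["store_id", "zip_code"])
print(k)  # 1

# Perturb the amounts by at most 2.00 in either direction.
noisy_amounts = perturb([r["amount"] for r in records], max_noise=2.0, seed=7)
```

Note that k_anonymity has to scan every row before it can answer, which is the extra processing cost described above; tokenizing a single record needs no such pass.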

Records Can Still Be Linked

To feed an exponential innovation and growth curve, organizations must combine siloed data to maximize utility and sharpen insights. This often involves breaking down traditional organizational silos by creating new ad-hoc group memberships. A simple example of this is shown in Figure 4, where subsets of the two groups, HR and Operations, work together to create a new set of data products.

Figure 4 Combine people across organizational boundaries to form new groups that spark innovation

The privacy team is challenged with the following:

  • Ensuring that the subset of HR and Operations employees working together can access only the privatized data in the combined dataset. The ever-growing matrix of datasets, views and permissions for particular contexts makes defining, enforcing and auditing access a managerial headache.
  • Ensuring that the combined dataset is not available to those HR and Operations employees who should not have access.
  • Ensuring that tokenization of the individual datasets eliminates the threat of linkage attacks, in which an individual could be identified by creating new joins across different datasets.

All of these are important to ensure that there is no way for an HR employee to uniquely cross-reference data between the combined and HR datasets. The threat surface increases with the number of groups, departments, locations and employees, which compounds the challenge of auditing and managing privacy at scale.
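
One common way to make tokens useless as a join key across datasets is to derive them with a different secret per dataset. The sketch below assumes a keyed hash (HMAC-SHA256) as the tokenization function; the tokenize helper, the key values and the employee ID are hypothetical, and a real deployment would fetch keys from a managed key store rather than hard-code them.

```python
import hashlib
import hmac

def tokenize(value, dataset_key):
    """Deterministic keyed token: the same value and key always give the same token."""
    digest = hmac.new(dataset_key, value.encode("utf-8"), hashlib.sha256)
    return digest.hexdigest()[:16]

# Hypothetical per-dataset secrets, hard-coded here only for illustration.
HR_KEY = b"hr-dataset-secret"
COMBINED_KEY = b"combined-dataset-secret"

employee_id = "E-10042"
hr_token = tokenize(employee_id, HR_KEY)
combined_token = tokenize(employee_id, COMBINED_KEY)

# Within one dataset the token is stable, so joins inside that dataset still work.
assert hr_token == tokenize(employee_id, HR_KEY)
# Across datasets the tokens differ, so they cannot be used to join the two.
assert hr_token != combined_token
```

Per-dataset tokens only close off joins on the tokenized columns. Quasi-identifiers left in the clear can still line up across datasets, which is why the row-level techniques described earlier are needed as well.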


It is inevitable that data will leave the organization, either by design, by accident or by malice. The challenge for every organization is to demonstrate that any data a person has access to is uniquely traceable, both as a deterrent to insider threats and as a demonstration to organizational and governmental bodies that reasonable security measures are in place. Additionally, if any data does get into the public domain it is important to be able to determine the source. As part of the data privacy techniques previously mentioned, Privitar can place a unique Watermark that allows a dataset or subset thereof to be quickly identified. This ensures an organization can trace its data and how it is used. 
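
To make the idea of traceability concrete, here is a toy sketch of a per-recipient watermark. It is not Privitar’s Watermark, whose design is not described here; it simply assumes that each recipient’s copy carries a keyed, recipient-specific pattern of one-cent adjustments, and that the data owner keeps the original values for comparison. The secret, the function names and the recipient names are all hypothetical.

```python
import hashlib
import hmac

# Hypothetical secret held by the data owner; not part of any real product.
WATERMARK_SECRET = b"demo-watermark-secret"

def mark_bit(recipient, row_id):
    """Keyed pseudo-random bit (0 or 1) for one recipient and one row."""
    message = f"{recipient}:{row_id}".encode("utf-8")
    digest = hmac.new(WATERMARK_SECRET, message, hashlib.sha256).digest()
    return digest[0] & 1

def watermark(rows, recipient):
    """Return a copy of rows (row_id -> amount in cents) with 0 or 1 cent added per row."""
    return {row_id: cents + mark_bit(recipient, row_id) for row_id, cents in rows.items()}

def identify_source(leaked, original, recipients):
    """Return the recipient whose expected bit pattern matches the most leaked rows."""
    def matches(recipient):
        return sum(
            leaked[row_id] - original[row_id] == mark_bit(recipient, row_id)
            for row_id in leaked
        )
    return max(recipients, key=matches)

original = {row_id: 1000 + 37 * row_id for row_id in range(64)}
recipients = ["alice", "bob", "carol"]
copies = {name: watermark(original, name) for name in recipients}

# Suppose only every other row of Bob's copy leaks.
leaked = {row_id: copies["bob"][row_id] for row_id in range(0, 64, 2)}
print(identify_source(leaked, original, recipients))
```

Because every leaked row matches the true recipient’s pattern while other recipients match only about half of the rows by chance, even a subset of the data points back to its source. A production scheme would also need to survive rounding, re-sorting and deliberate tampering, which this sketch does not attempt.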


Data privacy is a fundamental requirement for every organization and must be available on an ever-increasing number of platforms: on-premise, cloud, multi-cloud and hybrid. Tokenization and obfuscation work well for a single record that is accessed by its primary identifiers, but a true privacy plan requires techniques that look across all the rows and the appropriate columns to apply a comprehensive data privacy strategy. The end result is managed copies of data whose security, auditing and so on are easy to describe and control, all underpinned by a tokenization policy that eliminates the threat of linkage attacks by providing unique tokens per dataset.

If, in the worst-case scenario, data does get into the wrong hands, it must first of all remain private (i.e., the data does not identify an individual). Moreover, technology and processes must be in place for an organization to identify the source of the data and the individual(s) involved in the leak, and to determine the cause and the underlying reasons. Used in combination, data privacy techniques allow data to be used to maximize competitive insights, enable innovation, improve the customer experience and decrease time to market. Enterprises that want to integrate data privacy into their business processes must commit to fully embedding these techniques into their operations to reduce risk and enable teams to be successful.