By David Millman, Principal Sales Engineer at Privitar

Increases in data-driven insights have led many enterprises to try to leverage their data to improve the customer experience, strengthen their brand image, accelerate innovation, and decrease time to market. With these data-driven insights come increased regulations to protect customer privacy, and with them, an increased focus on data privacy. Ensuring that data is properly protected is key to enabling organizations to make the most of their data. But what’s the best way to do this?

Typically, tokenization is the main data masking approach for obfuscating primary identifiers within a dataset, leading many people to believe it is the only required technique. The problem with this thinking is that the data is still at risk while data owners and executives believe they are safe, when nothing could be further from the truth. To achieve data privacy, a combination of the following techniques needs to be applied. These techniques go beyond tokenization to truly hide an individual and minimize compliance risk. The challenge, and therefore the goal, is not just to create anonymized data, but to retain its value while allowing more stakeholders, internally and externally, to benefit from it. To demonstrate why these techniques are important, it is essential to look at the requirements that data puts on the organization and how to address them.
Starting Point

Let’s start with the simplest case, as shown in Figure 2, where most privacy tools seem to meet the minimum requirements. Here, people from two different parts of an organization, HR and Operations, each have access to a specific dataset. As the dataset Access Control Lists (ACLs) map directly to groups in the organization’s Active Directory, it is easy to restrict access appropriately. With tokenization as a data masking technique, it is seemingly possible to restrict what any viewer can learn from a particular record.
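In this context, tokenization means replacing a direct identifier with a consistent, opaque token. A minimal sketch of the idea follows; this is not Privitar’s implementation, and the key, field names, and record are purely illustrative:

```python
import hmac
import hashlib

# Illustrative secret; in practice this would live in a key-management system.
SECRET_KEY = b"example-tokenization-key"

def tokenize(value: str) -> str:
    """Deterministically map an identifier to an opaque token.

    The same input always yields the same token, so joins on a tokenized
    column still work, but the original value cannot be recovered
    without the key.
    """
    return hmac.new(SECRET_KEY, value.encode(), hashlib.sha256).hexdigest()[:16]

record = {"name": "Alice Smith", "email": "alice@example.com", "department": "HR"}

# Mask only the direct identifiers; other attributes pass through unchanged.
masked = {k: tokenize(v) if k in ("name", "email") else v
          for k, v in record.items()}
```

Because the mapping is deterministic, the masked dataset still supports equality joins and group-bys on the tokenized columns, which is precisely why tokenization is so widely used as the first masking step.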
Figure 2: The simple situation most privacy solutions are built for

Unfortunately, the risk is still there, despite outward appearances. Let’s look at the issues in more detail:
- Trusted employees can become the risk
- Individuals can still be identified
- Records can still be linked to other data
- Trusted Employees Can Become the Risk
Figure 3: The biggest threat is a trusted employee
Individuals Can Still Be Identified

Most tokenization schemes concentrate on key identifiers, such as first name, last name, address, social security number, and email address. These schemes do not take into account quasi-identifiers, i.e., those attributes that, when combined, can be used to identify an individual. Combined into a compound key, such as transaction or purchase amount plus store ID, these attributes can return a single record. A couple of techniques need to be applied to eliminate this risk of identification:
- K-anonymity: often used for healthcare and public datasets, this ensures that no query over a given set of data fields can return fewer than k records.
- Perturbation: adds a small amount of noise, such as adding/removing a small value up to a threshold in a column of data, to de-identify the individual transaction, while keeping the overall data statistically correct.
Tokenization and general obfuscation techniques can be applied to individual records in isolation, as they don’t rely on knowledge of any other records. Calculating k-anonymity and perturbing data, by contrast, require more processing power, because the whole dataset must be considered to achieve the desired result while remaining statistically accurate.
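Both techniques can be sketched in a few lines. The dataset, quasi-identifier columns, and noise threshold below are illustrative, and real implementations would add generalization or suppression to raise k rather than just measure it:

```python
import random
from collections import Counter

def k_anonymity(rows, quasi_identifiers):
    """Return the dataset's k: the size of the smallest group of rows
    sharing the same values across the given quasi-identifier columns."""
    groups = Counter(tuple(row[c] for c in quasi_identifiers) for row in rows)
    return min(groups.values())

def perturb(values, max_noise=2.0, seed=None):
    """Add small uniform noise to each value, hiding exact amounts
    while keeping the overall distribution roughly intact."""
    rng = random.Random(seed)
    return [v + rng.uniform(-max_noise, max_noise) for v in values]

rows = [
    {"zip": "10001", "age_band": "30-39", "amount": 120},
    {"zip": "10001", "age_band": "30-39", "amount": 80},
    {"zip": "10002", "age_band": "40-49", "amount": 95},
]

# The (zip, age_band) pair "10002"/"40-49" matches only one row, so k = 1:
# a query on those fields isolates a single individual.
k = k_anonymity(rows, ["zip", "age_band"])
noisy_amounts = perturb([row["amount"] for row in rows], max_noise=2.0, seed=7)
```

Here a k of 1 signals that the quasi-identifiers alone can single out a record, which is exactly the risk described above.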
Records Can Still Be Linked

To feed an exponential innovation and growth curve, organizations must combine siloed data to maximize utility and sharpen insights. This often involves breaking down traditional organizational silos by creating new ad-hoc group memberships. A simple example of this is shown in Figure 4, where a subset of the two groups, HR and Operations, works together to create a new set of data products.
Figure 4: Combine people across organizational boundaries to form new groups that spark innovation

The privacy team is challenged with the following:
- Ensuring that the subset of HR and Operations employees working together can access only the privatized data in the combined dataset. The ever-growing matrix of datasets, views, and per-context permissions makes defining, enforcing, and auditing access an increasing managerial headache.
- Ensuring that the combined dataset is not available to those HR and Operations employees who should not have access.
- Ensuring that tokenization of the individual datasets eliminates the threat of linkage attacks, where it would be possible to identify an individual by creating new joins across the different datasets.
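The linkage threat is easy to demonstrate: if each silo tokenizes its direct identifiers separately but leaves quasi-identifiers in the clear, a join on those columns re-links the records anyway. The two toy datasets and column names below are hypothetical:

```python
# Two separately tokenized silos: names are masked with different token
# schemes, but the quasi-identifiers (zip, birth_year) are untouched.
hr_data = [{"token": "t1", "zip": "10001", "birth_year": 1985, "salary_band": "B"},
           {"token": "t2", "zip": "10003", "birth_year": 1990, "salary_band": "C"}]
ops_data = [{"token": "x9", "zip": "10001", "birth_year": 1985, "store_id": 42}]

def linkage_join(left, right, keys):
    """Join two datasets on shared quasi-identifier columns, re-linking
    records even though their primary identifiers were tokenized
    independently in each silo."""
    index = {tuple(r[k] for k in keys): r for r in right}
    return [{**row, **index[key]} for row in left
            if (key := tuple(row[k] for k in keys)) in index]

linked = linkage_join(hr_data, ops_data, ["zip", "birth_year"])
# A unique (zip, birth_year) match narrows both silos to the same person,
# combining salary_band and store_id despite two different token schemes.
```

This is why privatizing each dataset in isolation is not enough: the quasi-identifiers themselves must be generalized or perturbed before datasets are combined.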