The Privitar platform is used to de-identify datasets containing sensitive information, so that the datasets can be more safely used by an organization.
De-identifying a source dataset involves processing it to remove or obscure sensitive data, reducing the risk of re-identification while preserving utility. The utility might be for a particular data analytics use-case, or simply to share the data safely with other organizations.
The source dataset that is processed typically contains information that would be viewed as identifiable information, but it can also include other types of sensitive information, such as commercially sensitive information.
The dataset is de-identified by applying various Privacy Enhancing Technologies (PETs), chosen according to the context in which the data will be used. Relevant factors include the identity of the requester, the environment in which the data will be used, and the intended downstream usage or purpose (use-case) for the dataset.
De-identification often starts with analyzing the dataset for information that could be categorized as a direct identifier. A direct identifier, by itself, can identify an individual in a source dataset and can allow other datasets containing that individual to be linked with the source dataset. Linking datasets in this way may pose a significant risk if it allows sensitive data to be associated with the individual, hence the importance of de-identifying direct identifiers. Examples of direct identifiers include a patient ID number or an employee ID number.
The next step in the de-identification process is to analyze the data for quasi-identifiers. A quasi-identifier is information that is not identifying on its own, but can become identifying when used in combination with other quasi-identifiers. For example, the combination of an individual’s zip code, age, and gender may be sufficient to uniquely identify that individual.
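This kind of quasi-identifier risk can be checked mechanically by counting how many records share each combination of quasi-identifier values. The following is a minimal sketch (not Privitar functionality, just an illustration using toy data):

```python
from collections import Counter

# Toy records: (zip_code, age, gender) treated as quasi-identifiers.
records = [
    ("10001", 34, "F"),
    ("10001", 34, "F"),
    ("10002", 51, "M"),  # unique combination -> re-identification risk
    ("10003", 29, "F"),  # unique combination -> re-identification risk
]

# Count how many rows share each quasi-identifier combination.
counts = Counter(records)

# Any combination appearing exactly once can single out an individual.
unique_combos = [qi for qi, n in counts.items() if n == 1]
print(unique_combos)
```

Here two of the four records are uniquely identifiable from their quasi-identifier combination alone, even though no single field is identifying by itself.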
Privitar uses a collection of PETs to de-identify a dataset. In Privitar, a de-identified dataset is produced by applying Rules to every column in the dataset; the Rule chosen for a column determines the technique that is applied to it.
The de-identification techniques used include: redaction, tokenization, perturbation, substitution, encryption and generalization. If generalization is used, there is an option to go further and apply k-anonymity to the dataset. Generalization with k-anonymity reduces the risk that individuals can be uniquely identified via combinations of quasi-identifiers, by modifying the de-identified data to ensure that at least a minimum number of rows (k) exist in the dataset for any such combination.
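To make the mechanics concrete, here is a minimal sketch of generalization followed by a k-anonymity check. The coarsening choices (zip prefix, ten-year age bands) are illustrative assumptions, not Privitar's actual generalization strategies:

```python
from collections import Counter

K = 2  # minimum group size required for k-anonymity

rows = [
    {"zip": "10001", "age": 34},
    {"zip": "10002", "age": 36},
    {"zip": "10003", "age": 51},
    {"zip": "10009", "age": 58},
]

def generalize(row):
    # Coarsen the quasi-identifiers: zip -> 3-digit prefix,
    # age -> 10-year band. Coarser values mean larger groups.
    band = (row["age"] // 10) * 10
    return (row["zip"][:3], f"{band}-{band + 9}")

# Group rows by their generalized quasi-identifier combination.
groups = Counter(generalize(r) for r in rows)

# The dataset is k-anonymous if every combination covers >= K rows.
k_anonymous = min(groups.values()) >= K
print(groups, k_anonymous)
```

After generalization, every remaining quasi-identifier combination covers at least two rows, so no individual can be singled out by those attributes alone.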
De-identifying or redacting direct identifiers will, in most cases, render the de-identified dataset pseudonymous. This means that individuals can only be identified via their quasi-identifiers, either directly or by linking to an external dataset.
To de-identify datasets in Privitar, a few items need to be created or configured on the Privitar platform, as follows:
- A Privitar Schema is required to describe the records in the source data. The Privitar Schema describes one or more (possibly nested) tables of data and the columns they contain.
- A Privitar Policy is a mapping of columns to a set of Rules. It defines the Rules that will be applied to every column in a dataset in order to de-identify it. Once created, Privitar Policies may be reused to ensure consistent treatment of specific data values. The choice of Privitar Policy (and thus the Rules it is made up of) must be matched appropriately to the use-case.
- A Protected Data Domain (PDD) is a logical collection of de-identified datasets, usually brought together for a specific purpose or use-case. The PDD also contains descriptive metadata about the project. Over time, more datasets can be de-identified and information about them added to the same PDD. Each de-identified dataset associated with a PDD may, if desired, maintain referential integrity with other datasets in the same PDD. However, Privitar does not allow referential integrity between datasets in different PDDs, which reduces the risk that any linkage can be made between PDDs via their direct identifiers.
- To apply a Privitar Policy to a source dataset, a Masking Job is created in Privitar. A Masking Job specifies the source dataset and the Privitar Policy that will be executed on it. When the Job is run, the PDD records the details of how the data has been processed, providing a record of the Privitar Policies and Masking Jobs that were used to process the source dataset. A Job can be run multiple times on the same dataset.
- A Watermark is a unique digital stamp that is added to the records of the de-identified datasets defined in a specific Protected Data Domain (PDD). In the event of a data breach, this stamp can be used to trace the data back to the PDD for which it was originally produced.
- A Token Vault is a secure store that contains the tokens generated during de-identification of a dataset. If a Masking Job uses consistent tokenization, the Token Vault is populated with a mapping from the original source values to the tokens produced as replacements. Using consistent tokenization in a Job means that when multiple columns across different tables are masked, the same mapping is always used to transform values across all the tables in a dataset. Storing this mapping also makes it possible to use a controlled Unmasking process in Privitar to unmask a data column that has previously been masked.
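The idea behind consistent tokenization and a vault-backed unmasking step can be sketched as follows. This is an illustrative implementation using keyed hashing (HMAC), not Privitar's actual token-generation scheme; the key and vault here are toy stand-ins for securely managed components:

```python
import hashlib
import hmac

SECRET_KEY = b"example-key"  # hypothetical key; in practice held in secure storage

# The "vault": a mapping from original source values to their tokens.
token_vault = {}

def tokenize(value: str) -> str:
    # Deterministic: the same input always produces the same token, so
    # a value masked in several columns/tables stays consistent, which
    # preserves referential integrity within the dataset.
    if value not in token_vault:
        digest = hmac.new(SECRET_KEY, value.encode(), hashlib.sha256).hexdigest()
        token_vault[value] = "TOK-" + digest[:12]
    return token_vault[value]

t1 = tokenize("patient-12345")
t2 = tokenize("patient-12345")  # same input -> same token
print(t1 == t2)

# Controlled unmasking: invert the stored mapping to recover the original.
reverse = {tok: orig for orig, tok in token_vault.items()}
print(reverse[t1])
```

Because the mapping is stored rather than discarded, authorized unmasking is possible; without the vault (and key), the tokens alone do not reveal the source values.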