The general name for a set of data privacy techniques that hide or obscure the values in a dataset by replacing them with modified content (characters or other data). De-identification techniques include masking, tokenization, perturbation, encryption, and redaction.

Using these techniques typically produces a structurally similar version of the data that is suitable for software development and testing, or for training machine learning models. Because only the identifiers are altered, data utility remains high: the goal is usually to retain the complexity and patterns within the data while de-identifying the sensitive values.
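To make two of the named techniques concrete, here is a minimal Python sketch of masking and tokenization. The function names and formats are illustrative assumptions, not any specific library's API; a deterministic keyed token is used so that joins across tables still work after de-identification.

```python
import hmac
import hashlib

def mask_ssn(ssn: str) -> str:
    """Masking: replace all but the last four digits with '*'.

    Illustrative example: "123-45-6789" -> "***-**-6789".
    """
    total = sum(c.isdigit() for c in ssn)
    seen = 0
    out = []
    for c in ssn:
        if c.isdigit():
            seen += 1
            out.append(c if seen > total - 4 else "*")
        else:
            out.append(c)  # keep separators so the format is preserved
    return "".join(out)

def tokenize(value: str, key: bytes) -> str:
    """Tokenization: replace a value with a deterministic surrogate.

    The same input always yields the same token (under the same key),
    so referential integrity between tables is preserved.
    """
    return hmac.new(key, value.encode(), hashlib.sha256).hexdigest()[:12]
```

Because `tokenize` is deterministic, a customer ID that appears in two tables maps to the same token in both, which is what keeps the de-identified data usable for testing and model training.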

De-identification is one of the most commonly used protection mechanisms for sensitive data in organizations. It protects the privacy of individuals by obscuring direct identifiers, making it infeasible to look anyone up in the dataset using those identifiers. De-identifying direct identifiers alone, however, isn't enough to protect data against more sophisticated attacks, such as a linkage attack, in which background information (for example, combinations of quasi-identifiers like ZIP code, birth date, and sex) is used to re-identify individuals.
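A linkage attack can be sketched in a few lines of Python with toy data. The records and field names below are invented for illustration: even though the "de-identified" dataset contains no names, joining it to a public dataset on shared quasi-identifiers re-identifies the individual.

```python
# Toy de-identified dataset: direct identifiers (names) removed,
# but quasi-identifiers (zip, birth_year, sex) left intact.
deidentified = [
    {"zip": "02139", "birth_year": 1985, "sex": "F", "diagnosis": "flu"},
]

# Toy public dataset (e.g. a voter roll) containing the same quasi-identifiers.
voter_roll = [
    {"name": "Alice", "zip": "02139", "birth_year": 1985, "sex": "F"},
]

# Link the two datasets on the quasi-identifiers alone.
linked = [
    (v["name"], d["diagnosis"])
    for d in deidentified
    for v in voter_roll
    if (d["zip"], d["birth_year"], d["sex"]) == (v["zip"], v["birth_year"], v["sex"])
]
# linked now pairs the "anonymous" record back to a named individual
```

This is why de-identifying direct identifiers alone is insufficient: defending against linkage also requires treating quasi-identifiers, for example by generalizing or perturbing them.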
