The A-Z of Privacy

Making sense of the jargon behind data protection

Protecting data can often be a daunting task. Along the way you have to navigate complex processes and get to grips with complicated jargon. We’re the first to admit, our industry can at times be confusing, especially to non-technical audiences.

But we also know how important privacy is to organizations looking to derive value from data. It’s too important to remain obscure, so to help out we’ve compiled a glossary featuring some of the most important terms in privacy engineering. We’ll keep updating the list, so make sure to check back regularly. And if you can’t find a term you’re dying to understand, get in touch and we’ll do our best to help you out.

Learn the jargon. It’s useful for your job and it’ll make you a hit at dinner parties.


Aggregation

A data ‘summary’ where the values of a number of rows are grouped together to form a single value, such as an average salary per department or a sum of all sales across a category. Aggregates are also often used in machine learning models.

Since they don’t identify any one individual, such data aggregates may at first sight look safe to publish. But statistical summaries can be vulnerable to reconstruction or differencing attacks, which allow an attacker to identify individuals by making multiple queries.
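To make the idea concrete, here’s a small Python sketch (the records and values are our own illustrative data, not from any real dataset) that reduces individual salary rows to one aggregate value per department:

```python
# Hypothetical employee records (illustrative values only)
rows = [
    {"dept": "Sales", "salary": 52000},
    {"dept": "Sales", "salary": 48000},
    {"dept": "IT", "salary": 61000},
]

# Group salaries by department, then reduce each group to a single value
by_dept = {}
for row in rows:
    by_dept.setdefault(row["dept"], []).append(row["salary"])

averages = {dept: sum(s) / len(s) for dept, s in by_dept.items()}
print(averages)  # {'Sales': 50000.0, 'IT': 61000.0}
```

Notice that the ‘IT’ average here is computed from a single record, so it reveals one person’s exact salary – precisely the kind of small-group leakage that makes statistical summaries less safe than they look.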

Anonymization

The word ‘anonymization’ is used to mean different things in different communities (e.g. legal and technical) and in different jurisdictions (e.g. EU and US). In the EU, ‘anonymous information’ is defined in Recital 26 of the GDPR and falls outside the remit of data protection law.
For data to be considered anonymous, it must not be possible to identify an individual using any means reasonably likely to be used. This is assessed in terms of linkability, singling out, and inference. As such, within the EU, evaluating whether or not data is anonymous is a risk-based evaluation, with anonymization defined as a risk threshold (unlike elsewhere).
Countries within the EU apply both different thresholds and different ways of evaluating them, so there can be disagreement about what is or is not anonymous. In the US, ‘anonymous’ does not have the same legal significance and is often used the way the term ‘pseudonymous’ is used in the EU. Outside of the EU and US, other definitions are also used.


Big data

Big data is usually characterized by its unprecedented volume (organizations can now capture, store and process much more data than they could previously), variety (it includes traditional structured data as well as new types of data sources, such as clickstreams, tweets, images, etc.), and velocity (it’s often available in real time).
Big data has opened up entirely new opportunities for organizations looking to derive insight, build new products, and make decisions based on information. But the data sets can be so complex that they challenge traditional ways of managing and processing them.
From a privacy perspective, big data has brought a host of new challenges, as it’s vulnerable to a number of new and sophisticated attacks, such as linkage attacks or tracker attacks. Fear of the associated privacy risk often keeps businesses from exploiting the full potential of the big data that exists in their organization.


Data encryption

Encryption is the process of encoding data in such a way that only authorized parties can access it and those who are not authorized cannot. In an encryption scheme, the source information, referred to as plaintext, is encrypted using an encryption algorithm to generate ciphertext that can be read only if decrypted. It is good practice to encrypt data both at rest and in transit. However, while encryption can help protect against unauthorized access, it does not protect the privacy of individuals’ data when it’s used by people who are authorized (e.g. against an insider attack).

Data masking

Data masking refers to a number of techniques that hide original data with random characters or data, such as tokenization, perturbation, encryption, and redaction. It produces a similar version of the data, e.g. for software development and testing, or for training ML models.
Masking maintains good data utility since it doesn’t alter anything but the identifiers. When masking data, it’s usually important to retain the complexity and patterns within the data while masking the sensitive values.

Masking is one of the most commonly used protection mechanisms for sensitive data in organizations. It protects the privacy of individuals by obscuring direct identifiers (such as name, address, account number, email, phone number etc.), thus making it impossible to look anybody up in the dataset by these identifiers.
Masking of direct identifiers alone isn’t enough when you’re trying to protect data against more sophisticated attacks (e.g. a linkage attack, where background information is used to identify individuals).
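As a rough sketch of consistent masking, here’s a Python example that replaces direct identifiers with opaque tokens derived from a keyed hash. The key, field names, and record are our own invention, purely for illustration:

```python
import hashlib
import hmac

SECRET_KEY = b"keep-this-in-a-vault"  # hypothetical key, stored separately in practice

def mask(value: str) -> str:
    """Deterministically mask a value: same input always yields the same token."""
    digest = hmac.new(SECRET_KEY, value.encode(), hashlib.sha256).hexdigest()
    return digest[:12]  # shortened for readability

record = {"name": "Ada Lovelace", "email": "ada@example.com", "age": 36}
masked = {
    "name": mask(record["name"]),
    "email": mask(record["email"]),
    "age": record["age"],  # not a direct identifier, left intact
}
```

Because the same input always maps to the same token, records can still be joined across tables – but as the paragraph above notes, the untouched non-identifier fields can still enable linkage attacks.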

Data monetization

The process of turning data (usually sensitive or personal data) into an economic benefit or competitive advantage. Mature monetization models rely on large-scale, repeatable data operations, including consistently applied privacy policies. That’s why, for mature data monetization models that leverage sensitive data, inbuilt privacy is a prerequisite.

Data privacy

Privacy refers to all the rules and processes an organization has in place to manage and protect sensitive data while it’s being used by authorized staff and technology (e.g. for analytics, Test & Dev, or machine learning).
The terms “Privacy” and “Security” are sometimes used synonymously, but they mean different things: Security refers to measures that protect data against unauthorized access.
Privacy and Security are complementary: organizations need to put both in place to sufficiently protect their data against breaches. Privacy technology plays a crucial role in making data processes consistently safe, repeatable, and auditable – which is key for organizations looking to drive more value from their proprietary data.

Data security

The rules and processes that protect data against unauthorized access (such as cyber attacks). The term is often incorrectly used as a synonym for Privacy.
Mature data organizations increasingly recognize that Security and Privacy are complementary, and are putting measures in place to manage both authorized and unauthorized access to data consistently across the enterprise.

Data utility

A measure of how useful a data set is (e.g. for business insight, analytics, or machine learning). Removing personal data or adding noise can reduce a data set’s utility.

There are a number of use cases in which sensitive data isn’t actually needed to achieve valuable results – de-identified data is just as useful (we talk about it in this blog post). Businesses sometimes need to assess the privacy/utility trade-off when making decisions on privacy policies.

Differencing attack

An attack where an attacker can isolate an individual value by combining multiple aggregate statistics about a data set. Our research lead, Charlie Cabot, explains it in this video.
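Here’s a toy Python sketch (entirely made-up values) showing how two innocent-looking sum queries isolate one person’s salary:

```python
# Hidden raw data the attacker never sees directly
salaries = {"alice": 52000, "bob": 48000, "carol": 61000}

# Query 1: a sum over everyone -- looks safe, it's an aggregate
sum_all = sum(salaries.values())

# Query 2: a sum over everyone except Carol -- also an aggregate
sum_without_carol = sum(v for k, v in salaries.items() if k != "carol")

# Differencing the two aggregates reveals Carol's exact salary
carols_salary = sum_all - sum_without_carol
print(carols_salary)  # 61000
```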

Differential privacy

A guarantee that no one can learn anything significant about any individual from their inclusion in the data. It’s a strong way to protect the privacy of aggregate statistics – such as counts and averages. Differentially private statistics are engineered such that the statistic will be similar, regardless of whether a particular user is included in the data. Typically, a system achieves differential privacy by restricting the statistics that are released and adding random noise to the statistics. Differential privacy has a parameter called epsilon, which controls the level of privacy. So long as epsilon is set appropriately, differential privacy is one of the strongest privacy guarantees available for practical use.
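A minimal Python sketch of the standard Laplace mechanism for a count query (the function name is ours, and this is a simplification, not production-grade DP). Since the standard library has no Laplace sampler, the noise is drawn as the difference of two exponential variables, which is Laplace-distributed with scale 1/epsilon – the right scale for a count, whose sensitivity is 1:

```python
import random

def dp_count(records, epsilon: float) -> float:
    """Release a record count with Laplace noise of scale 1/epsilon."""
    true_count = len(records)
    # Difference of two Exp(rate=epsilon) draws is Laplace(scale=1/epsilon)
    noise = random.expovariate(epsilon) - random.expovariate(epsilon)
    return true_count + noise

# Smaller epsilon -> more noise -> stronger privacy, lower accuracy
print(dp_count(range(100), epsilon=0.5))
```

Each time the query runs, a fresh noisy answer near 100 is released; an attacker cannot tell from any one answer whether a given individual was in the data.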

Direct identifier

A value which could identify an individual on its own. Some direct identifiers are unique, such as social security numbers. Others are not unique but are highly identifying, such as name and address.


GDPR

The General Data Protection Regulation is a comprehensive EU data protection law, effective from 25th May 2018. Replacing the 1995 Data Protection Directive, it brings in new rights for individuals and new responsibilities for organizations processing personal data. It also gives data protection authorities stronger powers, including large fines and the ability to stop organizations from processing personal data. The law applies to all processing of personal data within the EU, and can apply to organizations processing personal data outside of the EU if that data is about EU data subjects.

Generalization

A method that transforms a value into a more general one.
For instance, you can generalize a number by replacing it with an interval (e.g. 33 -> 30-35), you can generalize a day/month/year by replacing it with just month/year, or you can generalize a category by replacing it with a broader one (e.g. iPhone 11 -> iPhone).
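The examples above translate into a couple of one-line helpers. This Python sketch (function names are ours) generalizes an age into a bucket and a full date into month/year:

```python
def generalize_age(age: int, width: int = 5) -> str:
    """Replace an exact age with an interval, e.g. 33 -> '30-35'."""
    lower = (age // width) * width
    return f"{lower}-{lower + width}"

def generalize_date(iso_date: str) -> str:
    """Replace a full ISO date with just year-month, e.g. '1990-06-14' -> '1990-06'."""
    return iso_date[:7]

print(generalize_age(33))             # 30-35
print(generalize_date("1990-06-14"))  # 1990-06
```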


HIPAA

The Health Insurance Portability and Accountability Act (HIPAA) Privacy Rule is the first comprehensive US Federal protection for the privacy of personal health information (PHI).
The HIPAA Privacy Rule establishes national standards to protect individuals’ medical records and other personal health information, and applies to health plans, health care clearinghouses, and health care providers that conduct electronic health care transactions.

Homomorphic encryption

A type of encryption that allows computation on encrypted data. For instance, additive homomorphic encryption enables numbers to be added while still in encrypted form.

Several business drivers are pushing the need for meaningful data processing that doesn’t expose sensitive information, first and foremost:

  • The desire to outsource data processing to the cloud in a way that lets processors process the data without knowing what it is.
  • The emergence of new business models built around different contributors’ proprietary data, where all contributors want to know aggregate data about the group, but no one wants to reveal their own sensitive data.

Homomorphic encryption can also contribute towards secure multi-party computation.
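To illustrate the ‘additive’ property, here is a toy Paillier-style scheme in Python. The parameters are deliberately tiny and completely insecure – this is a sketch of the arithmetic, not a usable implementation:

```python
from math import gcd

# Toy key generation (primes far too small for real security)
p, q = 61, 53
n = p * q                                      # public modulus
n2 = n * n
lam = (p - 1) * (q - 1) // gcd(p - 1, q - 1)   # lcm(p-1, q-1), private
mu = pow(lam, -1, n)                           # valid because we use g = n + 1

def encrypt(m: int, r: int) -> int:
    """c = (1 + m*n) * r^n mod n^2, with random r coprime to n."""
    return ((1 + m * n) * pow(r, n, n2)) % n2

def decrypt(c: int) -> int:
    x = pow(c, lam, n2)
    return (((x - 1) // n) * mu) % n

c1 = encrypt(12, r=17)
c2 = encrypt(30, r=23)
c_sum = (c1 * c2) % n2     # multiplying ciphertexts...
print(decrypt(c_sum))      # ...adds the plaintexts: 42
```

The key point: whoever holds only `c1` and `c2` can compute an encryption of the sum without ever seeing 12 or 30.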


Indirect identifier

An indirect identifier is a value which cannot be used to identify an individual on its own, but could be identifying if combined with other indirect identifiers. Indirect identifiers are similar to quasi-identifiers; however, some organizations, including Eurostat and Privitar, distinguish between the two based on how likely it is that the value could be used to re-identify an individual. For example, date of birth and postcode would be considered quasi-identifiers, as they are quite likely to be known by others and so could be used in a linkage attack. By contrast, a value that isn’t widely known, for instance an internally generated value like a payment plan category, might be considered an indirect identifier.


k-anonymity

A privacy model that’s useful for protecting data in certain sharing scenarios (e.g. within an organization or with trusted partners). The issue: anonymizing direct identifiers isn’t enough if individuals can be re-identified through quasi-identifiers (such as ZIP or post code, date of birth, etc.) when linked to other available data sets. k-anonymity ensures that any combination of quasi-identifier values in the data set matches at least k records, so no individual can be singled out by those attributes alone.
To achieve k-anonymity, there need to be at least k individuals in the data set sharing each combination of identifying attributes. If that isn’t the case, k-anonymity can be achieved by generalizing the data (e.g. rather than showing an exact birthday, the data shows a year; rather than an age of 23, a range of 20-30).
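A small Python sketch (made-up rows, reduced to just their quasi-identifiers) showing how to measure k and how generalization raises it:

```python
from collections import Counter

# Hypothetical rows as (age bucket, postcode area) quasi-identifier pairs
records = [
    ("20-30", "SE1"),
    ("20-30", "SE1"),
    ("30-40", "SE1"),
    ("30-40", "N1"),
]

def k_of(rows) -> int:
    """A dataset is k-anonymous for k = the size of its smallest group."""
    return min(Counter(rows).values())

print(k_of(records))  # 1 -- the ('30-40', 'N1') row is unique, so not anonymous

# Generalizing the postcode to a coarser region raises k (at a utility cost)
generalized = [(age, "London") for age, _ in records]
print(k_of(generalized))  # 2
```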


Linkage attack

A linkage attack attempts to re-identify individuals in an anonymized dataset by combining that data with background information. The ‘linking’ uses quasi-identifiers, such as zip or postcode, gender, or salary, that are present in both sets to establish identifying connections.
Many organizations aren’t aware of the linkage risk involving quasi-identifiers, and while they may mask direct identifiers, they often don’t think of masking, or generalizing, the quasi-identifiers. If you’d like to dive deeper, check out the blog post.
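A toy Python sketch of the attack, loosely inspired by the classic medical-records/public-register example (all data below is made up): the ‘masked’ dataset has no names, yet joining on the quasi-identifiers re-identifies a patient.

```python
# 'Masked' medical data: direct identifiers removed, quasi-identifiers kept
masked_medical = [
    {"zip": "02138", "dob": "1945-07-31", "diagnosis": "heart disease"},
    {"zip": "02139", "dob": "1951-03-12", "diagnosis": "flu"},
]
# Public background data sharing the same quasi-identifiers
public_register = [
    {"name": "J. Smith", "zip": "02138", "dob": "1945-07-31"},
]

def link(masked, public):
    """Join the two datasets on the quasi-identifiers (zip, dob)."""
    matches = []
    for med in masked:
        for person in public:
            if (med["zip"], med["dob"]) == (person["zip"], person["dob"]):
                matches.append((person["name"], med["diagnosis"]))
    return matches

print(link(masked_medical, public_register))  # [('J. Smith', 'heart disease')]
```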

Location data

Data about an individual’s movements, often recorded by personal devices, such as smartphones and wearables. Location data poses a big privacy challenge, as it’s very personally revealing, it’s notoriously hard to anonymize, and it’s vulnerable to linkage attacks. If you know approximately where someone was at four different points in time, that’s enough to effectively reverse most location data anonymization. Differential privacy can be a useful method for protecting aggregate statistics about location data. We’ve written a blog post about it.


Noise addition

A type of perturbation that ensures privacy by adding ‘noise’, e.g. a random number, to a value. This slightly distorts the data, but still gives the analyst statistically useful information.


Personal data

Under GDPR, “personal data” means any information that relates to an identifiable person who can be directly or indirectly identified by reference to an identifier. This means that a whole range of identifiers constitute personal data (e.g. name, identification number, location data, even a web cookie). GDPR also defines a special category of ‘sensitive personal data’ that includes genetic and biometric data.
Anonymizing such data can be extremely useful for organizations, as sufficient anonymization takes it out of the scope of regulations such as GDPR. It’s worth mentioning that sensitive data isn’t restricted to just personal information: for an organization, commercial data about transactions or financial performance can be highly sensitive, too. In the US, the term ‘Personally Identifiable Information’ (PII) is more common. And while legally, ‘personal data’ and ‘PII’ are not exact equivalents, they’re often used interchangeably.

Perturbation

Perturbation adds small amounts of random noise to field values or query results. It protects them against privacy attacks which rely on knowledge of specific values. Perturbation-based approaches must ensure that the noise magnitude is small enough that the valuable insights in the dataset are preserved.
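A minimal Python sketch (our own function and values) that perturbs each age with bounded uniform noise, distorting individual values while keeping the mean close:

```python
import random

def perturb(value: float, scale: float = 2.0) -> float:
    """Add small, bounded uniform noise to a single value."""
    return value + random.uniform(-scale, scale)

ages = [23, 35, 41, 29]
noisy = [perturb(a) for a in ages]

# Individual values are distorted by at most +/- 2 years,
# but aggregate statistics like the mean remain useful:
print(sum(ages) / len(ages), sum(noisy) / len(noisy))
```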

PII

Personally identifiable information; see personal data.

Primary identifier

A data point that allows you to directly identify an individual, such as a bank card number or a customer ID.

Privacy engineering

A term that refers to the methodologies, tools and techniques that can help manage sensitive data within an organization and provide acceptable levels of privacy. It’s an emerging discipline within software engineering. The goal of privacy engineering is to help businesses and other data-processing organizations derive value from their data while minimizing the risk that comes with analyzing, sharing, and querying sensitive information.

Protected Data Domain

A Protected Data Domain (PDD) is a set of managed data releases that allows the privacy risks of that data to be evaluated and mitigated. It is the unit of data for privacy governance and management. The PDD records data lineage, permitted recipient, purpose and lifetime of the data, and what privacy protections have been applied. Data in PDDs can be watermarked, enabling traceability in the event of a data breach.

Data owners apply data protection controls to PDDs, such as pseudonymization, generalization, and differential privacy. Datasets within a PDD retain referential integrity and linkability, but are not directly linkable to another PDD. This separation enables the data owner to calculate risk scores for each PDD and to reason about the implications of publishing or sharing data.

Pseudonymization

Pseudonymization is defined in the GDPR to mean “the processing of personal data in such a manner that the personal data can no longer be attributed to a specific data subject without the use of additional information, provided that such additional information is kept separately and is subject to technical and organizational measures to ensure that the personal data are not attributed to an identified or identifiable natural person.”
This is normally achieved by removing direct identifiers, such as a name or email address, and replacing them with a pseudonym. This process is also known as data masking or tokenization.
Unlike anonymization, pseudonymization can be set up to be reversible. While pseudonymized data remains personal data under the GDPR, the law encourages organizations to pseudonymize data whenever possible.
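A Python sketch of reversible pseudonymization (our own illustrative design, not a product API): the lookup table plays the role of the ‘additional information’ that the GDPR says must be kept separately and secured.

```python
import secrets

_vault = {}      # pseudonym -> original value (keep under strict access control)
_assigned = {}   # original value -> pseudonym, so the mapping stays consistent

def pseudonymize(value: str) -> str:
    """Replace a direct identifier with a random but consistent pseudonym."""
    if value not in _assigned:
        token = "ID-" + secrets.token_hex(4)
        _assigned[value] = token
        _vault[token] = value
    return _assigned[value]

def reidentify(token: str) -> str:
    """Reverse the pseudonym -- only possible for someone holding the vault."""
    return _vault[token]

t = pseudonymize("ada@example.com")
assert pseudonymize("ada@example.com") == t   # consistent
assert reidentify(t) == "ada@example.com"     # reversible, given the vault
```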


Quasi-identifier

Also known as ‘indirect identifier’. A piece of information that does not itself identify an individual, but can do so when combined with other quasi-identifiers (e.g. date of birth; zip or postcode; location data; salary). Quasi-identifiers are a big privacy issue, as a seemingly privacy-safe dataset can expose individuals when combined with another dataset containing information about the same individual (e.g. from the public domain).


Reconstruction attack

A type of privacy attack on aggregate data that reconstructs a significant portion of a raw dataset. You can think of each aggregate statistic as an equation, where the variables represent the sensitive attributes. With enough information, the system of equations can be solved and all the sensitive attributes determined – a bit like Sudoku. Theresa Stadler, one of Privitar’s research scientists, explains it in this video.
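The equation-solving analogy in miniature, as a Python sketch with made-up values: three published sums over three people form a solvable linear system, and the exact salaries fall out.

```python
# Raw data the attacker never sees
secret = {"a": 52000, "b": 48000, "c": 61000}

# Three published, 'harmless' aggregates
s_all = sum(secret.values())        # sum over everyone
s_ab = secret["a"] + secret["b"]    # sum over department 1
s_bc = secret["b"] + secret["c"]    # sum over department 2

# Solve the system of equations, Sudoku-style
a = s_all - s_bc
c = s_all - s_ab
b = s_ab + s_bc - s_all
print(a, b, c)  # all three exact salaries recovered
```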

Redaction

A data masking technique that deletes all or part of the field value. Often, redaction will retain high-level information but remove detail, such as retaining only the first part of a postcode (SE1 8RT ⇝ SE1), or the last four digits of a credit card number.
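Both examples in one short Python sketch (helper names are ours):

```python
def redact_postcode(postcode: str) -> str:
    """Keep only the outward part, e.g. 'SE1 8RT' -> 'SE1'."""
    return postcode.split()[0]

def redact_card(card: str) -> str:
    """Keep only the last four digits, e.g. a 16-digit number -> '************1111'."""
    return "*" * (len(card) - 4) + card[-4:]

print(redact_postcode("SE1 8RT"))         # SE1
print(redact_card("4111111111111111"))    # ************1111
```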


Secure multi-party computation

A technology that allows several parties to collaborate and compute a function while keeping their respective inputs private.
In a business context, this has been a huge advance for businesses looking to collaborate on their data (e.g. to build new data products) without having to share raw datasets. Secure multi-party computation is also useful where regulations or trust levels don’t allow data to be shared between parties – e.g. across country borders. It often uses methods from homomorphic encryption.
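One of the simplest building blocks is additive secret sharing. In this Python sketch (a toy protocol, with all ‘parties’ simulated in one process), three parties learn the total of their inputs without any of them revealing their own value:

```python
import random

P = 2_147_483_647                 # public modulus; all arithmetic is mod P
inputs = [52000, 48000, 61000]    # each party's private value

def share(value: int, n_parties: int):
    """Split a value into n random shares that sum to it mod P."""
    shares = [random.randrange(P) for _ in range(n_parties - 1)]
    shares.append((value - sum(shares)) % P)
    return shares

# Each party splits its input; party i then holds the i-th share of every value
all_shares = [share(v, 3) for v in inputs]

# Each party locally sums the shares it holds -- a single share reveals nothing
partial_sums = [sum(col) % P for col in zip(*all_shares)]

# Only the combination of all partial sums reveals the total
total = sum(partial_sums) % P
print(total)  # 161000
```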

Self-service analytics

A data provisioning system that lets a large number of data users in an organization access data for analytics, without needing to request access from IT for every query. This includes non-traditional data users, e.g. from the line of business, or so-called “citizen analysts”. It’s a huge step for organizations trying to break down data silos and derive value, insight, and competitive advantage from their data.

A central repository like this, full of rich and sensitive business data (e.g. on customers, transactions, financial performance), is highly vulnerable to privacy attacks. For an effective self-service analytics system to operate privacy-safely and at scale, its in-built data protection mechanisms need to be centrally managed and apply consistent privacy policies across all data. Also referred to as “Data-as-a-Service”.


Test data

Data that’s used in a non-production environment to test an application before moving it into production.
It can be quite hard to synthetically generate data that’s realistic and rich enough to confidently test for all use cases, so quite a few organizations still use raw data in their Test & Dev environments. This poses a big privacy risk.

Tokenization

A data masking technique that replaces the field value with a ‘token’, a synthetic value that stands in for the real value. The pattern for the generated token is configurable and can be chosen to match the format of the source data, preserving data formats (e.g. for testing and development).

Tokenization can also be done consistently, meaning that the same value is always replaced with the same token, such that referential integrity is preserved in the dataset.
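A Python sketch of consistent, format-preserving tokenization for 16-digit card numbers (the token map and function name are our own illustration, not any product’s API):

```python
import secrets

_tokens = {}  # source value -> token; one map per dataset keeps tokens consistent

def tokenize_card(card: str) -> str:
    """Replace a 16-digit card number with a synthetic one of the same format."""
    if card not in _tokens:
        _tokens[card] = "".join(secrets.choice("0123456789") for _ in range(16))
    return _tokens[card]

t1 = tokenize_card("4111111111111111")
t2 = tokenize_card("4111111111111111")
assert t1 == t2                        # consistent: referential integrity preserved
assert len(t1) == 16 and t1.isdigit()  # same format as the source value
```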

Tracker attack

A tracker attack is one in which an attacker isolates an individual by making multiple aggregate queries (e.g. AVG() and SUM() in SQL) on a data set. Most aggregate query interfaces are naïve and only block queries with small result set sizes. That safeguard alone is insufficient to preserve privacy.


Watermarking

The act of embedding a hard-to-detect, hard-to-remove pattern into a privacy-safe data set. It makes it possible to trace published data back to its source (e.g. in case of a data breach). It also acts as an additional deterrent to sharing data outside its intended use. Watermarks are a feature of the Privitar Privacy Platform.

Ready to learn more?

Our team of data privacy experts are here to answer your questions and discuss how data privacy can fuel your business.