Privitar Glossary

The A-Z of Privacy

Making sense of the jargon behind data protection

 
Protecting data can often be a daunting task. Along the way you have to navigate complex processes and get to grips with complicated jargon. We're the first to admit that our industry can at times be confusing, especially to non-technical audiences.

But we also know how important privacy is to organisations looking to derive value from data. It's too important to remain obscure, so to help out we've compiled a glossary featuring some of the most important terms in privacy engineering. We'll keep updating the list, so make sure to check back regularly. And if you can't find a term you're dying to understand, get in touch and we'll do our best to help you out.

Learn the jargon. It’s useful for your job and it’ll make you a hit at dinner parties.

Aggregate data
A data 'summary' in which the values of a number of rows are grouped together to form a single value, such as an average salary per department or a sum of all sales across a category. Aggregates are also often used in machine learning models. Since they don't identify any one individual, data aggregates may at first sight look safe to publish. But statistical summaries can be vulnerable to reconstruction or differencing attacks, which allow an attacker to identify individuals by making multiple queries.

See also:

Differencing Attack, Noise Addition, Reconstruction Attack

Anonymisation
The word ‘anonymisation’ is used to mean different things in different communities (e.g. legal and technical) and in different jurisdictions (e.g. EU and US). In the EU, ‘anonymous information’ is defined in Recital 26 of the GDPR and falls outside the remit of data protection law. For data to be considered anonymous, it must not be possible to identify an individual using any means reasonably likely to be used; this is assessed in terms of linkability, singling out, and inference. Within the EU, then, evaluating whether or not data is anonymous is a risk-based exercise, with anonymisation defined as a risk threshold. Different countries within the EU use different ways of evaluating the threshold, and different thresholds, so there can be disagreement about what is or is not anonymous. In the US, ‘anonymous’ does not have the same legal significance, and is often used in the way the term ‘pseudonymous’ is used in the EU. Outside of the EU and US, other definitions are also used.

See also:

Pseudonymisation, GDPR, Personal Data, Data Monetisation

Big Data
Big data is usually characterised by its unprecedented volume (organisations can now capture, store, and process much more data than they could previously), variety (it includes traditional structured data as well as new types of data sources, such as clickstreams, tweets, and images), and velocity (it's often available in real time). Big data has opened up entirely new opportunities for organisations looking to derive insight, build new products, and make decisions based on information. But the data sets can be so complex that they often challenge traditional ways of managing or processing them. From a privacy perspective, big data has brought a host of new challenges, as it's vulnerable to a number of new and sophisticated attacks, such as linkage attacks or tracker attacks. Fear of privacy risk often keeps businesses from exploiting the full potential of the big data that exists in their organisation.

See also:

Linkage Attack, Tracker Attack, Privacy, Security

Data Encryption
Encryption is the process of encoding data in such a way that only authorised parties can access it. In an encryption scheme, the source information, referred to as plaintext, is encrypted using an encryption algorithm to generate ciphertext that can be read only if decrypted. It is good practice to encrypt data at rest and in transit. However, while encryption can help protect against unauthorised access, it does not protect the privacy of individuals' data when it's used by people who are authorised (e.g. against an insider attack).
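
As a concrete sketch of the plaintext-to-ciphertext round trip, here are a few lines of Python using the third-party cryptography package (our choice for illustration; any vetted symmetric cipher behaves the same way):

```python
from cryptography.fernet import Fernet  # pip install cryptography

# Symmetric encryption round trip: plaintext -> ciphertext -> plaintext.
key = Fernet.generate_key()   # only authorised parties should hold this key
f = Fernet(key)

ciphertext = f.encrypt(b"account 12345678, balance 4200")
print(ciphertext)             # unreadable without the key
print(f.decrypt(ciphertext))  # b'account 12345678, balance 4200'
```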

See also:

Re-identification

Data Masking
Data masking refers to a number of techniques that hide original data behind random characters or substitute values, such as tokenisation, perturbation, encryption, and redaction. It produces a similar-looking version of the data, e.g. for software development and testing, or for training machine learning models. Masking maintains good data utility since it doesn't alter anything but the identifiers. When masking data, it's usually important to retain the complexity and patterns within the data while obscuring the sensitive values.

Masking is one of the most commonly used protection mechanisms for sensitive data in organisations. It protects the privacy of individuals by obscuring direct identifiers (such as name, address, account number, email, phone number etc.), thus making it impossible to look anybody up in the dataset by these identifiers. However, masking of direct identifiers alone isn't enough when you're trying to protect data against more sophisticated attacks (e.g. a linkage attack, where background information is used to identify individuals).
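
Here's a minimal sketch of the idea in Python, with invented field names: direct identifiers are tokenised or redacted, while the useful values are left intact:

```python
import random
import string

# Illustrative masking sketch, not a production pipeline.
token_map = {}  # same input always gets the same token (consistency)

def mask_name(name):
    if name not in token_map:
        token_map[name] = "".join(random.choices(string.ascii_uppercase, k=8))
    return token_map[name]

record = {"name": "Ada Lovelace", "email": "ada@example.com", "salary": 52000}
masked = {
    "name": mask_name(record["name"]),                  # tokenised, consistent
    "email": "*****@" + record["email"].split("@")[1],  # partially redacted
    "salary": record["salary"],                         # utility preserved
}
print(masked)
```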

See also:

Linkage Attack, Tokenisation, Redaction, Data Encryption, Data Perturbation, Data Utility

Data Monetisation
The process of turning data (usually sensitive, or personal data) into an economic benefit, or competitive advantage. Mature monetisation models rely on large-scale, repeatable data operations, including consistently applied privacy policies. That's why, for mature data monetisation models that leverage sensitive data, inbuilt privacy is a prerequisite.

Data Perturbation
Perturbation adds small amounts of random noise to field values or query results. It protects them against privacy attacks which rely on knowledge of specific values. Perturbation-based approaches must ensure that the noise magnitude is small enough that the valuable insights in the dataset are preserved.

See also:

Noise Addition, Differential Privacy

Data Utility
A measure of how useful a data set is (e.g. for business insight, analytics, or machine learning). The removal of personal data and/or the addition of noise can make data suffer a loss of utility. But there are a number of use cases in which sensitive data isn't actually needed to achieve valuable results, and de-identified data is just as useful. Businesses sometimes need to assess the privacy/utility trade-off when making decisions on privacy policies.

Data-as-a-Service (DaaS)
Another name for the data provisioning model described under Self-service Analytics.

See also:

Self-service Analytics

Differencing Attack
An attack where an attacker can isolate an individual value by combining multiple aggregate statistics about a data set.
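
A toy illustration in Python, with invented names and salaries: each query on its own looks harmless, but their difference pins down one person's exact value:

```python
# Two individually "safe" aggregate queries are combined to reveal
# one person's salary. Names and figures are invented.
salaries = {"alice": 41000, "bob": 38000, "carol": 55000, "dave": 47000}

total_all = sum(salaries.values())                      # query 1: everyone
total_minus_carol = sum(
    v for k, v in salaries.items() if k != "carol"      # query 2: everyone but Carol
)
print(total_all - total_minus_carol)                    # 55000: Carol's exact salary
```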

See also:

Noise Addition

Direct Identifier
A value which could identify an individual on its own. Some direct identifiers are unique, such as social security numbers. Others are not unique but are highly identifying, such as name and address.

Differential Privacy
A guarantee that no one can learn anything significant about any individual from their inclusion in the data. It's a strong way to protect privacy of aggregate statistics - such as counts and averages. Differentially private statistics are engineered such that the statistic will be similar, regardless of whether a particular user is included in the data. Typically, a system achieves differential privacy by restricting the statistics that are released and adding random noise to the statistics.

Differential privacy has a parameter called epsilon, which controls the level of privacy. So long as epsilon is set appropriately, differential privacy is one of the strongest privacy guarantees available for practical use.
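
As a sketch, here's the Laplace mechanism (one common way to achieve differential privacy) applied to a simple count in Python; the epsilon parameter is the one described above:

```python
import math
import random

# A minimal sketch of the Laplace mechanism for a differentially private
# count. Sensitivity is 1 because adding or removing one individual changes
# a count by at most 1; smaller epsilon means more noise and more privacy.
def dp_count(true_count, epsilon, sensitivity=1.0):
    scale = sensitivity / epsilon
    u = random.random() - 0.5                      # uniform on [-0.5, 0.5)
    noise = -scale * math.copysign(1, u) * math.log(1 - 2 * abs(u))
    return true_count + noise

print(dp_count(1000, epsilon=0.5))   # close to 1000, but randomised
```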

See also:

Noise Addition, Aggregate Data, Reconstruction Attack

GDPR
The General Data Protection Regulation is a comprehensive EU data protection law, effective from 25 May 2018. Replacing the 1995 Data Protection Directive, it brings in new rights for individuals and new responsibilities for organisations processing personal data. It also provides data protection authorities with stronger powers, including large fines and the ability to stop organisations from processing personal data. The law applies to all processing of personal data within the EU, and can apply to organisations processing personal data outside of the EU if that data is about EU data subjects.

Generalisation
A method that transforms a value into a more general one. For instance, you can generalise a number by replacing it with an interval (e.g. 33 -> 30-35), you can generalise a day/month/year by replacing it with just month/year, or you can generalise a category by replacing it with a broader category (e.g. iPhone 6s -> iPhone).
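
The transformations above are straightforward to express in code. A sketch in Python, with hypothetical helper names:

```python
from datetime import date

# Generalisation sketches matching the examples above.
def generalise_age(age, width=5):
    low = (age // width) * width
    return f"{low}-{low + width}"          # 33 -> "30-35"

def generalise_date(d):
    return d.strftime("%m/%Y")             # drop the day

print(generalise_age(33))                  # "30-35"
print(generalise_date(date(1990, 3, 14)))  # "03/1990"
```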

See also:

Data Utility, K-anonymity

Homomorphic Encryption
A type of encryption that allows computation on encrypted data. For instance, additive homomorphic encryption enables numbers in encrypted form to be added. Several business drivers are pushing the need for meaningful data processing that doesn't expose sensitive information, most notably:
- The desire to outsource data processing to the cloud in a way that lets the processor work on the data without knowing what it is
- The emergence of new business models built around different contributors' proprietary data, where all contributors want to know aggregate data about the group, but no one wants to reveal their own sensitive data.
Homomorphic encryption can contribute towards secure multi-party computation.
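
To make 'computation on encrypted data' concrete, here's a toy Paillier cryptosystem, the classic additively homomorphic scheme, in Python. The tiny primes are for illustration only; real deployments use ~2048-bit moduli and a vetted library:

```python
import random
from math import gcd

# Toy Paillier cryptosystem (additively homomorphic). Demo-sized primes.
p, q = 293, 433
n = p * q
n2, g = n * n, n + 1
lam = (p - 1) * (q - 1) // gcd(p - 1, q - 1)   # lcm(p-1, q-1)
mu = pow(lam, -1, n)                           # valid because g = n + 1

def encrypt(m):
    r = random.randrange(1, n)                 # fresh randomness per message
    while gcd(r, n) != 1:                      # r must be coprime to n
        r = random.randrange(1, n)
    return (pow(g, m, n2) * pow(r, n, n2)) % n2

def decrypt(c):
    return (pow(c, lam, n2) - 1) // n * mu % n

c1, c2 = encrypt(20), encrypt(22)
print(decrypt((c1 * c2) % n2))   # 42: multiplying ciphertexts adds plaintexts
```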

See also:

Data Encryption, Secure Multi-Party Computation

Indirect Identifier
An indirect identifier is a value which cannot be used to identify an individual on its own, but could be identifying if combined with other indirect identifiers. Indirect identifiers are similar to quasi-identifiers; however, some organisations, including Eurostat and Privitar, distinguish between the two on the basis of how likely it is that the value could be used to re-identify an individual. For example, date of birth and postcode would be considered quasi-identifiers, as they are quite likely to be known by others and so could be used in a linkage attack. A value that is not widely known, for instance an internally generated value like a payment plan category, might instead be considered an indirect identifier.

See also:

Quasi-identifier

K-anonymity
A privacy model that's useful for protecting data in certain sharing scenarios (e.g. within an organisation or with trusted partners). The issue: anonymising direct identifiers isn't enough if individuals can be re-identified through quasi-identifiers (such as ZIP or post code, date of birth, etc.) when linked to other available data sets. k-anonymity ensures that no individual can be singled out in the data set through their quasi-identifiers, regardless of what other information an attacker holds. To achieve k-anonymity, every individual in the data set must share the same combination of quasi-identifying attributes with at least k-1 others. If that isn't the case, k-anonymity can be achieved by generalising the data (e.g. rather than showing an exact birthday, the data shows a year; rather than an age of 23, a range of 20-30).
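
Checking k for an already-generalised data set is simple: k is just the size of the smallest group of records sharing the same quasi-identifier values. A sketch in Python with invented data:

```python
from collections import Counter

# k is the size of the smallest group of records that share the same
# quasi-identifier values. Invented, already-generalised records.
records = [
    {"postcode": "SE1", "age_band": "20-30"},
    {"postcode": "SE1", "age_band": "20-30"},
    {"postcode": "SE1", "age_band": "20-30"},
    {"postcode": "N7",  "age_band": "30-40"},
    {"postcode": "N7",  "age_band": "30-40"},
]

groups = Counter((r["postcode"], r["age_band"]) for r in records)
print(min(groups.values()))   # 2: each record matches at least one other
```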

See also:

Generalisation, Quasi-identifier

Linkage Attack
A linkage attack attempts to re-identify individuals in an anonymised dataset by combining that data with background information. The 'linking' uses quasi-identifiers, such as postcode, gender, and salary, that are present in both sets to establish identifying connections. Many organisations aren't aware of the linkage risk involving quasi-identifiers, and while they may mask direct identifiers, they often don't think of masking, or generalising, the quasi-identifiers.

See also:

K-anonymity, Generalisation, Tracker Attack, Quasi-identifier

Location Data
Data about an individual's movements, often recorded by personal devices, such as smartphones and wearables. Location data poses a big privacy challenge, as it's very personally revealing, it's notoriously hard to anonymise, and it's vulnerable to linkage attacks. If you know approximately where someone was at four different points in time, that's enough to effectively reverse most location data anonymisation. Differential privacy can be a useful method for protecting aggregate statistics about location data.

See also:

Linkage Attack

Noise Addition
A type of perturbation that ensures privacy by adding 'noise', e.g. a random number, to a value. This slightly distorts the data, but still gives the analyst statistically useful information.
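
A minimal sketch in Python, with invented salaries: each value is slightly distorted, but the mean stays close to the truth:

```python
import random

# Noise addition sketch: jitter each salary a little so exact values are
# hidden, while the overall distribution (and mean) stays roughly intact.
salaries = [41000, 38000, 55000, 47000]
noisy = [round(s + random.gauss(0, 500)) for s in salaries]

print(noisy)                    # e.g. [41212, 37631, 55490, 46815]
print(sum(noisy) / len(noisy))  # still close to the true mean of 45250
```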

See also:

Data Perturbation, Differencing Attack

Personal Data
Under GDPR, “personal data” means any information that relates to an identifiable person who can be directly or indirectly identified by reference to an identifier. This means that a whole range of identifiers now constitute personal data (e.g. name, identification number, location data, even a web cookie). GDPR also defines a special category of 'sensitive personal data' that includes genetic and biometric data.
Anonymising such data can be extremely useful for organisations, as sufficient anonymisation takes it out of the scope of regulation such as GDPR. It's worth mentioning that sensitive data isn't restricted to just personal information: for an organisation, commercial data about transactions or financial performance can be highly sensitive, too. In the US, the term 'Personally Identifiable Information' (PII) is more common. And while legally, 'personal data' and 'PII' are not exact equivalents, they're often used interchangeably.

See also:

PII

PII
Personally Identifiable Information; see Personal Data.

See also:

Personal Data

Primary Identifier
A data point that allows you to directly identify an individual, such as a bank card number or a customer ID.

Privacy
Privacy means all the rules and processes that are in place in an organisation to manage and protect sensitive data while it's being used by authorised staff and technology (e.g. for analytics, Test & Dev, or machine learning). The terms "Privacy" and "Security" are sometimes used synonymously, but they mean different things: Security refers to measures that protect data against unauthorised access. Privacy and Security are complementary: organisations need to put both in place to sufficiently protect their data against breaches. Privacy technology plays a crucial role in making those data processes consistently safe, repeatable, and auditable, which is key for organisations looking to drive more value from their proprietary data.

See also:

Security

Privacy Engineering
A term that refers to the methodologies, tools and techniques that can help manage sensitive data within an organisation and provide acceptable levels of privacy. It's an emerging discipline within software engineering. The goal of privacy engineering is to help businesses and other data-processing organisations derive value from their data while minimising the risk that comes with analysing, sharing, and querying sensitive information.

Privacy Policy
The rules that define how sensitive information can be used and circulated within an organisation.

Protected Data Domain (PDD)
A Protected Data Domain (PDD) is a set of managed data releases that allows the privacy risks of that data to be evaluated and mitigated. It is the unit of data for privacy governance and management. The PDD records the data's lineage, permitted recipients, purpose, and lifetime, and what privacy protections have been applied. Data in PDDs can be watermarked, enabling traceability in the event of a data breach.

Data owners apply data protection controls to PDDs, such as pseudonymisation, generalisation, and differential privacy. Datasets within a PDD retain referential integrity and linkability, but are not directly linkable to another PDD. This separation enables the data owner to calculate risk scores for each PDD and to reason about the implications of publishing or sharing data.

Pseudonymisation
Pseudonymisation is defined in the GDPR to mean “the processing of personal data in such a manner that the personal data can no longer be attributed to a specific data subject without the use of additional information, provided that such additional information is kept separately and is subject to technical and organisational measures to ensure that the personal data are not attributed to an identified or identifiable natural person”. This is normally achieved by removing direct identifiers, such as a name or email address, and replacing them with a pseudonym. This process is also known as data masking or tokenisation. Unlike anonymisation, pseudonymisation can be set up to be reversible. While pseudonymised data remains personal data under the GDPR, the law encourages organisations to pseudonymise data whenever possible.
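
One common implementation is a keyed hash: the same input always yields the same pseudonym, and the key is the 'additional information' that must be kept separately. A sketch in Python (the key shown is an illustrative placeholder):

```python
import hashlib
import hmac

# Pseudonymisation sketch using a keyed hash: the same email always maps
# to the same pseudonym, and without the secret key (held separately from
# the data) the mapping cannot be recomputed.
SECRET_KEY = b"keep-me-somewhere-else"   # illustrative placeholder

def pseudonymise(value: str) -> str:
    return hmac.new(SECRET_KEY, value.encode(), hashlib.sha256).hexdigest()[:16]

print(pseudonymise("ada@example.com"))   # same input -> same pseudonym
```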

See also:

Anonymisation, Tokenisation, Personal Data, GDPR

Quasi-identifier
Also known as 'Indirect identifier'. A piece of information that does not itself identify an individual, but can do so when combined with other quasi-identifiers (e.g. date of birth; postcode; location data; salary). Quasi-identifiers are a big privacy issue, as a seemingly privacy-safe dataset can expose individuals when combined with another dataset containing information about the same individual (e.g. from the public domain).

See also:

Linkage Attack

Reconstruction Attack
A type of privacy attack on aggregate data that reconstructs a significant portion of a raw dataset. You can think of each aggregate statistic as an equation, where the variables represent the sensitive attributes. With enough information, the system of equations can be solved and all the sensitive attributes determined - a bit like Sudoku.
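
A toy example in Python, with invented figures: three published aggregates about a three-person department form a solvable system of equations:

```python
# Toy reconstruction attack. Published statistics about a three-person
# department (salaries a, b, c; all figures invented):
#   average of all three salaries          = 44000  ->  a + b + c = 132000
#   average of the two engineers (a, b)    = 39500  ->  a + b     = 79000
#   gap between the engineers' salaries    = 3000   ->  a - b     = 3000
c = 132000 - 79000         # the manager's exact salary: 53000
a = (79000 + 3000) // 2    # 41000
b = 79000 - a              # 38000
print(a, b, c)             # every 'sensitive attribute' recovered
```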

Redaction
A data masking technique that deletes all or part of the field value. Often, redaction will retain high level information, but remove detail, such as retaining only the first part of a postcode (SE1 8RT ⇝ SE1), or the last four digits of a credit card number.
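
Both examples above are easy to express in code. A sketch in Python:

```python
# Redaction sketches matching the examples above.
def redact_postcode(postcode):
    return postcode.split()[0]          # "SE1 8RT" -> "SE1"

def redact_card(card_number):
    return "*" * 12 + card_number[-4:]  # keep only the last four digits

print(redact_postcode("SE1 8RT"))       # SE1
print(redact_card("4929123456781234"))  # ************1234
```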

Secure Multi-Party Computation
A technology that allows several parties to collaborate and compute a function over their combined data, while keeping their respective inputs private.
In a business context, this has been a huge advance for businesses looking to collaborate on their data (e.g. to build new data products) without having to share raw datasets. Secure multi-party computation is also useful in cases where regulations or trust levels don't allow data to be shared between parties, e.g. across country borders. It often uses methods from homomorphic encryption.
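
The simplest MPC building block is additive secret sharing. In this Python sketch (illustrative only; real protocols add authentication and protection against malicious parties), three parties learn the total of their inputs without revealing their individual values:

```python
import random

# Additive secret sharing: each party splits its value into random shares
# that sum to the value mod Q, so no single share reveals anything.
Q = 2**61 - 1

def share(value, n_parties=3):
    shares = [random.randrange(Q) for _ in range(n_parties - 1)]
    shares.append((value - sum(shares)) % Q)   # shares sum to value mod Q
    return shares

inputs = [41000, 38000, 55000]                 # each party's private value
all_shares = [share(v) for v in inputs]

# Party i receives the i-th share of every input and publishes only the sum.
partial_sums = [sum(col) % Q for col in zip(*all_shares)]
print(sum(partial_sums) % Q)                   # 134000, with no input revealed
```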

See also:

Homomorphic Encryption

Security
The rules and processes that protect data against unauthorised access (such as cyber attacks). The term is often mistakenly used as a synonym for Privacy. Mature data organisations are increasingly recognising that Security and Privacy are complementary, and are putting measures in place to manage both authorised and unauthorised access to data consistently across the enterprise.

See also:

Privacy

Self-service Analytics
A data provisioning system that lets a large number of data users in an organisation access data for analytics, without needing to request access from IT for every query. This includes non-traditional data users, e.g. from the line of business, or so-called "citizen analysts". It's a huge step for organisations trying to break down data silos and derive value, insight, and competitive advantage from their data.

Such a central repository that includes rich and sensitive business data (e.g. on customers, transactions, financial performance) is highly vulnerable to privacy attacks. For an effective self-service analytics system to operate privacy-safely and at scale, its in-built data protection mechanisms need to be centrally managed and apply consistent privacy policies across all data. Also referred to as "Data-as-a-Service".

See also:

Big Data

Test & Dev Data
Data that's used in a non-production environment to test an application before moving it into production. It can be hard to synthetically generate data that's realistic and rich enough to confidently test for all use cases, so quite a few organisations still use raw data in their Test & Dev environments. This poses a big privacy risk.

Tokenisation
A data masking technique that replaces the field value with a 'token', a synthetic value that stands in for the real value. The pattern for the generated token is configurable and can be chosen to be of the same format as the source data to preserve data formats (e.g. for testing and development). Tokenisation can also be done consistently, meaning that the same value is always replaced with the same token, such that referential integrity is preserved in the dataset.
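
A sketch in Python of consistent, format-preserving tokenisation, using a simple lookup table (a hypothetical approach for illustration; production systems use secure token vaults or format-preserving encryption):

```python
import random

# Consistent, format-preserving tokenisation sketch: each digit is replaced
# so the token keeps the shape of the source value, and a lookup table makes
# the mapping consistent, preserving referential integrity.
token_map = {}

def tokenise(value):
    if value not in token_map:
        token_map[value] = "".join(
            random.choice("0123456789") if ch.isdigit() else ch for ch in value
        )
    return token_map[value]

print(tokenise("4929-1234-5678-1234"))   # e.g. "8311-0472-9936-0055"
print(tokenise("4929-1234-5678-1234"))   # same token again
```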

See also:

Test & Dev Data

Tracker Attack
A tracker attack is where an attacker can isolate an individual customer by making multiple aggregate queries (e.g. AVG() and SUM() in SQL) on a data set. Most aggregate query interfaces are naïve and only block queries with small result set sizes. This feature alone is insufficient to preserve privacy.

Watermark
A pattern that's embedded into a privacy-safe data set and is hard to detect or remove. It makes it possible to trace published data back to its source (e.g. in case of a data breach). It also acts as an additional deterrent to sharing data outside its intended use. Watermarking is a feature of Privitar Publisher.

Learn More

Please contact us to learn more