The A-Z of Privacy
Making sense of the jargon behind data protection
But we also know how important privacy is to organizations looking to derive value from data. It’s too important to remain obscure, so to help out we’ve compiled a glossary featuring some of the most important terms in privacy engineering. We’ll keep updating the list, so make sure to check back regularly. And if you can’t find a term you’re dying to understand, get in touch and we’ll do our best to help you out.
Learn the jargon. It’s useful for your job and it’ll make you a hit at dinner parties.
A data ‘summary’ in which the values of a number of rows are grouped together to form a single value, such as an average salary per department or the sum of all sales across a category. Aggregates are also widely used as inputs to machine learning models.
Since they don’t identify any one individual, such data aggregates may at first sight look safe to publish. But statistical summaries can be vulnerable to reconstruction or differencing attacks, which allow an attacker to identify individuals by making multiple queries.
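A differencing attack can be sketched in a few lines. The salary data and the aggregate-only query interface below are hypothetical, purely for illustration:

```python
# Hypothetical salary table; the attacker can only run aggregate queries.
salaries = {"alice": 52000, "bob": 48000, "carol": 67000}

def avg_query(names):
    """An aggregate-only interface: returns the average salary of a group."""
    return sum(salaries[n] for n in names) / len(names)

# Neither query targets Carol alone, yet combining them reveals her
# exact salary: the total of everyone minus the total of the rest.
avg_all = avg_query(["alice", "bob", "carol"])
avg_others = avg_query(["alice", "bob"])
carol_salary = round(avg_all * 3 - avg_others * 2)
```

With three employees, two perfectly innocuous-looking averages are enough to recover one person's exact value.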
Masking is one of the most commonly used protection mechanisms for sensitive data in organizations. It protects the privacy of individuals by obscuring direct identifiers (such as name, address, account number, email, or phone number), making it impossible to look anybody up in the dataset by those identifiers.
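As a rough sketch, masking might replace direct identifiers with fixed or hashed stand-ins. The field names and the keyed-hash scheme here are illustrative assumptions, not a prescribed implementation:

```python
import hashlib

def mask_email(email, key="illustrative-secret"):
    """Replace an email with an opaque, irreversible stand-in (keyed hash)."""
    digest = hashlib.sha256((key + email).encode()).hexdigest()
    return f"user-{digest[:12]}@masked.invalid"

record = {"name": "Jane Doe", "email": "jane@example.com", "balance": 120.50}
masked = dict(record, name="REDACTED", email=mask_email(record["email"]))
# The record can still be analysed (e.g. by balance), but it no longer
# supports looking Jane up by name or email.
```

Keying the hash matters: an unkeyed hash of an email address can be reversed by simply hashing a list of known addresses.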
A measure of how useful a data set is (for business insight, analytics, machine learning, and so on). Removing personal data or adding noise can reduce a data set's utility.
There are a number of use cases in which sensitive data isn’t actually needed to achieve valuable results – and de-identified data is just as useful (we talk about it in this blog post). Businesses sometimes need to assess the privacy/utility trade-off when making decisions on privacy policies.
A value which could identify an individual on its own. Some direct identifiers are unique, such as social security numbers. Others are not unique but are highly identifying, such as name and address.
Several business drivers are pushing the need for meaningful data processing that doesn’t expose sensitive information, first and foremost:
- The desire to outsource data processing to the cloud in a way that lets the cloud provider process the data without learning its contents.
- The emergence of new business models built around different contributors’ proprietary data, where all contributors want to know aggregate data about the group, but no one wants to reveal their own sensitive data.
Homomorphic encryption can also serve as a building block for secure multi-party computation.
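The second driver above, learning a group aggregate without any contributor revealing its own value, can be illustrated with additive secret sharing, a basic building block of secure multi-party computation. This is a toy sketch, not production cryptography:

```python
import random

MOD = 2**32  # all arithmetic is modular, so individual shares look random

def share(value, n=3):
    """Split a private value into n random shares that sum to it mod MOD."""
    parts = [random.randrange(MOD) for _ in range(n - 1)]
    parts.append((value - sum(parts)) % MOD)
    return parts

private_values = [120, 340, 95]            # each known only to its owner
shares = [share(v) for v in private_values]

# Party i collects the i-th share from every contributor and publishes
# only a partial sum; no party ever sees another party's raw value.
partials = [sum(s[i] for s in shares) % MOD for i in range(3)]
group_total = sum(partials) % MOD          # the aggregate, and nothing more
```

Each share on its own is a uniformly random number; only the final recombination reveals the group total.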
Perturbation adds small amounts of random noise to field values or query results. It protects them against privacy attacks which rely on knowledge of specific values. Perturbation-based approaches must ensure that the noise magnitude is small enough that the valuable insights in the dataset are preserved.
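A minimal sketch of perturbation using Laplace-distributed noise follows. The noise scale here is an arbitrary illustrative choice; calibrating it properly (e.g. for differential privacy) is a separate topic:

```python
import random

def laplace_noise(scale, rng=random):
    """Sample Laplace(0, scale) noise as the difference of two exponentials."""
    return scale * (rng.expovariate(1.0) - rng.expovariate(1.0))

def perturbed_sum(values, scale=100.0):
    """Return a noisy sum: close to the truth, but never exactly it."""
    return sum(values) + laplace_noise(scale)

salaries = [52000, 48000, 67000, 51000]
noisy = perturbed_sum(salaries)  # close to 218000, but randomized
```

Because the noise averages out to zero, aggregate insights survive while any single reported value stops being a reliable clue about an individual.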
Personally identifiable information, see personal data.
A data point that allows you to directly identify an individual, such as a bank card number or a customer ID.
A term that refers to the methodologies, tools and techniques that can help manage sensitive data within an organization and provide acceptable levels of privacy. It’s an emerging discipline within software engineering. The goal of privacy engineering is to help businesses and other data-processing organizations derive value from their data while minimizing the risk that comes with analyzing, sharing, and querying sensitive information.
A Protected Data Domain (PDD) is a set of managed data releases that allows the privacy risks of that data to be evaluated and mitigated. It is the unit of data for privacy governance and management. The PDD records the data's lineage, permitted recipients, purpose, and lifetime, as well as which privacy protections have been applied. Data in PDDs can be watermarked, enabling traceability in the event of a data breach.
Data owners apply data protection controls to PDDs, such as pseudonymization, generalization, and differential privacy. Datasets within a PDD retain referential integrity and linkability, but are not directly linkable to another PDD. This separation enables the data owner to calculate risk scores for each PDD and to reason about the implications of publishing or sharing data.
A data masking technique that deletes all or part of the field value. Often, redaction will retain high level information, but remove detail, such as retaining only the first part of a postcode (SE1 8RT ⇝ SE1), or the last four digits of a credit card number.
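Both redaction patterns mentioned above can be sketched directly. The formats assumed here, a UK postcode with a space and a spaced 16-digit card number, are illustrative:

```python
def redact_postcode(postcode):
    """Keep only the outward code: 'SE1 8RT' -> 'SE1'."""
    return postcode.split(" ")[0]

def redact_card(card_number):
    """Keep only the last four digits; everything else becomes '*'."""
    digits = card_number.replace(" ", "")
    return "*" * (len(digits) - 4) + digits[-4:]
```

The high-level information (the postcode district, the card's final digits) remains usable for analysis or display, while the identifying detail is gone.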
A data provisioning system that lets a large number of data users in an organization access data for analytics, without needing to request access from IT for every query. This includes non-traditional data users, e.g. from the line of business, or so-called “citizen analysts”. It’s a huge step for organizations trying to break down data silos and derive value, insight, and competitive advantage from their data.
Such a central repository, containing rich and sensitive business data (e.g. on customers, transactions, and financial performance), is highly vulnerable to privacy attacks. For a self-service analytics system to operate safely and at scale, its built-in data protection mechanisms must be centrally managed and apply consistent privacy policies across all data. Also referred to as “Data-as-a-Service”.
A data masking technique that replaces the field value with a ‘token’, a synthetic value that stands in for the real value. The pattern for the generated token is configurable and can be chosen to be of the same format as the source data to preserve data formats (e.g. for testing and development).
Tokenization can also be done consistently, meaning that the same value is always replaced with the same token, such that referential integrity is preserved in the dataset.
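A sketch of consistent, format-preserving tokenization using an in-memory token vault follows. Real systems persist and protect this mapping; the 16-digit numeric format is an assumption for illustration:

```python
import secrets

class Tokenizer:
    """Consistent tokenization: the same value always gets the same token."""

    def __init__(self, length=16):
        self._vault = {}      # value -> token; must be secured in practice
        self._length = length

    def tokenize(self, value):
        if value not in self._vault:
            # Random digits of the same length preserve the source format.
            token = "".join(secrets.choice("0123456789")
                            for _ in range(self._length))
            self._vault[value] = token
        return self._vault[value]

tok = Tokenizer()
t1 = tok.tokenize("4929123456789012")
t2 = tok.tokenize("4929123456789012")  # identical: referential integrity holds
```

Because repeated inputs map to the same token, joins across tables keep working even though no real card numbers remain.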
A tracker attack isolates an individual by combining multiple aggregate queries (e.g. AVG() and SUM() in SQL) over a data set. Most aggregate query interfaces are naïve and only block queries with small result sets; that safeguard alone is insufficient to preserve privacy.
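A sketch of why a minimum result-set size alone fails: by padding both queries with the same known group, each query passes the size check, yet their difference isolates one person. The data and interface below are hypothetical:

```python
salaries = {"alice": 52000, "bob": 48000, "carol": 67000,
            "dave": 51000, "erin": 49000}
MIN_RESULT_SET = 3  # the naive interface blocks smaller result sets

def sum_query(names):
    """Aggregate interface that only enforces a minimum result-set size."""
    if len(names) < MIN_RESULT_SET:
        raise PermissionError("query blocked: result set too small")
    return sum(salaries[n] for n in names)

# Querying Erin alone would be blocked, but padding with the same known
# group makes both queries legal, and their difference is her salary.
with_erin = sum_query(["alice", "bob", "carol", "erin"])
without_erin = sum_query(["alice", "bob", "carol"])
erin_salary = with_erin - without_erin
```

Defending against trackers requires stronger mechanisms, such as the noise addition described under perturbation, rather than result-set thresholds alone.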
Ready to learn more?
Our team of data privacy experts are here to answer your questions and discuss how data privacy can fuel your business.