As more personal information is collected about individuals, the threat to privacy becomes ever greater. New technology brings with it more sophisticated methods for obtaining sensitive information by malicious means.

Protecting your customers’ sensitive data can be daunting. To defend yourself effectively, you need to understand what you’re up against.

Often the most valuable insights in data science come from connecting different data sources. So it should come as no surprise that sensitive data can also be uncovered by linking data. This type of privacy violation is termed a linkage attack.

A linkage attack attempts to re-identify individuals in an anonymised dataset by combining that data with another dataset. The ‘linking’ uses indirect identifiers, also known as quasi-identifiers, that appear in both datasets.

Quasi-identifiers are pieces of information that are not themselves unique identifiers, but can become identifying when combined with other quasi-identifiers. For instance, an individual’s date of birth and postcode are quasi-identifiers; each one alone is not sufficient to identify an individual, but in combination they usually are. Salary, transaction history, overdraft limit and location data are further examples of quasi-identifiers, among many others.

Linkage attacks are powerful, because seemingly innocuous attributes often suffice to uniquely identify an individual.
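To make the mechanics concrete, here is a minimal sketch of a linkage attack in Python. Everything in it is hypothetical: the two tiny datasets, the field names and the `link` helper are invented for illustration, and the choice of date of birth, postcode and sex as the shared quasi-identifiers is an assumption.

```python
from collections import defaultdict

# Hypothetical "anonymised" release: direct identifiers removed,
# quasi-identifiers (date of birth, postcode, sex) left intact.
medical_records = [
    {"dob": "1950-03-14", "postcode": "AB1 2CD", "sex": "M", "diagnosis": "condition A"},
    {"dob": "1982-11-02", "postcode": "AB1 2CD", "sex": "F", "diagnosis": "condition B"},
    {"dob": "1982-11-02", "postcode": "EF3 4GH", "sex": "F", "diagnosis": "condition C"},
]

# Hypothetical public dataset that carries names alongside the same attributes.
electoral_roll = [
    {"name": "Person One", "dob": "1950-03-14", "postcode": "AB1 2CD", "sex": "M"},
    {"name": "Person Two", "dob": "1982-11-02", "postcode": "AB1 2CD", "sex": "F"},
    {"name": "Person Three", "dob": "1975-06-30", "postcode": "EF3 4GH", "sex": "F"},
]

QUASI_IDENTIFIERS = ("dob", "postcode", "sex")


def link(anonymised, auxiliary, keys=QUASI_IDENTIFIERS):
    """Return (name, record) pairs whose quasi-identifiers match exactly one person."""
    # Index the auxiliary data by its quasi-identifier values.
    index = defaultdict(list)
    for person in auxiliary:
        index[tuple(person[k] for k in keys)].append(person)

    matches = []
    for record in anonymised:
        candidates = index.get(tuple(record[k] for k in keys), [])
        if len(candidates) == 1:  # unambiguous match: the record is re-identified
            matches.append((candidates[0]["name"], record))
    return matches


reidentified = link(medical_records, electoral_roll)
for name, record in reidentified:
    print(name, "->", record["diagnosis"])
```

In this toy data, two of the three ‘anonymised’ records match exactly one named person, even though no direct identifier was ever released.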

Linkage attacks first hit the headlines in 1997, when the Massachusetts Group Insurance Commission (GIC) released hospital visit data to researchers for the purpose of improving healthcare and controlling costs. William Weld, then Governor of Massachusetts, reassured the public that patient privacy was adequately protected because direct identifiers had been deleted.

In response, Latanya Sweeney, then an MIT graduate student in computer science, bought an electoral roll database for $20. By combining this data with the GIC records, she was able to find William Weld’s personal health records with ease.

To resist a linkage attack, the quasi-identifiers in a dataset must be transformed to achieve k-anonymity. This means that even when someone has auxiliary information, each record is still indistinguishable from at least k-1 other records with respect to its quasi-identifiers.

This technique is appropriate for ‘rectangular’ or ‘model ready’ data. This form of data is typically used by data scientists and data modellers and is characterised by having one row per data subject with each row having a number of ‘features’ exposed as attributes.
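As a rough illustration of k-anonymity on rectangular data, the sketch below measures k as the size of the smallest group of rows that share the same quasi-identifier values, before and after a simple generalisation. The rows, the `k_anonymity` and `generalise` helpers, and the particular generalisation (birth decade and outward postcode) are all assumptions made for the example, not a description of any specific product’s method.

```python
from collections import Counter


def k_anonymity(rows, quasi_identifiers):
    """Return the size of the smallest group of rows sharing the same quasi-identifier values."""
    groups = Counter(tuple(row[q] for q in quasi_identifiers) for row in rows)
    return min(groups.values())


def generalise(row):
    """Illustrative generalisation: keep only the birth decade and the outward postcode."""
    decade = row["dob"][:3] + "0s"  # "1982-11-02" becomes "1980s"
    return {**row, "dob": decade, "postcode": row["postcode"].split()[0]}


# Hypothetical one-row-per-subject data.
rows = [
    {"dob": "1982-11-02", "postcode": "AB1 2CD", "diagnosis": "condition A"},
    {"dob": "1985-01-19", "postcode": "AB1 9XY", "diagnosis": "condition B"},
    {"dob": "1951-03-14", "postcode": "EF3 4GH", "diagnosis": "condition C"},
    {"dob": "1958-07-07", "postcode": "EF3 1AA", "diagnosis": "condition D"},
]

qi = ("dob", "postcode")
k_before = k_anonymity(rows, qi)
k_after = k_anonymity([generalise(r) for r in rows], qi)
print(k_before, k_after)
```

Here every raw row is unique on its quasi-identifiers (k = 1), while after generalisation each row shares its values with one other row (k = 2). The price is coarser data, which is the trade-off between privacy and utility.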

One particular attack is where an adversary knows the details of a specific transaction for an unusual amount, such as 123.45. By finding this transaction in the dataset, they can then identify the individual who made it. This attack relies on the adversary knowing the transaction’s value, date, and merchant or payee, and on the transaction being unusual.

To thwart these attacks, the dataset is transformed so that the targeted values and dates are adjusted but remain accurate enough for useful analysis.
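The text does not say how the values and dates are adjusted, so the following is only one possible sketch: it rounds each amount to a fixed step and moves each date to the Monday of its week. The `blur_transaction` helper, the step of 5 and the weekly granularity are assumptions chosen for illustration.

```python
from datetime import date, timedelta


def blur_transaction(amount, when, step=5.0):
    """Round the amount to the nearest `step` and move the date to the Monday of its week."""
    blurred_amount = round(amount / step) * step
    blurred_date = when - timedelta(days=when.weekday())  # weekday() is 0 for Monday
    return blurred_amount, blurred_date


# A distinctive amount on a known day becomes a common amount in a known week.
amount, when = blur_transaction(123.45, date(2024, 3, 14))
print(amount, when)
```

After this transformation an adversary searching for the exact amount and date no longer finds a unique match, while weekly totals and approximate spending levels stay usable for analysis.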

Advances in privacy engineering can help protect against these attacks. Privitar enables companies to take a comprehensive approach to privacy, by understanding and minimising privacy risks, whilst maximising data utility.