At the board-room level, data privacy can easily be viewed as a binary matter: employ data encryption and the company’s data assets are secure. However, for the security teams chartered with securing sensitive assets, the reality is not so simple. Many threats cannot be mitigated by encryption alone, which is why advanced statistical anonymization techniques, which achieve privacy whilst preserving data utility and analytical value, should be considered. In this post we look at some of the key privacy risks facing organizations today, and how privacy engineering technology can help mitigate them.
In direct lookup, attackers find a target in a data set by looking up the identifier that corresponds to them. All directly identifying information should be tokenized by default for most use cases, and in fact PCI-DSS compliance requires such practices.
An internal analyst has access to sensitive data, including direct identifiers such as social security numbers and credit card numbers. The analyst puts the data in a spreadsheet for business use.
With tokenization, the direct identifiers are replaced by tokens that have no value to an attacker, yet the analyst can still do their work. With redaction and suppression, fields and columns that the analyst does not need are removed, so that they pose no threat.
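The tokenization and suppression steps described above might be sketched as follows. This is a minimal illustration, not a production design: the keyed-hash (HMAC) approach, the `SECRET_KEY` value, the `tok_` prefix, the field names, and the sample record are all assumptions made for the example.

```python
import hashlib
import hmac

# Hypothetical key for illustration; a real key would live in a key management system.
SECRET_KEY = b"example-key-not-for-production"


def tokenize(value, key=SECRET_KEY):
    """Replace a direct identifier with a consistent keyed pseudonym."""
    digest = hmac.new(key, value.encode("utf-8"), hashlib.sha256).hexdigest()
    return "tok_" + digest[:16]


def protect(record, tokenize_fields, keep_fields):
    """Tokenize direct identifiers and suppress every field not explicitly kept."""
    out = {}
    for field, value in record.items():
        if field in tokenize_fields:
            out[field] = tokenize(str(value))
        elif field in keep_fields:
            out[field] = value
        # Any other field is suppressed: it never reaches the analyst's copy.
    return out


# Made-up record; the identifiers are well-known dummy values.
record = {
    "ssn": "078-05-1120",
    "card_number": "4111111111111111",
    "name": "John Smith",
    "loan_amount": 250000,
}

safe = protect(
    record,
    tokenize_fields={"ssn", "card_number"},
    keep_fields={"loan_amount"},
)
```

Because the same input always yields the same token, the analyst can still join and count records by token, while the `name` field is dropped entirely.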
In a linkage attack, an attacker combines information, including indirect identifiers and/or external information, to identify an individual. Once an individual is identified in a tokenized data set, the tokenization is “broken” as it is effectively reversed for that individual.
The internal analyst from above now has a tokenized copy of the sensitive data set and wishes to learn the salary of an individual customer, John Smith.
The analyst knows that John Smith took out a loan on a given date (e.g. 4-May-2016) for a certain tenor (e.g. 30 years). If that was the only 30-year loan taken out on that day, the analyst can identify John Smith’s record and therefore isolate the tokens used for John Smith within the data set. The analyst can now discover personal information about John Smith throughout the data set, e.g. salary payments or credit history information.
Generalization solves this by bucketing (a.k.a. “banding” or “blurring”) individual records into indistinguishable groups of at least k records, a property known as k-anonymity. Individuals within a group share the same indirect identifiers, so attackers cannot isolate any one of them within the bucket. In this example, the loan dates may be blurred to exclude the day, e.g. 4-May-2016 becomes May-2016; 30- and 40-year loans both become 35-year loans; and so forth.
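As a rough sketch of the generalization step, the snippet below measures the smallest group size (the k in k-anonymity) before and after blurring. The loan records, token values, and the 20-year tenor banding rule are assumptions chosen so that 30- and 40-year loans both map to 35; a real deployment would choose bands to suit its data.

```python
from collections import Counter
from datetime import date

# Made-up tokenized loan records; date and tenor are the indirect identifiers.
loans = [
    {"token": "tok_a1", "loan_date": date(2016, 5, 4), "tenor_years": 30},
    {"token": "tok_b2", "loan_date": date(2016, 5, 9), "tenor_years": 40},
    {"token": "tok_c3", "loan_date": date(2016, 5, 17), "tenor_years": 40},
    {"token": "tok_d4", "loan_date": date(2016, 5, 23), "tenor_years": 30},
]


def raw_key(loan):
    """Indirect identifiers exactly as stored."""
    return (loan["loan_date"], loan["tenor_years"])


def generalized_key(loan):
    """Drop the day from the date and band the tenor."""
    month = f"{loan['loan_date'].year}-{loan['loan_date'].month:02d}"
    # 20-year bands centred on 15, 35, 55, ... so tenors of 25 to 44 map to 35.
    tenor_band = ((loan["tenor_years"] - 5) // 20) * 20 + 15
    return (month, tenor_band)


def smallest_group(records, key):
    """Size of the smallest group of records sharing the same key."""
    return min(Counter(key(r) for r in records).values())


k_before = smallest_group(loans, raw_key)          # every record is unique
k_after = smallest_group(loans, generalized_key)   # all four share one bucket
```

With these records, each loan is unique on its raw date and tenor, whereas after generalization all four fall into the same (month, tenor band) bucket.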
In a homogeneity attack, an attacker can discover sensitive information about an individual simply by their inclusion in a k-anonymous group.
The internal analyst from above has a tokenized and generalized copy of the sensitive data set and wishes to learn whether John Smith has ever declared bankruptcy, which is in the data set as part of the loan application.
As above, the analyst knows that John Smith took out a 30-year loan on 4-May-2016, finds the generalized “bucket” of five customers who took out 30-year loans in May, and knows John Smith must be part of that bucket. All five of those customers have declared bankruptcy, so the analyst may conclude that John Smith has declared bankruptcy.
ℓ-diversity solves this by ensuring that there is reasonable diversity within the bucket. In this case, if ℓ = 2, at least one of the customers in the bucket has a different value for bankruptcy declaration, thwarting the attack.
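A simple way to check for this kind of homogeneity is to count the distinct sensitive values in each bucket. The sketch below does that for the bankruptcy example; it implements only the simplest "distinct values" reading of the diversity requirement, and the field names and records are hypothetical.

```python
from collections import defaultdict


def distinct_diversity(records, bucket_field, sensitive_field):
    """Smallest number of distinct sensitive values found in any bucket."""
    seen = defaultdict(set)
    for r in records:
        seen[r[bucket_field]].add(r[sensitive_field])
    return min(len(values) for values in seen.values())


# Made-up bucket of five customers who all declared bankruptcy.
homogeneous = [{"bucket": "May-2016/30y", "bankrupt": True} for _ in range(5)]

# The same bucket where one customer has a different value.
diverse = homogeneous[:4] + [{"bucket": "May-2016/30y", "bankrupt": False}]

diversity_homogeneous = distinct_diversity(homogeneous, "bucket", "bankrupt")
diversity_diverse = distinct_diversity(diverse, "bucket", "bankrupt")
```

A bucket scoring 1 reveals its sensitive value to anyone who can place an individual in it; a release process could refuse to publish, or further generalize, any bucket below the chosen threshold.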
A specific-value linkage attack is where an attacker can isolate an individual customer by finding specific data within a data set. This often applies to transactional data, e.g. credit card transactions.
An IT administrator has access to tokenized credit card transactions, e.g. system logs, and wishes to obtain a colleague’s credit card history (e.g. to learn whether the colleague has been having an affair).
The IT admin goes to coffee with the colleague, noting the time (say, 10:30am), amount (say, £2.40), and the name of the coffee shop. The IT administrator then searches the transaction data set for a purchase of £2.40 at around 10:30am at the coffee shop in question. This isolates the colleague’s token, allowing the IT administrator access to full credit card history.
Perturbation prevents this by blurring the data slightly. For example, the purchase amount may be shifted by up to ± £0.20, meaning £2.40 may appear as anything from £2.20 to £2.60; the time may be shifted by up to ± two hours, so 10:30am would appear somewhere between 8:30am and 12:30pm. This makes it difficult, if not impossible, to isolate a particular transaction.
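The perturbation described above could look roughly like the following. The uniform noise distribution, the fixed seed, and the transaction date are assumptions for the sake of a reproducible example; the text does not specify how the noise is drawn.

```python
import random
from datetime import datetime, timedelta


def perturb(amount, timestamp, rng, amount_spread=0.20, time_spread_hours=2.0):
    """Shift the amount and the time by a bounded random offset."""
    noisy_amount = round(amount + rng.uniform(-amount_spread, amount_spread), 2)
    offset = timedelta(hours=rng.uniform(-time_spread_hours, time_spread_hours))
    return noisy_amount, timestamp + offset


# Seeded only so the example is repeatable; the date is made up.
rng = random.Random(42)
original_time = datetime(2024, 1, 15, 10, 30)
noisy_amount, noisy_time = perturb(2.40, original_time, rng)
```

The perturbed purchase lands somewhere between 2.20 and 2.60 and between 8:30am and 12:30pm, so a search for the exact amount and time no longer singles out one token.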
A tracker attack is where an attacker can isolate an individual customer by making multiple aggregate queries (e.g. AVG() and SUM() in SQL) on a data set. Most aggregate query interfaces are naïve and only block queries with small result set sizes. This feature alone is insufficient to preserve privacy.
An HR consultant has access to a controlled query interface that enables aggregate queries on an HR data set. The HR consultant wants to obtain the salary for an individual employee, Jane Smith.
The HR consultant knows that Jane Smith is the only Managing Director based in Austin, Texas. The HR consultant queries for the SUM() of all employee salaries in Austin, Texas, then queries for the SUM() of all employee salaries in Austin, Texas, where the employee is not a Managing Director. By subtracting the result of the second query from the first, the HR consultant can learn Jane Smith’s salary.
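The subtraction at the heart of this attack can be reproduced in a few lines. The sketch below models a naïve interface that only blocks small result sets; the employee records, salaries, and the `MIN_RESULT_SIZE` threshold are invented for illustration.

```python
# Made-up HR data set.
employees = [
    {"name": "Jane Smith", "office": "Austin", "title": "Managing Director", "salary": 310_000},
    {"name": "Employee B", "office": "Austin", "title": "Analyst", "salary": 95_000},
    {"name": "Employee C", "office": "Austin", "title": "Analyst", "salary": 88_000},
    {"name": "Employee D", "office": "Austin", "title": "Engineer", "salary": 102_000},
    {"name": "Employee E", "office": "Boston", "title": "Managing Director", "salary": 305_000},
]

MIN_RESULT_SIZE = 3  # hypothetical threshold of the naive interface


def sum_salaries(predicate):
    """SUM() over matching employees, refusing small result sets."""
    matched = [e for e in employees if predicate(e)]
    if len(matched) < MIN_RESULT_SIZE:
        raise PermissionError("result set too small")
    return sum(e["salary"] for e in matched)


def in_austin(e):
    return e["office"] == "Austin"


def austin_non_md(e):
    return in_austin(e) and e["title"] != "Managing Director"


def austin_md(e):
    return in_austin(e) and e["title"] == "Managing Director"


# Asking about Jane Smith directly is blocked ...
try:
    sum_salaries(austin_md)
    direct_query_blocked = False
except PermissionError:
    direct_query_blocked = True

# ... but two permitted queries reveal her salary by subtraction.
inferred_salary = sum_salaries(in_austin) - sum_salaries(austin_non_md)
```

Both permitted queries cover at least three employees, so the size check never fires, yet their difference is exactly one person's salary.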
A differentially private controlled query interface thwarts this attack through several methods, including noise addition. For example, the interface will add random noise to a result, increasing the noise as the attacker attempts to drill down to fewer and fewer records. An attempt at a tracker attack as above yields information that is so imprecise that it is useless to the attacker.
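To illustrate noise addition, the sketch below applies the basic Laplace mechanism to a SUM() query and replays the two-query subtraction. It is an assumption-laden simplification: it uses a fixed noise scale rather than the adaptive scaling described above, and the salary cap `CAP`, the privacy parameter `EPSILON`, and the salary figures are made up.

```python
import random

CAP = 500_000    # assumed upper bound on any single salary
EPSILON = 0.5    # assumed privacy parameter per query


def laplace_noise(scale, rng):
    """Laplace(0, scale) sample: difference of two exponentials with mean `scale`."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def private_sum(values, cap, epsilon, rng):
    """SUM() with Laplace noise calibrated to the most one record can change it."""
    clipped = [min(max(v, 0), cap) for v in values]
    sensitivity = cap  # adding or removing one person shifts the sum by at most cap
    return sum(clipped) + laplace_noise(sensitivity / epsilon, rng)


# Made-up Austin salaries; the first belongs to the targeted Managing Director.
austin_all = [310_000, 95_000, 88_000, 102_000]
austin_non_md = austin_all[1:]


def tracker_estimate(rng):
    """The attacker's estimate: difference of two noisy sums."""
    return (private_sum(austin_all, CAP, EPSILON, rng)
            - private_sum(austin_non_md, CAP, EPSILON, rng))


rng = random.Random(0)
estimate = tracker_estimate(rng)
```

With these settings each answer carries noise on the order of a million, far more than any individual salary, so the subtraction tells the attacker essentially nothing. A total over many thousands of employees would be large enough for the same noise to matter much less.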