Recently at Privitar we attended a recruiting event, as one of several companies in London presenting interesting projects to a pool of data scientists. As good data geeks, naturally my colleagues and I spent far too long trying to work out which would be the best slot to choose for our presentation – is it better to be early or late in the day? Definitely not the session right after lunch, but is last before lunch also bad? Eventually we went first, and pitched some of the interesting challenges we see in anonymising very rich and complex data.

It turned out to be a surprisingly good move – one data scientist later told us that after hearing our presentation she winced in horror through subsequent pitches, as company after company described how they were working directly with highly private data.

The truth is, there’s typically no good reason for a data scientist or analyst to have access to sensitive primary identifiers such as names, phone numbers, or social security numbers. These should simply be dropped from the dataset. If they are needed for subsequent action (for example, to contact customers), that can still be done without the data scientist ever seeing the contact details: they create a model or rule, which is later applied to the original dataset.
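
In a pandas workflow, for instance, this is a one-line operation. Here’s a minimal sketch – the dataset and column names are hypothetical:

```python
import pandas as pd

# Hypothetical dataset mixing primary identifiers with analytic fields.
df = pd.DataFrame({
    "name": ["Jason", "Jane"],
    "phone_number": ["555-0100", "555-0199"],
    "ssn": ["123-45-6789", "987-65-4321"],
    "age": [34, 41],
    "monthly_spend": [120.5, 87.0],
})

# Drop the primary identifiers before handing data to analysts;
# models and rules are built on the remaining columns only.
identifiers = ["name", "phone_number", "ssn"]
df_for_analysis = df.drop(columns=identifiers)
print(df_for_analysis)
```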

The only valid use for such primary identifiers is in linking records between datasets (for example, joining across tables) or grouping multiple transactions by user. But even then, there’s no need to expose the raw data. The right thing to do is to consistently tokenise the identifiers, replacing each unique value with a randomly generated token. So Jason might be replaced by ABC123, Jane by HBY940, and so on. If this tokenisation is done consistently, it’s still possible to group transactions or join records across datasets, but it’s not possible to recover the original identifier.
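
Here’s a minimal sketch of what consistent random tokenisation might look like in Python. The token format and in-memory mapping are assumptions for illustration; a real system would persist the mapping securely and keep it well away from analysts:

```python
import secrets
import string

_token_map = {}       # identifier -> token; must itself be kept secret
_used_tokens = set()  # guard against the rare random collision

def tokenise(identifier: str) -> str:
    """Return the same randomly generated token every time for the same identifier."""
    if identifier in _token_map:
        return _token_map[identifier]
    while True:
        # Random 6-character alphanumeric token, e.g. "ABC123".
        token = "".join(
            secrets.choice(string.ascii_uppercase + string.digits)
            for _ in range(6)
        )
        if token not in _used_tokens:
            break
    _used_tokens.add(token)
    _token_map[identifier] = token
    return token

# Consistency is the key property: the same input always yields the same
# token, so group-bys and joins still work on the tokenised data.
assert tokenise("Jason") == tokenise("Jason")
assert tokenise("Jason") != tokenise("Jane")
```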

When tokenising, be sure to generate each token randomly. It’s a mistake to use a deterministic function, as happened with a set of New York taxi data released recently under a freedom of information request. The taxi identifiers, called medallion numbers, were hashed with the MD5 hash function. At first glance this seemed like a good approach – each identifier was converted into a random-looking string. But because there is only a fairly small range of possible medallion numbers, it’s easy and fast to hash every one of them, build a dictionary mapping each hash back to its medallion number, and then deanonymise the data by looking up every token in the dataset. If each token is randomly chosen, there is no function to invert, and no such dictionary can be built.
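
To see how cheap this attack is, here’s a sketch in Python. It assumes a simplified, hypothetical medallion format of one digit, one letter, and two digits – roughly 26,000 candidates, all hashable in a fraction of a second. The real formats differ, but the candidate space is similarly tiny:

```python
import hashlib
import itertools
import string

def md5_hex(value: str) -> str:
    return hashlib.md5(value.encode("utf-8")).hexdigest()

# Enumerate every candidate medallion number once and hash it,
# building a dictionary from hash back to plaintext.
dictionary = {
    md5_hex(f"{d1}{letter}{d2}{d3}"): f"{d1}{letter}{d2}{d3}"
    for d1, letter, d2, d3 in itertools.product(
        string.digits, string.ascii_uppercase, string.digits, string.digits
    )
}

# Any MD5-hashed token in the released dataset is now a simple lookup.
token = md5_hex("5X55")       # a token as it would appear in the data
print(dictionary[token])      # -> "5X55": the medallion is recovered
```

The same attack works against any deterministic function over a small input space, salted or not, as long as the attacker knows (or can guess) the function. Random tokens leave nothing to enumerate.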