Jul 26, 2016
Recently at Privitar we attended a recruiting event, as one of several companies in London presenting interesting projects to a pool of data scientists. As good data geeks, naturally my colleagues and I spent far too long trying to work out which would be the best slot to choose for our presentation – is it better to be early or late in the day? Definitely not the session right after lunch, but is last before lunch also bad? Eventually we went first, and pitched some of the interesting challenges we see in anonymising very rich and complex data.
It turned out to be a surprisingly good move – one data scientist later told us that after hearing our presentation she winced in horror through subsequent pitches, as company after company described how they were working directly with highly private data.
The truth is, there’s typically no good reason for a data scientist or analyst to have access to sensitive primary identifiers, such as name, phone number, or social security number. These should just be dropped from the dataset. If they are needed for subsequent action (for example to contact customers), then that can still be done without the data scientist needing the contact details. Instead, the data scientist creates a model or rule on the de-identified data, which is later applied to the original dataset.
The only valid use for such primary identifiers is in linking records between datasets (for example, joining across tables), or grouping multiple transactions by user. But even then, there’s still no need to expose the raw data. The right thing to do is to consistently tokenise the identifiers, replacing each unique value with a randomly generated token. So Jason might be replaced by ABC123, Jane by HBY940, and so on. If this tokenisation is done consistently, it’s still possible to group transactions or join records across datasets, but it’s not possible to recover the original identifier.
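As a rough illustration of consistent tokenisation, here is a minimal Python sketch. The `Tokeniser` class, the token length, and the sample transactions are all hypothetical, made up for this example rather than taken from any product. It assumes the identifier-to-token mapping is held somewhere the analyst cannot reach.

```python
import secrets
import string

ALPHABET = string.ascii_uppercase + string.digits


class Tokeniser:
    """Hypothetical helper: maps each distinct identifier to a random token."""

    def __init__(self, length=8):
        self._length = length
        self._vault = {}      # identifier -> token; assumed to be stored securely
        self._issued = set()  # tokens already handed out, to avoid collisions

    def tokenise(self, identifier):
        # Reuse the existing token so the same identifier always maps to the same value.
        token = self._vault.get(identifier)
        if token is None:
            token = self._new_token()
            self._vault[identifier] = token
        return token

    def _new_token(self):
        # Draw random tokens until we find one that has not been issued yet.
        while True:
            candidate = "".join(
                secrets.choice(ALPHABET) for _ in range(self._length)
            )
            if candidate not in self._issued:
                self._issued.add(candidate)
                return candidate


tokeniser = Tokeniser()
transactions = [("Jason", 12.50), ("Jane", 3.20), ("Jason", 7.00)]
tokenised = [(tokeniser.tokenise(name), amount) for name, amount in transactions]
```

Both of Jason's transactions carry the same token, so they can still be grouped or joined, while the token itself is random and says nothing about the name behind it.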
When tokenising, be sure to generate each token randomly. It’s a mistake to derive tokens with a deterministic function that anyone can recompute, as was the case with a set of New York taxi data that was released recently under a freedom of information request. The taxi identifiers, called medallion numbers, were hashed with the MD5 hash function. At first glance this seemed like a good approach, since each identifier was converted into a random-looking string. But because the range of valid medallion numbers is fairly small, it’s easy and fast to hash every possible value, build a dictionary mapping each hash back to its medallion number, and then deanonymise the data by looking up every token in the dataset. If each token is chosen randomly, there is nothing to recompute: without access to the stored mapping between identifiers and tokens, it’s impossible to build such a dictionary.
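To see why hashing a small identifier space offers so little protection, consider the sketch below. The identifier format (digit, letter, digit, digit) and the sample values are assumptions chosen for illustration, not the actual taxi dataset. The point is only that every candidate can be hashed in well under a second.

```python
import hashlib
import string
from itertools import product


def md5_hex(value):
    """Return the hex MD5 digest of an ASCII string."""
    return hashlib.md5(value.encode("ascii")).hexdigest()


def candidate_ids():
    # Hypothetical identifier format: digit, uppercase letter, digit, digit.
    # That gives only 10 * 26 * 10 * 10 = 26,000 possible values.
    for d1, letter, d2, d3 in product(
        string.digits, string.ascii_uppercase, string.digits, string.digits
    ):
        yield f"{d1}{letter}{d2}{d3}"


# Build the dictionary: hash -> original identifier.
lookup = {md5_hex(candidate): candidate for candidate in candidate_ids()}

# "Anonymised" tokens as they might appear in a released dataset (made-up values).
released_tokens = [md5_hex("7B42"), md5_hex("3K09")]

# Reverse every token with a simple dictionary lookup.
recovered = [lookup[token] for token in released_tokens]
```

A randomly generated token has no such weakness, because no function exists that an attacker could run over the candidate identifiers to reproduce it.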