Here you can find a short video interview with Dr. Pierre-Andre Maugis, Research Scientist at Privitar, in which he talks about some of the key challenges connected to hashing as a privacy technique.
Why do people use hashing?
Many people in the industry suggest hashing as a reliable privacy solution. There are three main reasons for this:
- Firstly, a hash function turns identifying information into gibberish (technically called a hash) in a way that is supposedly irreversible; that is, you cannot reconstruct the original identifying information from the hash.
- Secondly, hashing is a tool commonly used by data scientists, IT staff, and software engineers.
- Thirdly, it is associated with strong security because it is used in applications like password security.
So, how can hashing still leave data exposed?
The trick is that even if you can’t reconstruct the identifying information from the hash, what you can do is build a dictionary. Building this dictionary is a two-step process: first you make a list of all possible identifying items; then you apply the hash function to every item in the list and record the corresponding hashes. The result is a dictionary: give me a hash, and I can look it up in my list and tell you the identifying information it came from.
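The two steps above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the identifiers are hypothetical 4-digit codes, the hash function is assumed to be unsalted SHA-256, and the helper names (`sha256_hex`, `build_dictionary`) are made up for this example.

```python
import hashlib


def sha256_hex(value: str) -> str:
    """Hash an identifier with plain, unsalted SHA-256 (assumed for this sketch)."""
    return hashlib.sha256(value.encode("utf-8")).hexdigest()


def build_dictionary(candidates):
    """Step 2: hash every candidate and map each hash back to its identifier."""
    return {sha256_hex(candidate): candidate for candidate in candidates}


# Step 1: list every possible identifying item.
# Hypothetical identifier space: all 4-digit codes, "0000" to "9999".
candidates = [f"{n:04d}" for n in range(10_000)]
dictionary = build_dictionary(candidates)

# A hash found in a "hashed" dataset can now simply be looked up.
observed_hash = sha256_hex("4821")
recovered = dictionary.get(observed_hash)
print(recovered)
```

The hash function is never inverted here; the lookup works only because the space of possible identifiers is small enough to enumerate.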
There are two obstacles to building a dictionary. First, how do I know which hash function to use? Second, how long does it take? For the first, the fact is that there are only a few good secure hash functions available, so one can simply try them all. Sometimes it is even possible to guess the hash function being used from the length and structure of the hashes.
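As a rough sketch of "try them all", the snippet below narrows the candidates by digest length and then confirms the guess by hashing a value whose hash is already known. It assumes the attacker knows at least one identifier that appears in the dataset (for example their own record), which the text above does not state. The candidate list is illustrative and far from exhaustive, and `identify_hash_function` is a hypothetical helper.

```python
import hashlib

# Hex digest length for a few common hash functions (illustrative, not exhaustive).
HEX_LENGTH_BY_ALGORITHM = {
    "md5": 32,
    "sha1": 40,
    "sha256": 64,
    "sha512": 128,
}


def identify_hash_function(known_value: str, observed_hash: str):
    """Return the name of the first candidate that reproduces observed_hash, or None."""
    data = known_value.encode("utf-8")
    observed = observed_hash.lower()
    for name, hex_length in HEX_LENGTH_BY_ALGORITHM.items():
        # The length of the hash rules out most candidates immediately.
        if len(observed) != hex_length:
            continue
        # Confirm the remaining candidate by hashing the known value.
        if hashlib.new(name, data).hexdigest() == observed:
            return name
    return None


# Hypothetical known pair: an identifier and the hash published for it.
sample_hash = hashlib.sha256(b"alice@example.com").hexdigest()
print(identify_hash_function("alice@example.com", sample_hash))
```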
As for the time it takes: the larger the range of values the identifying information can take, the longer it takes to build the dictionary. However, hash functions are fast, and unless there are at least billions of possible values, building the dictionary will not take an intractable amount of time.
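To get a feel for the time involved, one can measure how many hashes a machine computes per second and extrapolate to the size of the identifier space. The sketch below does this for SHA-256 on a single core. The identifier spaces listed are hypothetical examples, and the result depends entirely on the hardware, so no figures are quoted here.

```python
import hashlib
import time


def measure_hashes_per_second(sample_size: int = 200_000) -> float:
    """Time sample_size SHA-256 computations and return the observed rate."""
    start = time.perf_counter()
    for n in range(sample_size):
        hashlib.sha256(str(n).encode("ascii")).digest()
    elapsed = time.perf_counter() - start
    return sample_size / elapsed


def estimated_seconds(space_size: int, hashes_per_second: float) -> float:
    """Estimate the time to hash every value in an identifier space of the given size."""
    return space_size / hashes_per_second


rate = measure_hashes_per_second()

# Hypothetical identifier spaces, chosen only to illustrate the scaling.
spaces = {
    "6-digit codes": 10**6,
    "9-digit numbers": 10**9,
    "12-digit numbers": 10**12,
}
for label, size in spaces.items():
    print(f"{label}: about {estimated_seconds(size, rate):,.0f} seconds on this machine")
```

The estimate grows linearly with the number of possible values, and it describes an unoptimised, single-threaded loop; a dedicated attacker would likely do considerably better.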
What should people be doing instead?
All good solutions require a secret, which consists of either a key, a salt, or a token vault (a map of identifiers to random values), and this secret must itself be protected by strong security. As there is no way to avoid relying on a secret, encryption or tokenisation is what we recommend: either encrypting the identifying information itself, or keeping a secret list in which the identifying information is mapped to tokens. Doing this will give you better privacy and the same utility, if not more, than simple hashing.
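A token vault of the kind described above can be sketched as follows. This is a minimal in-memory illustration, not a production design: the class name `TokenVault` and its methods are hypothetical, and a real vault would need persistent storage and access controls, since the vault itself is the secret.

```python
import secrets


class TokenVault:
    """Hypothetical in-memory vault mapping identifiers to random tokens."""

    def __init__(self):
        self._token_by_identifier = {}
        self._identifier_by_token = {}

    def tokenise(self, identifier: str) -> str:
        """Return the token for an identifier, creating a random one on first use."""
        token = self._token_by_identifier.get(identifier)
        if token is None:
            # 16 random bytes (32 hex characters): the token carries no
            # information about the identifier, so no dictionary can be built.
            token = secrets.token_hex(16)
            self._token_by_identifier[identifier] = token
            self._identifier_by_token[token] = identifier
        return token

    def detokenise(self, token: str) -> str:
        """Map a token back to its identifier; only possible with access to the vault."""
        return self._identifier_by_token[token]


vault = TokenVault()
token = vault.tokenise("alice@example.com")
print(token)
print(vault.detokenise(token))
```

Because the same identifier always receives the same token, records can still be joined and counted as they could with hashes, which is where the comparable utility comes from.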