by Vaughn Micciche, Privacy Engineer at Privitar
What is Tokenization?
Tokenization is a form of fine-grained data protection that replaces a clear value with a randomly generated synthetic value, which stands in for the original as a ‘token.’ The pattern for the tokenized value is configurable and can retain the same format as the original, which means fewer downstream application changes, easier data sharing, and more meaningful testing and development with the protected data.
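To make the idea of a format-retaining token concrete, here is a minimal Python sketch. The function name `random_token_like` and its character-class rules are assumptions made for illustration, not any product's API. It only generates a random stand-in value; a real tokenization system would also have to record or derive the mapping so the same cleartext always yields the same token.

```python
import secrets
import string


def random_token_like(value: str) -> str:
    """Return a random string with the same length and character classes as `value`.

    Hypothetical helper: digits become random digits, ASCII letters become
    random letters of the same case, and everything else (separators such as
    '-' or ' ') is kept so the overall format survives.
    """
    out = []
    for ch in value:
        if ch in string.digits:
            out.append(secrets.choice(string.digits))
        elif ch in string.ascii_uppercase:
            out.append(secrets.choice(string.ascii_uppercase))
        elif ch in string.ascii_lowercase:
            out.append(secrets.choice(string.ascii_lowercase))
        else:
            out.append(ch)
    return "".join(out)


card_like = "4111-1111-1111-1111"
token = random_token_like(card_like)
print(token)  # random digits in the same 4-4-4-4 layout
```

Because the token keeps the length, separators, and character classes of the original, a downstream column or validation rule that expects that shape can usually accept it unchanged.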
Putting Tokenization Into Practice
The first step in understanding tokenization is to accept that, in the modern data landscape, outsiders will keep finding more sophisticated and novel ways to gain access to and control of your data, and that governments will continue to hold data owners accountable for the harm data loss causes to the public. The second step is to accept that moats and walls are good deterrents but become ineffective as the boundary of an enterprise grows more fragmented with the adoption of SaaS, cloud, and third-party data processing. One solution is to protect the data itself using fine-grained data protection, which simply means changing a cleartext value like ‘Bob’ into ‘XYZ’ and then storing, sending, and processing ‘XYZ’ instead of the sensitive cleartext ‘Bob’.
How do we get to ‘XYZ’ from ‘Bob’, you ask? Two forms of fine-grained protection are considered secure enough for enterprise use: tokenization and Format Preserving Encryption (FPE). For operational use, FPE and some forms of vaultless tokenization are a good choice because they are fast at scale and can always unprotect any value back to its cleartext, which means that even legal archives can be protected. A vaulted tokenization solution has unique merit when analytics, privacy-preserving data sets, or individual token lifecycle management are required.
Let’s take a step back and fill in some technical details. The privacy industry has struggled to adopt consistent definitions across the fine-grained protection methods. Tokenization has been used to protect data for a long time, but many forms of it exist today. Often the term simply describes one value replacing another in a data set, independent of the means by which that replacement was derived. But as we will discuss below, the means are everything, and they dictate much about the eventual usability and security of tokenized data.
Tokenization requires some form of random mapping between the original value and the resulting protected value. Because of this randomness, it is a more secure way of protecting data than format preserving encryption, a view reinforced by regulations such as PCI DSS, which do not require sensitive credit card data to be re-protected if it was tokenized but do require that rotation if the data was encrypted.
Approaches to Tokenization
A vaulted approach to tokenization is the traditional approach, and it involves persisting the cleartext values mapped to their random tokens within a vault. This form is the most secure way to tokenize data because it enables ongoing management of individual tokens and cleartext values (token lifecycle management). It also enables greater security by allowing the domain of possible tokens to grow far beyond the domain of possible cleartext values. Put another way, if a three-digit number were tokenized with a different three-digit number, there would be only 1,000 possibilities, and someone who gained access to all the tokens could infer that total, which may diminish the privacy and security of the information within the underlying data. Vaulted tokenization averts this by adding any number of extra digits: with another three digits, the thousand original values would be randomly dispersed among a million potential token values. The same concept applies to alphabetic and alphanumeric values.
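The arithmetic behind that example can be checked with a few lines of Python. The helper `token_space` is a hypothetical name used only for this illustration; it counts how many distinct values fit in a given alphabet and length.

```python
def token_space(alphabet_size: int, length: int) -> int:
    """Number of distinct strings of `length` symbols over an alphabet of `alphabet_size`."""
    return alphabet_size ** length


cleartext_values = token_space(10, 3)    # every three-digit value, 000 to 999
same_width_tokens = token_space(10, 3)   # tokens that are also three digits
wider_tokens = token_space(10, 6)        # tokens with three extra digits

# Fraction of the token domain actually occupied once every cleartext is tokenized.
occupancy_same_width = cleartext_values / same_width_tokens
occupancy_wider = cleartext_values / wider_tokens

print(cleartext_values, same_width_tokens, wider_tokens)
print(occupancy_same_width, occupancy_wider)

# The same calculation for alphanumeric values (26 letters plus 10 digits).
print(token_space(36, 3), token_space(36, 6))
```

With three-digit tokens the token domain is fully occupied, while with six-digit tokens only one in a thousand possible tokens is ever in use, so seeing the issued tokens reveals far less about the set of underlying values.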
With vaulted tokenization, random values are generated and then looked up in the vault to ensure they have not previously been mapped to cleartext values. Similarly, when it is time to unprotect, a lookup is performed and the cleartext is returned. This method of protection has been used for years to protect credit card data and is now being applied to other types of sensitive data. Vaulted solutions do have some drawbacks when operational use cases demand high throughput and the number of tokens in use is nearing the upper bound of potential token values. In these instances the random value generator may need multiple tries to produce an unused token, which adds time to the task. Also, as token vaults grow in size their response times may degrade, although with modern cloud datastores this is becoming less of a concern.
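The generate, look up, and retry cycle described above can be sketched in Python as follows. This is a toy in-memory illustration under stated assumptions, not a production design: the class name `TokenVault`, its parameters, and the digits-only token alphabet are all hypothetical, and a real vault would sit on a secured, persistent datastore.

```python
import secrets


class TokenVault:
    """Toy in-memory vault mapping cleartext values to random digit tokens."""

    def __init__(self, token_length: int = 6, max_attempts: int = 1000):
        self._token_by_clear = {}
        self._clear_by_token = {}
        self.token_length = token_length
        self.max_attempts = max_attempts
        self.attempts = 0  # total random draws, to make retries visible

    def _random_token(self) -> str:
        return "".join(secrets.choice("0123456789") for _ in range(self.token_length))

    def protect(self, cleartext: str) -> str:
        # Reuse the existing token so the same cleartext always maps to one token.
        if cleartext in self._token_by_clear:
            return self._token_by_clear[cleartext]
        for _ in range(self.max_attempts):
            self.attempts += 1
            candidate = self._random_token()
            # Lookup: accept the candidate only if it is not already in use.
            if candidate not in self._clear_by_token:
                self._token_by_clear[cleartext] = candidate
                self._clear_by_token[candidate] = cleartext
                return candidate
        raise RuntimeError("no unused token found; the token domain may be exhausted")

    def unprotect(self, token: str) -> str:
        # Lookup in the other direction returns the cleartext.
        return self._clear_by_token[token]


vault = TokenVault(token_length=6)
token = vault.protect("123")
print(token, vault.unprotect(token))
```

Shrinking `token_length` so that the token domain is barely larger than the number of stored values shows the drawback mentioned above: `attempts` climbs well past the number of values protected, because most random candidates are already taken.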
To mitigate the potential performance impact of a vaulted solution, some companies choose to give up token lifecycle management and the right to be forgotten in exchange for faster operational protection with something like Format Preserving Encryption. Often these companies will still use a vault for analytic environments, to comply with the growing number of privacy regulations that stipulate ‘forgetting’ cleartext values for analytic purposes. Vaulted tokenization makes this possible while keeping the data useful for matching, aggregating, and filtering by token values rather than cleartext.
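As a rough illustration of ‘forgetting’ while keeping analytic value, the Python sketch below uses two plain dictionaries as a stand-in for a vault. The function names `tokenize` and `forget`, the hex token format, and the sample purchase data are all assumptions invented for this example.

```python
import secrets
from collections import Counter

# Stand-in vault: two dictionaries kept in sync (hypothetical, in-memory only).
clear_to_token = {}
token_to_clear = {}


def tokenize(cleartext: str) -> str:
    """Return the existing token for `cleartext`, or mint a new unused one."""
    if cleartext not in clear_to_token:
        token = secrets.token_hex(8)
        while token in token_to_clear:  # retry on the (unlikely) collision
            token = secrets.token_hex(8)
        clear_to_token[cleartext] = token
        token_to_clear[token] = cleartext
    return clear_to_token[cleartext]


def forget(cleartext: str) -> None:
    """Delete the vault entry so the token can no longer be mapped back."""
    token = clear_to_token.pop(cleartext, None)
    if token is not None:
        token_to_clear.pop(token, None)


# Made-up analytic data set, protected before it reaches the analytics environment.
purchases = [("alice@example.com", 30), ("bob@example.com", 12), ("alice@example.com", 8)]
protected = [(tokenize(email), amount) for email, amount in purchases]

alice_token = tokenize("alice@example.com")
forget("alice@example.com")

# Aggregation by token still works after the cleartext has been forgotten.
totals = Counter()
for token, amount in protected:
    totals[token] += amount

print(totals[alice_token])            # total spend for the forgotten individual's token
print(alice_token in token_to_clear)  # the token no longer resolves to cleartext
```

The protected rows are untouched by `forget`, so matching, filtering, and aggregating on the token column continue to work; only the path back to the cleartext is removed.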
In closing, we’ve introduced the notion of fine-grained protection, explained why it matters, and discussed two common forms, FPE and tokenization. We’ve also outlined why analytic and privacy use cases employ a vaulted tokenization approach, whereas FPE or vaultless tokenization might be a good fit for operational and archival use cases, and discussed how some companies are doing this today.
Want to learn more about how data tokenization and other forms of de-identification can help you keep your data safe and usable? Check out Privitar’s Complete Guide to Data De-Identification.