by David Bernstein, Data Privacy Engineer at Privitar

What is PII (Personally Identifiable Information)? It’s more complicated than you think. Most of us have heard the acronym, and if asked what it stands for we would say Personally Identifiable Information. But go beyond the acronym and ask five different people what it actually covers, and you might get five different answers.

Why the confusion about Personally Identifiable Information?

There are a number of factors that contribute to the confusion around this topic. In some cases, multiple definitions exist because different countries have their own terms. In the European Union, for example, such data is referred to as ‘Personal Data’, and the two terms are often used interchangeably. Throw in other terms such as personal information, private information, individually identifiable information, and protected health information, and it quickly becomes unclear how they all differ from each other.

In order for a country to have laws protecting PII, there must be a legal definition of what PII is. That definition varies from country to country, and the differences can be very granular. Even different states in the United States are coming up with their own legal terms. Take a look at Personal and Private Information under CCPA, for example. And earlier this month, California passed changes to the CCPA, Proposition 24, also known as the California Privacy Rights Act of 2020, or CPRA (learn more about what that means for businesses here).

While I stated above that PII and Personal Data are often used interchangeably, I didn’t say that it was correct… at least not according to GDPR.

There are subtle differences in the context of the General Data Protection Regulation (GDPR). To learn more about those differences, check out this detailed post on Tech GDPR.

Defining Personally Identifiable Information

In the United States, the U.S. General Services Administration defines PII as:

“…information that can be used to distinguish or trace an individual’s identity, either alone or when combined with other personal or identifying information that is linked or linkable to a specific individual.” 

Got it? Neither do I… the definition on its own just isn’t enough to work with.

My goal in this post is to impart a general understanding of what PII is. The most important skill is being able to examine data, look at a field, and determine “that is a sensitive field likely containing PII.” If you’re sitting across the table from a lawyer, this might be a different discussion.

For our purposes, for business purposes, the first things that come to mind when we think of PII are direct identifiers, also called unique identifiers. For example, your Social Security number (or international equivalent, such as a TaxID) is a unique identifier – meaning that if I know yours, then I know who you are. Within a company, an EmployeeID plays the same role. In healthcare, a PatientID is similar – it is linked to one single person. All of these examples are direct identifiers, and they are most definitely considered PII.

Strong identifiers are the next type of identifier: they are close to being unique. Take your name, for example. There is likely another person on this earth with the same name as you, but it is still a very strong identifier. Your name (first and last), phone number, address, and email address all fall into this category and are treated as PII. Lately, with contact tracing applications for COVID-19, MAC addresses (think of a MAC address as a discoverable identifier attached to your computer or mobile device) are also being grouped under the umbrella of PII as a strong identifier.
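To make the “look at a field” step concrete, here is a minimal sketch in Python that flags columns by name. The keyword lists and the function names (normalize, classify_column) are my own assumptions for illustration, not a standard; a real inventory would be agreed with your privacy and legal teams, and matching on column names alone will miss fields with unhelpful names.

```python
# Hypothetical keyword lists based on the examples in this post.
DIRECT = {"ssn", "socialsecuritynumber", "taxid", "employeeid", "patientid"}
STRONG = {
    "name", "firstname", "lastname", "phone", "phonenumber",
    "address", "email", "emailaddress", "macaddress",
}


def normalize(column):
    """Lower-case a column name and drop separators, so 'Patient_ID' becomes 'patientid'."""
    return "".join(ch for ch in column.lower() if ch.isalnum())


def classify_column(column):
    """Return 'direct', 'strong', or 'unclassified' for a column name."""
    key = normalize(column)
    if key in DIRECT:
        return "direct"
    if key in STRONG:
        return "strong"
    # Not on either list: this does NOT mean the field is safe,
    # only that a person still needs to inspect it.
    return "unclassified"


for column in ["Patient_ID", "Email Address", "ZIP"]:
    print(column, "->", classify_column(column))
```

Note that “unclassified” is deliberately not called “safe”: a name-based check like this is only a first pass over the schema.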

So that’s it, that’s PII?

Social security numbers, names, email addresses, phone numbers, mailing addresses… that’s what PII is? Not quite….

I recently attended an online conference session on machine learning in which the presenter remarked that no PII was used. I beg to differ! We also need to consider indirect or quasi identifiers as PII. Sometimes those terms are used interchangeably, but there are separate definitions that distinguish the two. For today’s purposes, however, we will lump them together.

Think of ‘Indirect’ or ‘Quasi’ identifiers as pieces of information that on their own are not identifying, yet in combination, can be identifying.

Let’s say I have shared the city I live in. For me, that’s around three million people. If you add the ZIP code, you’ve reduced your data set to 90,000 people. Am I identifiable? Of course not.

Now let’s add in my birth date, including the year. I’m excited when I meet someone with the same birthday as myself. The chances now of me being identified are around one in 3.5… and if we add gender to the equation, the number of people I could be is cut in half, to fewer than two. Am I identifiable now? Yes!
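The arithmetic behind those numbers can be sketched in a few lines of Python. The population figures come from the example above; the 70-year span of birth years, the even spread of birth dates, and the even gender split are my assumptions, chosen because they reproduce the figure of roughly 3.5.

```python
def expected_matches(population, distinct_values):
    """Average number of people sharing one value, assuming values are spread evenly."""
    return population / distinct_values


zip_population = 90_000   # from the example above
birth_years = 70          # assumption: residents' birth years span about 70 years
distinct_birth_dates = 365 * birth_years   # ignoring leap days

after_birth_date = expected_matches(zip_population, distinct_birth_dates)
after_gender = expected_matches(after_birth_date, 2)   # assumption: roughly even split

print(round(after_birth_date, 2))   # 3.52
print(round(after_gender, 2))       # 1.76
```

Real populations are not spread evenly, so some people will share their combination with several others while many will be completely unique.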

Remember what the acronym PII means: personally identifiable information. It’s not just looking at each piece of information individually, but how they could be combined to make a person identifiable. That’s where it really gets tricky. So, back to my remark on the machine learning presentation… it turns out that the data used in the demo was full of indirect identifiers!

For further information on the combinations of indirect identifiers and how they can be combined to discover a person’s identity, see this blog post explaining k-anonymity.
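As a rough illustration of the idea behind k-anonymity, the sketch below counts how many records share each combination of indirect identifiers and reports the size of the smallest group (the “k”). The function name smallest_group and the four sample records are made up for this example; it assumes the record list is not empty.

```python
from collections import Counter


def smallest_group(records, quasi_identifiers):
    """Size of the smallest group of records sharing the same quasi-identifier values."""
    groups = Counter(
        tuple(record[field] for field in quasi_identifiers) for record in records
    )
    return min(groups.values())


# Made-up sample data, for illustration only.
records = [
    {"zip": "12345", "birth_date": "1980-04-12", "gender": "F"},
    {"zip": "12345", "birth_date": "1980-04-12", "gender": "M"},
    {"zip": "12345", "birth_date": "1975-09-30", "gender": "F"},
    {"zip": "12345", "birth_date": "1975-09-30", "gender": "F"},
]

print(smallest_group(records, ["zip"]))                          # 4
print(smallest_group(records, ["zip", "birth_date"]))            # 2
print(smallest_group(records, ["zip", "birth_date", "gender"]))  # 1
```

A result of 1 means at least one person is unique on that combination of fields, even though no single field identifies them.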

What’s next?

Now that we can identify what information can constitute PII – what’s next? Most businesses have vast quantities of data, which can help them gain insights into their customers and develop and deliver new solutions. But you have to take data privacy into account before you can use that data. So the next step is to decide which privacy technique to use to de-identify the data that we have determined is PII.

You will hear terms such as masking, redaction, and tokenization. These are among a number of techniques that can be used to protect PII.
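To give a feel for how these three differ, here is a minimal Python sketch. The function names and behavior are my own simplifications, and vendors use these terms in different ways. In particular, the tokenize function below uses a keyed hash (HMAC) as a stand-in; many products instead keep a lookup table that maps tokens back to the original values. The key shown is a placeholder, not a recommendation.

```python
import hashlib
import hmac


def redact(value):
    """Redaction: drop the value entirely and leave a placeholder."""
    return "[REDACTED]"


def mask(value, visible=4, fill="*"):
    """Masking: hide everything except the last `visible` characters."""
    hidden = min(max(len(value) - visible, 0), len(value))
    return fill * hidden + value[hidden:]


def tokenize(value, key):
    """Tokenization (keyed-hash stand-in): the same input always yields the same token."""
    digest = hmac.new(key, value.encode("utf-8"), hashlib.sha256).hexdigest()
    return "tok_" + digest[:16]


secret_key = b"replace-with-a-managed-secret"   # placeholder key for the example

print(redact("123-45-6789"))
print(mask("123-45-6789"))      # *******6789
print(tokenize("123-45-6789", secret_key))
```

The practical difference: redaction destroys the value, masking keeps part of it readable, and a consistent token lets you still join or count records without exposing the original.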

But always remember, the most important part is determining what needs to be protected, and often this step is done hastily without the proper inspection. I hope this helps in your journey to promote safe, de-identified data for analytics, or whatever your use case may be.
