By Paul Preuveneers, Global VP of Sales Engineering at Privitar

Most organizations know how important it is to protect data for analytics, but knowing the importance doesn’t mean that actually doing it is easy. There is no silver bullet for making data safe, nor is there one single approach that provides all the desired outcomes for all data types in all use cases. Safeguarding data is a contextual challenge, so different approaches are needed depending on the context: the data, the use case, and the user.

So how can an organization build out safe data pipelines, decide which methods of protecting data are right for their organization, and get started or advance in their use of safe data for analytics and insights?

Making Data Safe For Analytics

How do you make data safe for analytics? This is one of the most common questions I hear, and at first glance it seems quite simple, but the answer is actually quite complex.

First of all, you have to make sure you understand the big challenge. Data analysts need to be able to receive and use data sets so that they can work their “analyst magic” on them and return with results that their businesses can use to make decisions.

For example, if you are working in healthcare and you need to analyze data to find potential cancer patients, you’d likely be looking at information on the patients, their procedures, complaints, and other sensitive, personal information. But providing raw data and allowing analysts direct access to personally identifiable information in that data set would be a serious breach of ethics and regulations.

So how can we give the analysts the data they need without compromising the privacy of the subjects of that data? That’s where data privacy comes in. 

With data privacy, we can make sure that the only things that you can see are, in fact, the things that you absolutely NEED for your job, for a particular analytical scenario. In the example mentioned above, you don’t necessarily need to see full patient names, as that is unlikely to affect what you’re trying to do.

We need to create a set of data that has the information you require and nothing more. Everything else is removed, or blurred and transformed into a protected form, so that the data is safe and its use stays within regulations.

We’ve got to walk a fine line between utility and privacy: keeping enough information in the data for it to be useful for analysis, while removing the sensitive parts that expose it to risk.

But the trick is what do you keep? What do you blur or remove or redact from each data set to make sure that you can still do your job, as a data analyst? 

Keeping Data Safe and Flexible for All Use Cases

There are many, many different methods of blurring or transforming data to prevent different types of attack or re-identification, from simple data masking to k-anonymization, which defends against linkage attacks. Each use case is different, and the best method changes with it.

Using the right method for a use case is critical to making sure you get the best results from the protected datasets you are going to share with your analyst.

Let’s go back to the example of the set of healthcare data. The problem is that if I gave you the raw unprotected data, you could identify actual people and know things about them that you shouldn’t. Some people may decide to use that information against them. We need to protect against that, so we would “de-identify” that data set – remove the personal identifiers from it.

There are different types of identifiers, including direct identifiers (e.g. a social security number or passport number) which are unique to an individual and directly identify them.

But there are also other types of information that may not directly identify you, but if you combine a few pieces of information, might start painting a picture of who you might be, which can then be narrowed down enough to identify a person. For example, if you had my first name, birthday, cars I drive, postal code, and the school I went to, suddenly you have narrowed down the number of people I could be. None of these pieces of data directly identify me, as there are others with my birthday and same first name and so on, but they help indirectly identify me when put together. 

We need to manage both of these types of personally identifiable information (PII) and protect against re-identification.

If we can’t simply remove the data from the provided dataset because an analyst needs it for their work, we can blur the data in different ways. We could:

  • Add a random number of days to a birthday
  • Aggregate people’s ages into decades
  • Change Social Security numbers (SSNs) or passport numbers into fake but unique values that keep the same format and can be changed back later if needed
  • Clip names to initials, mask them entirely, or replace them with a lookup list of other names
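The techniques in the list above can be sketched in a few lines of Python. This is a minimal illustration with hypothetical helper names, not Privitar’s actual implementation; a production system would, for example, use a keyed, reversible token vault rather than a simple hash for SSN tokenization.

```python
import hashlib
import random
from datetime import date, timedelta

def shift_birthday(birthday: date, max_days: int = 30) -> date:
    """Add a random number of days to blur an exact birth date."""
    return birthday + timedelta(days=random.randint(1, max_days))

def age_to_decade(age: int) -> str:
    """Aggregate an exact age into a decade bucket, e.g. 37 -> '30-39'."""
    low = (age // 10) * 10
    return f"{low}-{low + 9}"

def tokenize_ssn(ssn: str, secret: str = "demo-key") -> str:
    """Replace an SSN with a consistent fake value of the same shape.
    Illustrative only: a real system would use a reversible token vault."""
    digest = hashlib.sha256((secret + ssn).encode()).hexdigest()
    digits = "".join(str(int(c, 16) % 10) for c in digest[:9])
    return f"{digits[:3]}-{digits[3:5]}-{digits[5:]}"

def clip_name(full_name: str) -> str:
    """Clip a full name down to initials."""
    return ".".join(part[0].upper() for part in full_name.split()) + "."
```

Note that tokenization is deterministic (the same SSN always maps to the same token), which preserves joins across data sets, while the birthday shift is random, trading consistency for stronger blurring.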

And that’s just the simple stuff – the direct identifiers! 

For the indirect identifiers, we need protection that stops someone from taking those three or four pieces of data, filtering them through the vast data sets that are out there (e.g. LinkedIn and Facebook profiles), and re-identifying you. We call this linkage attack protection, or k-anonymization.
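The core idea of k-anonymization is that every combination of indirect (quasi-) identifier values must be shared by at least k records, so no individual stands out. As a rough sketch (a hypothetical checker, not a full anonymization algorithm):

```python
from collections import Counter

def is_k_anonymous(rows, quasi_identifiers, k):
    """Return True if every combination of quasi-identifier values
    appears in at least k rows, so no single record stands out."""
    groups = Counter(tuple(row[q] for q in quasi_identifiers) for row in rows)
    return all(count >= k for count in groups.values())

records = [
    {"age_band": "30-39", "postcode": "SW1", "diagnosis": "flu"},
    {"age_band": "30-39", "postcode": "SW1", "diagnosis": "asthma"},
    {"age_band": "40-49", "postcode": "SW1", "diagnosis": "flu"},
]

# The ("40-49", "SW1") group contains only one record, so this set
# is 1-anonymous but not 2-anonymous: that one person could be linked.
```

A real anonymization tool would go further, generalizing or suppressing values until the check passes, but the check itself is what defines the guarantee.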

So how to decide which are the right methods of data protection to use? And who should decide? Enter the data guardian.

The role of data guardian and importance of creating policies

In many companies, there is at least one person that is tasked with finding the right protection methods to allow the data to be used safely for analytics. We call that person the “data guardian.”

Data guardians need to determine the most appropriate sets of rules for individual use cases as efficiently as possible, to protect the identities within the data as well as the company. In addition, they need to understand the relevant regulations that apply in their industries and locations, to make sure that their company complies effectively.

They create policies: sets of rules tied to specific use cases and regulations, saved for future use so they become easily reusable assets that can simply be applied to similar data structures later. Policies help you become as compliant as possible, as quickly as possible. Reusing policies allows even advanced data protection techniques like k-anonymization and linkage attack protection to be applied both appropriately and quickly.
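Conceptually, a policy of this kind is just a named, reusable mapping from fields to masking rules that can be applied to any data set with a similar structure. A minimal sketch, with hypothetical rule and field names (not Privitar’s policy format):

```python
# Masking rules a data guardian might choose from.
RULES = {
    "redact":   lambda v: "***",                                        # remove entirely
    "initials": lambda v: ".".join(p[0].upper() for p in v.split()) + ".",  # clip to initials
    "decade":   lambda v: f"{(v // 10) * 10}s",                         # generalize age
}

# A policy: which rule governs which field. Ungoverned fields pass through.
HEALTHCARE_POLICY = {"name": "initials", "age": "decade", "ssn": "redact"}

def apply_policy(record, policy):
    """Mask each governed field in a record; leave other fields untouched."""
    return {f: RULES[policy[f]](v) if f in policy else v
            for f, v in record.items()}

patient = {"name": "Jane Doe", "age": 47, "ssn": "123-45-6789",
           "diagnosis": "asthma"}
```

Because the policy is data rather than code, the same `HEALTHCARE_POLICY` can be applied unchanged to the next data set with the same schema, which is exactly what makes policies reusable assets.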

Remember, this is all done so that you can achieve the maximum value from your data analysis without holding up access to that data unnecessarily. If you’ve got a system that allows the data guardian to find and apply even complicated data policies quickly, everything gets much easier. Data can be leveraged safely, more broadly, more effectively, and more efficiently: a huge win for everyone involved!

To learn more about how data privacy enables safe analytics, check out Privitar’s Safe Analytics Resource Hub. You can also speak with one of our team members to learn more about how Privitar can help you democratize data within your organization.