by Vaughn Micciche, Privacy Engineer at Privitar

The extreme fragmentation of enterprise data processing and analytics platforms is accelerating with the ongoing trends in machine learning and operational AI, which demand additional platforms for data preparation and model training and retention beyond the traditional warehouse or data lake. Data protection and privacy preservation remain just as relevant as they were in the context of a traditional data warehouse, with its measures, dimensions, and normalized models, especially now that many companies are porting those old-school ‘single sources of truth’ to the cloud.

Moving beyond the traditional approach

The traditional approach to data protection, privacy preservation, and security has been to prevent catastrophic loss through vehicles such as transparent disk encryption or volume encryption, coupled with tightly controlled data access policies and robust logging and log analytics. The trouble is that all of these methods treat the data store or the analytic system as the item that needs protecting, when the data itself is the real asset. We’ve focused on the bucket rather than the grains of sand in the bucket, and if we’re not careful, a grain lost here or there will make someone else a nice sand castle.

So where are we going with privacy preservation?

Future-proof privacy preservation and data protection is a state of being in which the enterprise that owns the data remains in control of that data, independent of the parent system, cloud provider, or partner organization. There are many pieces required for such a strategy, and, as it turns out, there are ties back to old-school data warehousing.

Joins, aggregations, group-bys, and filtering are terms just as relevant today as ever, but with the data lake concept and the ‘store it now and sort it out later’ methods enabled by modern data processing platforms, some of the structure of the traditional data warehouse has been lost.

Traditionally, great thought, time, and energy were put into designing table structures that were linkable through keys and that housed normalized meaningful descriptive identifiers, attributes, and transactions. To put it more simply, the customer data was in a customer table, and that was linkable to a transaction table, which brought together the customer and a product, along with information describing the transaction, such as price, quantity, date, time, and maybe a point-in-time product cost to calculate better margin statistics. To talk about these data structures we used terms such as:

  • Measures
  • Dimensions
  • Primary Identifiers

Envisioning the modern, data-centric, protection strategy

The modern, data-centric protection strategy has focused mainly on the direct identifiers, through a strategy of direct replacement using something like tokenization or format-preserving encryption. The thinking with this approach was that something is better than nothing: if the direct identifiers no longer identify an individual, then the measures and dimensional data can be used for operational or analytic purposes with no meaningful impact on the data breach risk profile. Turning Bob into XYZ works if all you have is Bob’s name, but when we add in the dimensional data and the quantitative measures/transactional data, the odds of re-identification increase. With that increase comes greater risk from a data breach, independent of what the cyber insurance underwriters or existing data protection laws have defined.
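As a rough illustration of direct replacement (a sketch, not Privitar’s actual implementation), keyed hashing can stand in for linkable tokenization: the same input always yields the same opaque token, so records stay joinable, but the token reveals nothing about the original value without the key:

```python
import hashlib
import hmac

# Assumption: in practice the key would live in a vault, not in source code.
SECRET_KEY = b"example-key-kept-elsewhere"

def tokenize(value: str) -> str:
    """Deterministically replace a direct identifier with an opaque token."""
    digest = hmac.new(SECRET_KEY, value.encode("utf-8"), hashlib.sha256)
    return digest.hexdigest()[:12]

# "Bob" always maps to the same token, so joins across tables still work,
# while different names map to different tokens.
assert tokenize("Bob") == tokenize("Bob")
assert tokenize("Bob") != tokenize("Alice")
```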

Retaining usability of dimensional data

What’s an enterprise to do then, you ask? How do we adopt a data-centric approach to protection while also retaining the usability of the dimensional data for analytic purposes? The answer: by building a solution that delivers datasets that support the analytic endeavors while retaining the privacy of the underlying customers, patients, or individuals. In practice, this manifests as a protected zone where the data is stored with very tight access controls and the identifying primary and dimensional data are completely de-identified. Consumers or downstream analytic systems can consume the fruits of this protected zone only after the data has been packaged up in such a way that their analytic purpose can be fulfilled while the privacy of the data subjects is preserved.

Our approach to de-identifying data meaningfully

For instance, within the protected zone, the birth date may have been tokenized to a random date, and the ZIP code may have been tokenized to a random number, which completely ruins the utility of the ZIP code: it is no longer joinable and is not numerically near its old neighboring ZIP codes. For a marketing use case, you might want to use a meaningful age and region while also preserving the privacy of the data subjects. But the clear-text ZIP code and DOB represent risk because they can be used to re-identify a data subject. The Privitar Data Privacy Platform can reduce the resolution of these fields, and the risk to the data subjects’ privacy, by letting them “hide in the crowd” (for example, generalizing the birth date from 1977-12-05 to 1977-12-01 and the ZIP code from 10011 to 100XX) while maintaining analytic utility.
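The generalizations just described can be sketched in a few lines. This is an illustrative approximation of the technique, not the platform’s implementation:

```python
from datetime import date

def generalize_dob(dob: date) -> date:
    # Reduce resolution: keep the year and month, snap the day to the 1st.
    return dob.replace(day=1)

def generalize_zip(zip_code: str, keep: int = 3) -> str:
    # Keep the leading digits (a coarser region) and mask the rest.
    return zip_code[:keep] + "X" * (len(zip_code) - keep)

assert generalize_dob(date(1977, 12, 5)) == date(1977, 12, 1)
assert generalize_zip("10011") == "100XX"
```

Both transformations keep the generalized values joinable and analytically meaningful (month of birth, three-digit ZIP region) while discarding the precision that makes re-identification easy.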

Privitar’s policy rules can be used to build regions above ZIP codes that contain enough data subjects so as not to be identifying. Similar capabilities exist to perform bucketing, linkable and unlinkable tokenization, and value shifting to perturb identifying dates, as well as single- and multi-column logic for applying privacy rules under certain circumstances. Used in this manner, between a protected zone and downstream analytic systems, Privitar acts as a privacy bridge between the tightly protected operational datastore and the more free-wheeling analytic consumers.
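Value shifting on dates can be sketched as follows. This assumes a hypothetical per-subject random offset scheme, chosen purely for illustration: the absolute dates are perturbed, while intervals between one subject’s events survive for analysis:

```python
import random
from datetime import date, timedelta

def shift_dates(dates, subject_key, max_days=30):
    # One deterministic offset per data subject keeps the shift consistent
    # across all of that subject's records.
    rng = random.Random(subject_key)
    offset = timedelta(days=rng.randint(-max_days, max_days))
    return [d + offset for d in dates]

admitted, discharged = date(2020, 3, 1), date(2020, 3, 8)
shifted = shift_dates([admitted, discharged], subject_key="subject-101")
# The absolute dates may change, but the 7-day interval is always preserved.
assert (shifted[1] - shifted[0]).days == (discharged - admitted).days
```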

In this way, the Privitar Data Privacy Platform uses its reusable policies to create a privacy-preserving dataset that does not need further transformation or un-protection to be useful to analysts or machine learning algorithms. There is no need to execute web service calls, other function calls, or proxy services to unprotect the dataset ‘in use.’ This means the dataset is portable across clouds and usable by any traditional or cutting-edge analytic system. Privitar’s reusable policies and rules enable the data governance or privacy center of excellence to keep the balance between risk and usability without leaning on the specific data access controls of each and every platform. They can act as a data-delivery function rather than a growth- and innovation-inhibiting policing function that requires long approvals and testing for every new analytic system tightly integrated with the enterprise security functions. This new dimension of data protection and privacy is possible with the Privitar Data Privacy Platform.

Want to learn more about data de-identification? Read Data Privacy 101: Guide to De-Identification