May 04, 2020
By Javier Abascal Carrasco, Engineer at Privitar
Being a data scientist is hard for many reasons, a significant one being the famous 80/20 dilemma. Data scientists and machine learning experts spend about 80% of their time generating, preparing, and labeling data, and only 20% of their time building and training models. Isn’t that crazy? You hire someone for their ability to build complex, sophisticated models, yet they spend hardly any time doing it.
Don’t get me wrong. Obtaining, crunching, and preparing data is part of the job and has huge implications for final model performance. At the end of the day, a learning model is only going to be as good as the data behind it. It is crucial to pay attention to the data preparation stage and make the time spent there as efficient as possible. In the rest of this post, I would like to highlight how privacy relates to the work of a data scientist and how an organization can accelerate the time to realizing the value of its data.
A modeling project tends to start with an objective: what we want to achieve (e.g., predicting something or classifying a subset of a population). Once that is clear, we need to find relevant data sources that will help us reach that goal. In most cases today, data sits in tables across a multitude of data warehouses, sometimes across several distinct environments. In the best-case scenario, you will have a data catalog in place that can be used to identify the data. If not, you must reach out to different teams to understand what is available. Either way, you will end up doing two main activities:
There are a couple of serious consequences for data scientists. First, there is friction in the access request process, which can easily take days, weeks, or even months depending on the sensitivity of the data, the processes currently in place, technology limitations, and cross-departmental approvals. The data scientist will need to justify the access request or even meet with security and privacy staff to gain approval. And if the data will be used in the cloud, there is likely an additional process to ensure it is adequately protected from breach, to minimize risk to the organization.
Second, data scientists get access to sensitive data, including the ability to identify individuals and potentially harm the organization if certain details are disclosed. According to a well-known 2015 Intel/McAfee report, internal actors were responsible for 43% of data loss. Often data scientists don’t need the sensitive columns for their analysis, but they can access them anyway because the sensitive data sits alongside the more useful information.
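One way to reduce that exposure is column-level data minimization: expose only the fields an analysis actually needs. Here is a minimal Python sketch; the record schema and field names are hypothetical, purely for illustration:

```python
# Hypothetical raw records mixing direct identifiers with analytic fields.
records = [
    {"name": "Alice Smith", "ssn": "123-45-6789", "age": 34, "purchases": 12},
    {"name": "Bob Jones", "ssn": "987-65-4321", "age": 29, "purchases": 7},
]

# The columns the model actually uses (an assumption for this example).
ANALYSIS_FIELDS = {"age", "purchases"}

def minimize(record: dict) -> dict:
    # Drop direct identifiers; keep only the analytically useful columns.
    return {k: v for k, v in record.items() if k in ANALYSIS_FIELDS}

safe_records = [minimize(r) for r in records]
```

Provisioning the minimized view instead of the raw table means the sensitive columns never reach the data scientist in the first place.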
So, how can you mitigate these consequences?
Very simple. With data privacy.
De-identifying data using data privacy techniques addresses the friction and risk around using sensitive data, enabling data scientists to minimize the time collating data and allowing them to spend more time running and analyzing models. There are several critical privacy aspects that organizations should aim to achieve when adopting data privacy to better empower data scientists:
When well orchestrated, these points let the security and privacy departments accelerate approval of data access requests, allowing scientists to explore and visualize data faster and without friction.
As a result, people accessing information won’t be working with raw data, reducing the overall risk to the organization. Moreover, watermarking the data deters insider misuse and negligence, since any leak can be traced back to its source and the revealed information has little value outside of the organization.
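To make the watermarking idea concrete, here is a hedged sketch of one possible fingerprinting scheme (an illustration, not Privitar’s actual technique): a secret key plus the recipient’s identity deterministically selects which rows receive a tiny, utility-preserving perturbation, so a leaked copy can be matched back to whoever received it.

```python
import hashlib
import hmac

# Hypothetical org-held key; in practice this would live in a key manager.
SECRET = b"demo-watermark-key"

def watermark(values, recipient):
    """Embed a recipient-specific fingerprint into a numeric column."""
    marked = []
    for i, value in enumerate(values):
        # Derive one bit per row from the key, the recipient, and the row index.
        digest = hmac.new(SECRET, f"{recipient}:{i}".encode(), hashlib.sha256).digest()
        bit = digest[0] & 1
        # The perturbation is far below any analytically meaningful scale.
        marked.append(value + bit * 0.01)
    return marked

copy_for_partner_a = watermark([100.0] * 32, "partner-a")
copy_for_partner_b = watermark([100.0] * 32, "partner-b")
```

Re-deriving the bit pattern for each known recipient and comparing it against a leaked dataset reveals which copy escaped.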
Last but not least, the main reason data scientists are reluctant to work with protected data is that their past experiences involved basic masking techniques that destroyed the utility of the data and hence reduced the performance of the models trained on it. Applying advanced privacy policies instead gives data scientists the ability to join data across tables, preserves the value of categorical variables, and lets them adjust the level of privacy for numerical variables (inserting controlled noise that keeps their statistical value). These policies give them full control and flexibility, significantly reducing the trade-off between model performance and risk mitigation.
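The two policy ingredients mentioned above, consistent tokenization that preserves joins across tables and controlled noise on numeric columns, can be sketched roughly as follows. The key, column names, and noise scale are illustrative assumptions, not a real Privitar policy:

```python
import hashlib
import hmac
import random

# Hypothetical organization-wide tokenization key.
SECRET_KEY = b"org-tokenization-key"

def tokenize(value):
    # Consistent tokenization: the same input always yields the same token,
    # so joins across tables still work, but the raw identifier is hidden.
    return hmac.new(SECRET_KEY, value.encode(), hashlib.sha256).hexdigest()[:16]

def perturb(value, scale, rng):
    # Controlled noise: each value is shifted slightly, but aggregate
    # statistics such as the mean remain approximately intact.
    return value + rng.gauss(0.0, scale)

rng = random.Random(0)  # seeded for reproducibility in this sketch
orders = [("alice@example.com", 120.0), ("bob@example.com", 80.0)]
customers = [("alice@example.com", 34)]

protected_orders = [(tokenize(email), perturb(amount, 5.0, rng))
                    for email, amount in orders]
protected_customers = [(tokenize(email), age) for email, age in customers]
```

Because `tokenize` is deterministic under the shared key, `protected_orders` and `protected_customers` can still be joined on the token, while the raw email never reaches the analyst.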
Using a cutting-edge privacy platform, such as Privitar, allows your organization to reduce the friction and risk of accessing sensitive data sources, enabling you to spend significantly less time organizing and collating data and more time gaining critical insights from the analysis.