By Javier Abascal Carrasco, Engineer at Privitar
Being a data scientist is hard for many reasons, a significant one being the famous 80/20 dilemma. Data scientists and machine learning experts spend about 80% of their time generating, preparing and labeling data and only 20% of their time building and training models! Isn’t it crazy? You hire someone because of their ability to build complex and sophisticated models but they barely spend time doing it.
Don’t get me wrong. Obtaining, crunching and preparing data is part of the job and has huge implications for final model performance. At the end of the day, a model is only going to be as good as the data supporting it. It is crucial to pay attention to the data preparation stage and to make the time spent there as efficient as possible. For the rest of this post, I would like to highlight why privacy is closely tied to the work of a data scientist and how an organization can shorten the time it takes to realize the value of its data.
As a Data Scientist, I Need to Explore and Understand What Is Inside these Tables
A model tends to start with an objective of what we want to achieve (e.g., predicting something or classifying a subset of a population). Once that is clear, we need to find relevant data sources that will help us realize that goal. In most cases today, data sits in tables across a multitude of data warehouses, sometimes across several distinct environments. In the best-case scenario, you will have a data catalog in place that can be used to identify the data. If not, you must reach out to different teams to understand what is available. Either way, you will end up doing two main activities:
- Accessing a multitude of data tables, including ones with highly sensitive information, which will force you to request access.
- Querying tables to look at a few rows and get a sense of what information is stored there.
There are a couple of serious consequences for data scientists. First, there is friction in the access request process, which can easily take days, weeks, or even months depending on the sensitivity of the data, the processes currently in place, technology limitations, and cross-departmental approvals. The data scientist will need to provide a justification for access, or even attend specific meetings with security and privacy staff, in order to gain approval. And if the data will be used in the cloud, there is likely an additional process to ensure the data is adequately protected against a breach, to minimize risk to the organization.
Second, data scientists will get access to sensitive data, including the ability to identify individuals and potentially harm the organization if they disclose certain details. Internal actors were responsible for 43% of data loss, according to a well-known 2015 Intel/McAfee report. Often data scientists don’t need the sensitive columns for their analysis, but they can access them because the sensitive data sits together with the more useful pieces of information.
So, how can you mitigate these consequences?
Very simple. With data privacy.
Data Privacy Helps Democratize Access to Data, Reducing Risk While Keeping Utility
De-identifying data using data privacy techniques addresses the friction and risk around using sensitive data, enabling data scientists to minimize the time spent collating data and to spend more time running and analyzing models. There are several critical capabilities that organizations should aim for when adopting data privacy to better empower data scientists:
- Discovery of sensitive data sources and personally identifiable information across the organization
- Creation of a data catalog, which accelerates and empowers the search for useful information.
- Availability of advanced privacy-enhancing techniques, in the form of rules to be applied, for protecting sensitive records. These allow the creation of privacy policies, mapped to dataset structures, that do not reveal sensitive information.
- Capacity to control the data releases in specific domains and with traceability through the usage of data watermarking.
When well orchestrated, these capabilities let the security and privacy departments approve data access requests faster, allowing scientists to explore and visualize data sooner and with less friction.
As a result, people accessing information won’t be working with raw data, reducing the overall risk to the organization. Moreover, the fact that data is watermarked deters insider misuse and negligence, since the source of a leak can be easily identified and the information revealed won’t have value outside of the organization.
Last but not least, the main reason data scientists are reluctant to work with protected data is that their past experiences involved basic masking techniques that destroyed the utility of the data and hence reduced the performance of the models trained on it. Applying advanced privacy policies gives data scientists the capacity to join data across tables while keeping the value of categorical variables and adjusting the level of privacy they want for numerical variables (inserting controlled noise that preserves their statistical value). These policies give them control and flexibility, significantly reducing the trade-off between model performance and risk mitigation.
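To make the idea of utility-preserving protection more concrete, here is a minimal Python sketch. It is not Privitar's implementation, just an illustration under stated assumptions: identifiers are replaced with a deterministic keyed token (so the same customer gets the same token in every table and joins still work), categorical values are left as they are, and numerical values receive zero-mean Gaussian noise. The names `tokenize`, `add_noise`, `SECRET_KEY` and the sample tables are all hypothetical.

```python
import hashlib
import hmac
import random

# Hypothetical demo key; a real deployment would manage keys securely.
SECRET_KEY = b"demo-key-not-for-production"


def tokenize(value: str, key: bytes = SECRET_KEY) -> str:
    """Replace an identifier with a deterministic keyed token.

    The same input always maps to the same token, so tables that are
    tokenized with the same key can still be joined on the token.
    """
    digest = hmac.new(key, value.encode("utf-8"), hashlib.sha256).hexdigest()
    return digest[:16]


def add_noise(value: float, scale: float, rng: random.Random) -> float:
    """Perturb a numerical value with zero-mean Gaussian noise."""
    return value + rng.gauss(0.0, scale)


# Hypothetical raw tables containing a direct identifier.
customers = [
    {"customer_id": "C001", "segment": "retail"},
    {"customer_id": "C002", "segment": "business"},
]
transactions = [
    {"customer_id": "C001", "amount": 120.0},
    {"customer_id": "C002", "amount": 80.0},
    {"customer_id": "C001", "amount": 45.5},
]

rng = random.Random(42)  # fixed seed so the sketch is repeatable

# Protected copies: identifier tokenized, category kept, amount perturbed.
protected_customers = [
    {"customer_token": tokenize(c["customer_id"]), "segment": c["segment"]}
    for c in customers
]
protected_transactions = [
    {
        "customer_token": tokenize(t["customer_id"]),
        "amount": add_noise(t["amount"], 5.0, rng),
    }
    for t in transactions
]

# The join still works because tokens are consistent across tables.
segment_by_token = {c["customer_token"]: c["segment"] for c in protected_customers}
joined = [
    {"segment": segment_by_token[t["customer_token"]], "amount": t["amount"]}
    for t in protected_transactions
]
```

In this sketch the `scale` argument is the knob that trades privacy against accuracy: a larger value hides individual amounts better but distorts aggregates more, especially on small samples.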
The use of a cutting-edge privacy platform, such as Privitar, will allow your organization to decrease the risk and friction involved in accessing sensitive data sources, enabling you to spend significantly less time organizing and collating data and more time gaining critical insights from the analysis.