For many organizations, social media, chatbots, and SMS messaging have become increasingly popular ways to connect with customers. Not only are they convenient channels for communication, but they also generate large volumes of unstructured conversational data containing valuable information (details about product or service issues, sentiment, feedback, and more), all of which is immensely helpful to the organization.
On the flip side, that data can also contain personally identifiable information—such as name, address or location, organization, phone number, email address, credit card number, date of birth, and so on. That information can reveal an individual’s identity and is subject to regulatory control. As a result, access to this conversational text is often limited or completely restricted in its raw form, and not used for analytics.
What if I told you that didn’t have to be the case? That you didn’t have to leave this information on the shelf or lock it away in a vault?
Rather, with the right controls in place, you can safely leverage this conversational text for analysis and gain access to a whole new world of insights and information.
I’m delighted to say that you now can.
Privitar recently released a new offering, Data Privacy for Chat, which makes that possible.
Responsibly analyzing your unstructured free text data sources must start with protecting sensitive data. De-identification enables you to analyze your data without exposing sensitive, protected information and allows you to provision it to a wider group of analysts while upholding customer privacy and ensuring regulatory compliance. Privitar Data Privacy for Chat makes it easy for you to protect the sensitive data found in your conversation text from chat logs and social media feeds like Twitter or Facebook Messenger.
The journey to Data Privacy for Chat began in 2016 with an innocuous engineering ticket titled “masking in unstructured text.” The ticket underscored what we already knew: data in unstructured text carried significant risk, and no good solution for it existed. Privitar was still in its early days and focused on building out our core product, so we couldn’t commit the resources to investigate further at the time.
Fast forward a year, when I returned from a vacation with an idea to explore: we could use a neural network-based entity tagger to locate sensitive information in free text and apply tokenization rules. The team fleshed out the idea later that year at Privitar’s first hackathon. At the end, we presented the business case and the technical solution. The business case had everyone sold. The technical solution, however, wasn’t there.
State of the art back in 2017 simply wasn’t accurate enough, and the models still needed enormous amounts of annotated data to train. However, a client engagement where we removed commercially identifying information from a dataset drove home the point that this was an important area. That engagement required nesting complicated detection algorithms that were difficult to troubleshoot and support. More research was needed.
There were two problems:
1) Clients and Privitar were hoping for 100% accuracy (or very close to it).
2) State-of-the-art technology at the time would only hit 80% (and on real datasets the accuracy would drop to about 65%).
We continued to research.
Over the next few years several things happened. Most importantly, the technology improved considerably. Word embeddings are at the heart of modern natural language processing. Google brought transformers to embeddings in the form of BERT, and Flair gave the world a framework for easily experimenting with these new context-aware embeddings. I also took on a new role on our research team, where I had the opportunity to work on unstructured text full time and mostly uninterrupted.
We built and polished off our prototype, updated it for the technology available, and took it to clients. Could this be useful for them? It was a resounding yes.
Privitar Data Privacy for Chat recognizes and classifies sensitive data in unstructured free text, then applies policies to de-identify it. When coupled with Privitar’s enterprise-grade structured data privacy platform and industry-leading privacy enhancing technologies, you can take advantage of the same policies and features, tokenize identifiers in unstructured chat logs consistently with the same identifiers in your structured systems, and close the loop on customer interactions (for example, by including chat conversations in your sentiment analysis). We’ve leveraged natural language processing (NLP) techniques and integrated AI and deep learning models to mitigate privacy risks when analyzing conversational chat data.
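To make the idea of consistent tokenization concrete, here is a minimal sketch in Python. It is not how Privitar’s platform implements tokenization; it swaps in a keyed hash (HMAC-SHA256) as a simple stand-in, and the key, the `tokenize` function, and the sample records are all hypothetical. The point it illustrates is that the same identifier yields the same token whether it appears in a structured record or in a chat message, so the two can still be joined after de-identification.

```python
import hashlib
import hmac

# Hypothetical secret; in a real system this would come from a key management service.
SECRET_KEY = b"example-key-not-for-production"


def tokenize(value: str, entity_type: str) -> str:
    """Derive a deterministic token for a value using a keyed hash."""
    # Normalise so that differences in case or surrounding whitespace
    # do not produce different tokens for the same identifier.
    normalised = value.strip().lower()
    message = f"{entity_type}:{normalised}".encode("utf-8")
    digest = hmac.new(SECRET_KEY, message, hashlib.sha256).hexdigest()
    return f"{entity_type}_{digest[:12]}"


# The same email address appears in a structured row and in a chat message.
crm_row = {"customer_email": "jane@example.com", "plan": "premium"}
chat_message = "Hi, my login is Jane@Example.com and the app keeps crashing."

crm_token = tokenize(crm_row["customer_email"], "EMAIL")
chat_token = tokenize("Jane@Example.com", "EMAIL")

protected_row = {**crm_row, "customer_email": crm_token}
protected_chat = chat_message.replace("Jane@Example.com", chat_token)
```

Because both tokens are derived from the same normalised value and key, an analyst could link the protected chat message to the protected customer row without ever seeing the email address.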
Privitar Data Privacy for Chat is a powerful tool in your toolkit if you want to take advantage of the data contained within your conversational text sources, but not compromise the privacy of the data or the data subjects.
We’ve already seen some incredible results. We built the product following successful research engagements with our customers ABN AMRO and Discovery, who came to Privitar seeking innovative ways to leverage privacy enhancing technologies to meet their needs to protect identifying information in unstructured text (you can check out the case studies here).
We’re also testing and validating additional use cases that we may support down the line (for example, de-identifying sensitive data in longer form text, images, and live chats), and we will continue to evolve the product.
Privitar Data Privacy for Chat combines a set of de-identification transformations with advanced deep learning models for natural language processing to allow users to apply privacy policies on English-language conversational text data.
A state-of-the-art deep learning model locates and classifies sensitive data within a block of text. A transformation engine applies Privitar’s policy rules to the classified text. We have made it available as a Java-based SDK to make it easy to incorporate it into any data pipeline.
The model is written using PyTorch™ and uses a combination of publicly available frameworks (Flair, GloVe, BERT) and custom network architectures to deliver high-performance classification on real-world data.
The model used for classification is trained using patent-pending active learning techniques. Active learning keeps a human in the loop during model training, and the model itself is used to select which records to label. This allows us to easily add new classes of sensitive information or refine the model to work on custom linguistic styles with lower labeling effort.
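For readers new to active learning, the selection step can be sketched in a few lines of Python. This shows uncertainty sampling, one common textbook strategy, and is only an illustration: the selection method Privitar actually uses is not described here, and the `least_confident` function and the confidence scores below are hypothetical stand-ins for a trained model.

```python
from typing import Callable, List, Sequence


def least_confident(
    records: Sequence[str],
    predict_confidence: Callable[[str], float],
    batch_size: int,
) -> List[str]:
    """Return the records the model is least sure about, lowest confidence first."""
    return sorted(records, key=predict_confidence)[:batch_size]


# Hypothetical confidence scores standing in for a real model's predictions.
fake_confidence = {
    "my card number is 4111 1111 1111 1111": 0.98,
    "call me on oh seven seven double nine": 0.41,
    "thanks, that fixed it": 0.95,
    "send it to flat 2b above the bakery": 0.55,
}

# These are the records a human annotator would be asked to label next.
to_label = least_confident(
    list(fake_confidence),
    lambda record: fake_confidence[record],
    batch_size=2,
)
```

The annotator labels only the selected records, the model is retrained on them, and the cycle repeats, which is why the labeling effort stays lower than annotating the whole dataset up front.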
The transformation engine uses Privitar policies to determine how it should de-identify text (and any accompanying structured data). This allows for fine-grained control over how text is de-identified.
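The two-stage flow described above (locate and classify, then transform according to policy) can be sketched as follows. This is a toy illustration under stated assumptions, not the product’s implementation: regular expressions stand in for the deep learning model, and the `Span` class, the `PATTERNS` table, the `POLICY` mapping, and the `deidentify` function are all hypothetical names invented for this example.

```python
import re
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Span:
    start: int
    end: int
    label: str


# Stand-in detector: simple regular expressions instead of a neural entity tagger.
PATTERNS = {
    "EMAIL": re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+"),
    "PHONE": re.compile(r"\+?\d[\d\s-]{7,}\d"),
}


def detect(text: str) -> List[Span]:
    """Locate and classify sensitive spans, ordered by position in the text."""
    spans = [
        Span(match.start(), match.end(), label)
        for label, pattern in PATTERNS.items()
        for match in pattern.finditer(text)
    ]
    return sorted(spans, key=lambda span: span.start)


# Hypothetical policy: one transformation rule per class of sensitive data.
POLICY: Dict[str, Callable[[str], str]] = {
    "EMAIL": lambda value: "[EMAIL]",                                # redact entirely
    "PHONE": lambda value: "*" * (len(value) - 2) + value[-2:],      # keep last two characters
}


def deidentify(text: str) -> str:
    """Apply the policy rule for each detected span and rebuild the text."""
    pieces, cursor = [], 0
    for span in detect(text):
        if span.start < cursor:
            continue  # ignore a span that overlaps one already handled
        pieces.append(text[cursor:span.start])
        pieces.append(POLICY[span.label](text[span.start:span.end]))
        cursor = span.end
    pieces.append(text[cursor:])
    return "".join(pieces)


protected = deidentify("Reach me at jane@example.com or +44 7700 900123 please.")
```

Keeping detection and policy separate is what gives the fine-grained control: changing how phone numbers are handled means editing one policy entry, without touching the detector.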
The product is made available as a Java-based SDK, which allows integration with the most popular data processing tools. Hadoop®, Spark®, NiFi®, Flink®, Kafka®, and other open-source tools can easily call the SDK to de-identify text. Commercial tools such as Databricks™, Confluent®, and StreamSets® can also be integrated easily. If you need help in the process, Privitar’s professional services team can provide sample code to accelerate your integration as well as expert advice on crafting the most suitable privacy policies for you.
Privitar Data Privacy for Chat helps solve an acute pain point for many organizations: de-identifying short, free-form text messages. With Privitar Data Privacy for Chat, you can mitigate privacy risks when analyzing conversational chat data, enabling you to tap into the value of this data and open up a new world of insights for your organization.