Privitar Labs

Protecting sensitive information in unstructured text

Unstructured Text

Emails, documents, chat logs, patient notes and speech-to-text output all contain free-form text that does not conform to a particular tabular structure or schema. Together with images, video and audio, unstructured data is estimated to account for 80% of all data by 2025.

Important insights can be gathered from unstructured data, such as understanding relationships between patient symptoms, diagnoses and patient outcomes from medical notes; or measuring customer satisfaction and the root cause of problems from online chat logs.

Unstructured text has historically been a relatively unmined data source due to the difficulty of processing and extracting what’s important. But new techniques in natural language processing have begun to enable analytics on unstructured data, leading in turn to a pressure to open up that data for safe use, while respecting privacy.

Privacy challenges

Unstructured data brings novel privacy challenges. Free text data can contain sensitive or confidential information such as names, addresses, phone numbers, references and other identifiers. It is also very rich, with much contextual information embedded in the phrasing and structure.

Privacy risks in unstructured text can be subtle. For example, free text medical notes might not contain a patient's name. However, a description of the date and circumstances under which a patient sustained an injury, together with their subsequent treatment, could be enough to identify them. Information about hereditary medical conditions could also disclose sensitive information about their family members.

Sometimes the absence of a term may be significant. This makes quantification of privacy risk in unstructured text even more difficult than in structured data. Organisations may struggle to eliminate the risk entirely, but they can reduce it significantly; what counts as sufficient depends on who is analysing the data, and for what purpose.

Privacy techniques and approaches

One approach is to find and remove identifiers before the analytics task takes place. Unlike in tabular data, this identifying information can appear anywhere in the free text, so the main challenge is to recognise and classify which parts of the text contain potential identifiers. There are two valid approaches to protecting unstructured data, each with strengths and weaknesses.

1. Using machine learning to find and remove identifying data

Most simply and obviously, pattern matching can be used to find and redact things like email addresses and social security numbers. More sophisticated approaches employ machine learning to interpret the context of a document - for example, using named entity recognition models to find identifying fields based on how they are used in a sentence. Machine learning models may be used to detect fields such as names, locations, organisations, dates and phone numbers in free-form text. Rapid progress has been made in this area, and we can now expect algorithms to detect up to 95% of direct identifiers this way.

These techniques rely on access to large amounts of training data - free text records where direct identifiers have been accurately labelled. Better performance can be achieved when the training data is specific to the task domain - for example, using language and identifiers specific to healthcare or financial services. Often it is not possible to obtain this domain-specific training data, so models can be trained on more generic data and then adapted to the domain using a smaller specific dataset.

De-identification has its limits

However, this approach will never be perfect, and we do not expect it to be able to fully anonymise text, nor to detect more nuanced identifying descriptions. Free text contains so much context that it is difficult to formally guarantee privacy is protected. Such a redaction approach can be suitable for protecting datasets for internal use, for example in providing de-identified chat logs to trusted data scientists for model training and analysis.
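To make the pattern-matching step concrete, here is a minimal sketch of regex-based redaction. The patterns and placeholder labels are illustrative assumptions only; a production system would combine a vetted pattern library with a trained named entity recognition model to catch identifiers that do not follow a fixed format.

```python
import re

# Illustrative patterns only - real deployments need far more
# comprehensive pattern sets plus an NER model for names, locations, etc.
PATTERNS = {
    "EMAIL": re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.-]+\b"),
    "SSN": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "PHONE": re.compile(r"\b\d{3}[-.\s]\d{3}[-.\s]\d{4}\b"),
}

def redact(text: str) -> str:
    """Replace each matched identifier with a category placeholder."""
    for label, pattern in PATTERNS.items():
        text = pattern.sub(f"[{label}]", text)
    return text
```

For example, `redact("Contact jane.doe@example.com or 555-867-5309.")` yields `"Contact [EMAIL] or [PHONE]."`. Keeping a category placeholder, rather than deleting the match outright, preserves some analytic utility in the redacted text.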

2. Bringing the analytics to the data

An alternative approach is to not grant access to the data directly, even in de-identified form, but instead to bring the analytics to the data. The user is allowed to ask restricted statistical queries of the data, or to bring code to be run against the data, and only sees the result of that computation, potentially further protected by differential privacy.
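As a hedged illustration of how differential privacy can protect the result of such a restricted query, here is a minimal sketch of the Laplace mechanism applied to a counting query. The function names and parameters are our own assumptions for illustration, not a specific Privitar API.

```python
import math
import random

def laplace_noise(scale: float, rng: random.Random) -> float:
    # Inverse-CDF sampling from the Laplace(0, scale) distribution.
    u = rng.random() - 0.5
    return -scale * math.copysign(1.0, u) * math.log(1.0 - 2.0 * abs(u))

def private_count(records, predicate, epsilon: float, rng: random.Random) -> float:
    # A counting query has sensitivity 1: adding or removing one record
    # changes the true count by at most 1, so the noise scale is 1/epsilon.
    true_count = sum(1 for record in records if predicate(record))
    return true_count + laplace_noise(1.0 / epsilon, rng)
```

A smaller epsilon adds more noise, giving stronger privacy at the cost of accuracy; the analyst only ever sees the noisy answer, never the underlying records.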

Here the challenge is in developing and testing these safe analytic processes. There is typically a need for de-identified data (or synthetic data) to create and test this analytic machinery, before it is ready to be used against real data. Hence the two approaches are complementary.

At Privitar, we are developing privacy techniques that combine these complementary approaches. We are evaluating the accuracy of de-identification performed by machine learning models tuned for specific industries, and are working to ensure that the output data remains accurate enough for data analysts and AI applications. We're keen to collaborate with customers seeking to use unstructured data safely.

Team up with Privitar Labs

Do contact us if you’d like our help in providing privacy protection to enable analytics and processing of unstructured text data.