Emails, documents, chat logs, patient notes and speech-to-text output all contain free-form text that does not conform to a particular tabular structure or schema. Together with images, video and audio, such unstructured data is estimated to account for 80% of all data by 2025.
Important insights can be gathered from unstructured data, such as understanding relationships between patient symptoms, diagnoses and patient outcomes from medical notes; or measuring customer satisfaction and the root cause of problems from online chat logs.
Unstructured text has historically been a relatively unmined data source because of the difficulty of processing it and extracting what’s important. But new techniques in natural language processing have begun to enable analytics on unstructured data, leading in turn to pressure to open up that data for safe use while respecting privacy.
An alternative to releasing the data in de-identified form is not to grant direct access to it at all, but instead to bring the analytics to the data. The user is allowed to ask restricted statistical queries of the data, or to bring code to be run against the data, and sees only the result of that computation, potentially further protected by differential privacy.
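As a rough illustration of a restricted statistical query protected by differential privacy, the sketch below answers "how many records mention a keyword?" and adds Laplace noise to the result. It is a minimal sketch and not a description of any particular product: the function names, the sample notes and the choice of the Laplace mechanism are assumptions made for this example, and it assumes each record belongs to a single individual, so that adding or removing one person changes the count by at most 1.

```python
import random


def laplace_noise(scale, rng):
    # The difference of two independent exponential draws, each with mean
    # `scale`, follows a Laplace distribution centred on 0 with that scale.
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def private_count(records, keyword, epsilon, rng=None):
    """Return a noisy count of the records that mention `keyword`.

    Hypothetical helper for illustration. Assumes one record per individual,
    so the counting query has sensitivity 1 and the noise scale is 1 / epsilon.
    """
    if epsilon <= 0:
        raise ValueError("epsilon must be positive")
    rng = rng or random.Random()
    needle = keyword.lower()
    true_count = sum(1 for text in records if needle in text.lower())
    return true_count + laplace_noise(1.0 / epsilon, rng)


# Made-up free-text notes; the analyst never sees these, only the noisy count.
notes = [
    "Patient reports fever and headache.",
    "Follow-up visit, no fever, mild cough.",
    "Routine check, no symptoms reported.",
]
print(private_count(notes, "fever", epsilon=0.5))
```

A smaller epsilon adds more noise and gives stronger protection at the cost of accuracy. A production system would also need to track the privacy budget spent across queries and guard against floating-point weaknesses in the noise sampling, which this sketch does not attempt.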
Here the challenge is in developing and testing these safe analytic processes. There is typically a need for de-identified data (or synthetic data) to create and test this analytic machinery, before it is ready to be used against real data. Hence the two approaches are complementary.
At Privitar, we are developing privacy techniques that follow both of these complementary approaches. We are evaluating the accuracy of de-identification performed by machine learning models tuned for specific industries, and are working to ensure that the output data can still be used with high accuracy by data analysts and for AI. We’re keen to collaborate with customers seeking to use unstructured data safely.
Do contact us if you’d like our help in providing privacy protection to enable analytics and processing of unstructured text data.