By Marcus Grazette, Europe Policy Lead at Privitar

AI and machine learning (ML) technologies are helping people do remarkable things and are becoming more widely used. As interest in these technologies grows, so does regulatory interest, with regulators engaging proactively with industry to help shape thinking. Recent examples include the Information Commissioner’s Office (ICO) Project ExplAIn, on AI explainability under the GDPR, and the European Commission’s White Paper on AI, a proposal for the future of AI regulation.

Linked to Project ExplAIn, the ICO recently hosted a workshop on data minimisation and machine learning. Privitar participated alongside four leading technology companies including Google and Facebook. The challenges relating to data minimisation and machine learning are well documented, and I’ve argued in a previous blog post that applying effective data minimisation can improve the ML development process.

But the recent workshop considered a different angle: how could an organisation demonstrate compliance with the data minimisation principle when running an ML project? The GDPR’s accountability principle requires organisations to be able to demonstrate compliance. It is not enough to comply with data minimisation; you also have to be able to show that you have complied.

That brings us back to the data. A machine learning model looks for patterns in input or training data and applies those patterns to new data in order to make a decision, which could be a prediction or a classification. A model will perform well when presented with new data that resembles the training data. That makes the training data a hugely important part of understanding the model’s decisions.

With that in mind, the ICO’s guidance encourages organisations to collect and process training data in an “explanation aware” manner. European regulators take a similar view. The European Commission’s White Paper on AI frames training data as core to an AI system’s performance. The Commission proposes three requirements for training data:

  • Safety: the data should be sufficiently broad to ensure that the AI system can avoid dangerous situations.
  • Non-discrimination: the training data should be sufficiently representative.
  • Privacy and personal data protection: linking back to the GDPR.

It also proposes that organisations document the training dataset (i.e. its characteristics, what values were selected for inclusion, etc.) and in some cases retain a copy of the training data itself to allow issues with the model’s performance to be traced and understood.
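As a rough illustration of what documenting a training dataset could look like in practice, the Python sketch below builds a small record of a dataset's characteristics together with a fingerprint of the retained copy. The function name `describe_dataset` and the layout of the record are assumptions made for this example; neither the White Paper nor the ICO prescribes a format.

```python
import hashlib
import json


def describe_dataset(rows, selection_notes):
    """Build a simple documentation record for a training dataset.

    `rows` is a list of dicts holding scalar values; `selection_notes` is
    free text explaining what was included and why. The record layout is
    a hypothetical one chosen for this sketch.
    """
    columns = sorted({key for row in rows for key in row})

    # Canonical JSON so that the same data always yields the same fingerprint.
    canonical = json.dumps(rows, sort_keys=True, separators=(",", ":"))

    summary = {}
    for col in columns:
        values = [row[col] for row in rows if row.get(col) is not None]
        summary[col] = {"present": len(values), "distinct": len(set(values))}

    return {
        "row_count": len(rows),
        "columns": columns,
        "column_summary": summary,
        "selection_notes": selection_notes,
        # Fingerprint of the retained copy, to show later that it is unchanged.
        "sha256": hashlib.sha256(canonical.encode("utf-8")).hexdigest(),
    }


if __name__ == "__main__":
    training_rows = [
        {"age": 30, "region": "north"},
        {"age": 41, "region": "north"},
    ]
    record = describe_dataset(training_rows, "Adults only; names excluded.")
    print(json.dumps(record, indent=2))
```

Storing a record like this alongside the retained copy gives an auditor something concrete to check when a question about the model's behaviour needs to be traced back to its training data.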

The White Paper is a consultation document, so it’s too early to say whether these specific recommendations will make it into law. However, it’s clear that the trend is towards a greater focus on training data as a key element of building compliant machine learning systems.

There are a number of practical steps that organisations can take to help ensure compliance. They include carefully documenting any pre-processing, including transformations that protect individual privacy such as pseudonymisation, and recording decisions about what data to include in the training dataset. Centralised privacy management can help.

Centralising privacy management offers a number of advantages. First, it fosters a consistent approach across an organisation by creating a central forum for decisions about pre-processing to take place. In contrast, an ad hoc, project-specific approach can be slow, inconsistent and complicated to audit. Second, centralisation allows you to document transformations applied to the data (e.g. tokenisation). That can help to speed up data preparation for an ML project, because decisions on how to construct the training dataset can be taken once and then applied consistently. Incidentally, documenting transformations supports compliance with the GDPR requirement to record processing (Article 30) and explainability in the context of the ICO’s guidance.
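To make the idea of "decide once, apply consistently, and document" more concrete, here is a minimal Python sketch. It is not a description of any particular product: the `PrivacyPolicy` and `TransformationRecord` classes, their fields and the choice of a keyed HMAC-SHA256 hash for tokenisation are all assumptions made for illustration.

```python
import hashlib
import hmac
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass
class TransformationRecord:
    """One documented pre-processing decision (hypothetical schema)."""
    column: str
    technique: str
    reason: str
    applied_at: str


@dataclass
class PrivacyPolicy:
    """A central definition of how columns are treated (hypothetical)."""
    secret_key: bytes
    tokenise: set
    drop: set
    log: list = field(default_factory=list)

    def _token(self, value):
        # Keyed hash: the same input always maps to the same token,
        # so records can still be linked without exposing the raw value.
        digest = hmac.new(self.secret_key, value.encode("utf-8"), hashlib.sha256)
        return digest.hexdigest()[:16]

    def _record(self, column, technique, reason):
        now = datetime.now(timezone.utc).isoformat()
        self.log.append(TransformationRecord(column, technique, reason, now))

    def apply(self, rows):
        """Return transformed copies of `rows` and log each decision."""
        transformed = []
        for row in rows:
            new_row = {}
            for column, value in row.items():
                if column in self.drop:
                    continue  # data minimisation: leave the column out
                if column in self.tokenise:
                    new_row[column] = self._token(str(value))
                else:
                    new_row[column] = value
            transformed.append(new_row)

        for column in sorted(self.drop):
            self._record(column, "dropped", "not needed for the model's purpose")
        for column in sorted(self.tokenise):
            self._record(column, "hmac-sha256 tokenisation", "pseudonymise identifier")
        return transformed


if __name__ == "__main__":
    policy = PrivacyPolicy(secret_key=b"demo-key", tokenise={"email"}, drop={"name"})
    raw = [{"name": "Ada", "email": "ada@example.org", "age": 30}]
    print(policy.apply(raw))
    for entry in policy.log:
        print(entry)
```

Because the policy object is defined once and reused, every project that calls `apply` gets the same treatment of each column, and the log doubles as a starting point for the record of processing.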

At a strategic level, a culture of accountability can help to drive innovation. Multidisciplinary teams of engineers, risk experts and business line leaders can work together on ML projects that use only the data they need to answer the most pressing business questions the organisation faces.

Interested in learning more on this topic? Check out this blog post on how data privacy can help data scientists.