The Big Data paradigm assumes that more data means better machine learning (ML) models, generating deeper insights and better predictions. This framing can imply a tension between data minimisation and building high-performing ML models. Often, that tension is more apparent than real: data minimisation can have a positive impact on ML model development as well as improving privacy. Our most recent Data Policy Network evening explored this debate, and this follow-up blog sets out the Privitar view.

The Big Data paradigm gained traction in the 2000s. The cost of storage fell and the sources of data multiplied. Organisations built vast data lakes, often on the assumption that more was better. As the European Data Protection Supervisor noted, “the perceived opportunities in big data provide incentives to collect as much data as possible and to retain this data as long as possible for yet unidentified future purposes”.[1]

However, data protection regimes require organisations to comply with the data minimisation principle when they process personal data, including when using it to build ML models. In Europe, this principle states that personal data should be “adequate, relevant and limited to what is necessary in relation to the purposes” for which the data are processed. This is not a new requirement; it has existed since the 1995 Directive.[2] Similar requirements exist in the US: the NIST privacy engineering goals require that personal data is not processed “beyond the operational requirements of the system”.[3]

On the face of it, the two concepts are in tension. A data scientist might argue that her work is exploratory: she doesn’t know in advance what correlations her model may unearth, so the training dataset needs to be as large as possible or she risks missing an innovative insight. But using all the available data on the basis that it could turn out to be necessary contravenes the data minimisation principle. The ICO, which regulates data protection in the UK, states that “finding the correlation does not retrospectively justify obtaining the data”.[4]

But we can dig deeper by challenging two fundamental assumptions. First, more data is not always ‘better’. Second, we should define ‘better’ broadly: not just in terms of model performance, but also in terms of other important factors, such as social impact. We’ll examine each in turn.

Before we can challenge the first assumption, we should be clear about what we mean by more data. For example, let’s assume that our data scientist builds a model to predict whether an individual is likely to miss a loan repayment. She needs a training dataset: examples of individuals who have either missed repayments or not, together with some attributes of those individuals (e.g. age, income, employment status). More data might mean increasing the number of examples, using a million individuals instead of a thousand. Or it might mean increasing the number of attributes recorded about each individual, adding ZIP code, marital status, level of education, and so on. In the age of big data, the list of possible attributes could be very long: the Alteryx breach exposed a dataset containing 248 attributes for 120 million US households.[5] The three graphs below show why adding more examples or more attributes might not always improve the model, particularly if the attributes are irrelevant to the prediction the model aims to produce.
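
To make the two meanings of ‘more data’ concrete, here is a minimal sketch of what such a training dataset could look like in code. The column names and values are hypothetical, and pandas is just one convenient way to hold tabular training data.

```python
# Hypothetical loan-repayment training data: each row is one individual
# (an example), each column is one attribute, plus the target label.
import pandas as pd

training_data = pd.DataFrame({
    "age":               [34, 52, 29, 45],
    "income":            [42_000, 61_000, 38_000, 55_000],
    "employment_status": ["employed", "self-employed", "employed", "unemployed"],
    "missed_repayment":  [0, 0, 1, 1],   # target: 1 = missed a repayment
})

# "More data" can mean more rows (a million individuals instead of a thousand)
# or more columns (ZIP code, marital status, level of education, ...).
print(training_data.shape)   # (rows, columns) = (examples, attributes + label)
```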

Researchers have shown that some models reach a performance plateau, beyond which adding more examples does not improve performance.[6] It’s important to remember that adding data comes with costs, in terms of compute time, storage and increased privacy risk. Effective data minimisation reduces those costs, because you’re not processing data you don’t need.
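
One way to check whether extra examples are still paying their way is to plot a learning curve. The sketch below assumes scikit-learn and a synthetic dataset; the model choice and numbers are illustrative only. If the validation score flattens out as the training set grows, the additional examples are adding cost without adding value.

```python
# A minimal sketch of checking for a performance plateau with a learning curve.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import learning_curve

# Synthetic stand-in for a real training dataset.
X, y = make_classification(n_samples=20_000, n_features=20, random_state=0)

train_sizes, _, test_scores = learning_curve(
    LogisticRegression(max_iter=1000), X, y,
    train_sizes=np.linspace(0.05, 1.0, 8), cv=5,
)

# Mean cross-validated score at each training-set size: watch where it flattens.
for size, score in zip(train_sizes, test_scores.mean(axis=1)):
    print(f"{size:>6} examples -> mean CV accuracy {score:.3f}")
```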

In other cases, our data scientist will come up against the so-called ‘curse of dimensionality’, first described by mathematician Richard Bellman in 1961. A model can achieve peak performance when supplied with an optimal number of features or attributes; adding more beyond that point is counterproductive. Here, data minimisation means carefully selecting the attributes most relevant to the use case.[7]
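
In practice this often comes down to feature selection: scoring the candidate attributes and keeping only the most informative ones. A minimal sketch with scikit-learn on synthetic data (the parameters are illustrative, not a recommendation) might look like this.

```python
# A minimal sketch of attribute selection as a data-minimisation step:
# keep only the k most informative features.
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, mutual_info_classif

# 50 attributes, of which only 5 actually carry signal for the target.
X, y = make_classification(n_samples=2_000, n_features=50, n_informative=5,
                           random_state=0)

selector = SelectKBest(score_func=mutual_info_classif, k=5)
X_reduced = selector.fit_transform(X, y)

print(X.shape, "->", X_reduced.shape)                 # (2000, 50) -> (2000, 5)
print("kept feature indices:", selector.get_support(indices=True))
```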

Data minimisation can also improve privacy. Adding a large number of variables can lead to overfitting, where the model learns very specific features of the training data. The graph on the left shows two versions of a model classifying data points into two groups, either red or blue. The smooth purple line shows a simple model that sometimes gets it wrong. The orange line shows a much more complicated model suffering from overfitting. As Privitar CTO Jason McFall explained, the latter is vulnerable to membership inference attacks: an attacker could infer that an individual with specific characteristics was in the training dataset if the model’s predictions about that individual carry a very high degree of confidence. A complex, overfitted model will also perform less well on new data.[8]
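
The link between overfitting and membership inference can be made concrete: an overfitted model tends to be far more confident about the records it was trained on than about unseen records, and that confidence gap is what an attacker measures. The sketch below uses scikit-learn decision trees on synthetic data; the models and numbers are purely illustrative, not a description of any real attack tooling.

```python
# A minimal sketch of the confidence gap an overfitted model creates between
# training members and non-members.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic data with some label noise, so an unrestricted tree memorises it.
X, y = make_classification(n_samples=2_000, n_features=20, flip_y=0.2,
                           random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

def mean_confidence(model, X, y):
    """Average probability the model assigns to each record's true label."""
    proba = model.predict_proba(X)
    return proba[np.arange(len(y)), y].mean()

for name, model in [
    ("simple (max_depth=3)", DecisionTreeClassifier(max_depth=3, random_state=0)),
    ("overfitted (unrestricted)", DecisionTreeClassifier(random_state=0)),
]:
    model.fit(X_train, y_train)
    print(f"{name}: members {mean_confidence(model, X_train, y_train):.2f}, "
          f"non-members {mean_confidence(model, X_test, y_test):.2f}")
```

The simple model shows similar confidence on both groups; the overfitted one is near-certain about its training members and noticeably less sure about everyone else, which is exactly the signal a membership inference attacker looks for.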

Challenging the second assumption means thinking broadly about our definition of ‘better’. We already do this in everyday life. We decide which car is better not just on how quickly it accelerates but also on running costs, safety and resale value. Crunching through only the data we need could mitigate privacy risk, reduce the climate footprint of data centres and still produce usable predictions. ‘Better’ models should also be those which support privacy by design and are simple enough to be explainable, in line with transparency principles and individual expectations.

Developing better AI systems means rethinking basic assumptions. These examples show that data minimisation can help to protect privacy, support regulatory compliance and build better machine learning models. Leading organisations will embrace this: it will mean that their AI projects are future-proof, compliant and aligned with their values.

  1. EDPS, Opinion 7/2015, Meeting the challenges of big data, 19 Nov 2015
  2. GDPR, Article 5(1)(c) and 1995 Directive, Article 6(1)(c)
  3. NIST, Privacy Framework
  4. ICO, Big data, artificial intelligence, machine learning and data protection, 4 Sept 2017
  5. Forbes, 120 Million American Households Exposed In ‘Massive’ ConsumerView Database Leak, 19 Dec 2017
  6. Karin Kruup, Clearing the buzzwords in machine learning, 23 May 2018
  7. Open Data Science, Confronting the Curse of Dimensionality, 4 April 2019
  8. Jason McFall, In:Confidence 2019