By Marcus Grazette, Europe Policy Lead at Privitar
Our latest data policy event, which we held in collaboration with One HealthTech London, explored the interaction between health data and characteristics such as gender and ethnicity in the context of data-driven healthcare. This is a long-running debate. Among others, Caroline Criado-Perez argues that failing to account for gender in drug trials hampers drug discovery, and Vyas, Eisenstein and Jones highlight the risk that race-adjusted clinical algorithms perpetuate inequality.
The COVID-19 pandemic has pulled these issues out of the academic journals and into the headlines. We’ve seen emerging evidence that the virus disproportionately affects men from Black and Asian communities, alongside concern that some of the measures taken in response to the pandemic could widen existing health inequalities.
We convened a stellar panel to guide us through the policy, legal and data science implications of using data on ethnic origin in a medical context: Sonia Patel, Chief Information Officer at NHSX; Cerys Wyn Davies, Partner at Pinsent Masons; and Noa Dagan, Head of AI and Data Driven Medicine at the Clalit Research Institute.

Sonia started by reminding us that the NHS already has a duty to address health inequalities. Collecting good quality data, including on ethnicity, allows us to understand the differences within populations and improve outcomes for each individual. Research suggests that the completeness of ethnicity recording has improved dramatically: on average 27% of patients in primary care data spanning 1990–2012 have ethnicity recorded, rising to 78% for patients registered after 2006. This is great progress, but Sonia flagged two challenges. First, collecting the data relies on a clinician asking a patient about their ethnicity. Those conversations can be challenging, so both clinician and patient may need support to feel comfortable. Second, ensuring that the data is used responsibly – to deliver better care, improve services and support research – so that everyone sees the benefit of collecting it in the first place.

Cerys focused on the legal and regulatory aspects of responsible data use. She picked up Sonia's point about using the data for everyone's benefit, citing the push for personalised medicine across the life sciences sector as an example. Personalisation can allow us to account for the differences between groups of patients and to tailor treatment accordingly.
The law allows us to use data to achieve this. The protected characteristics arise from UK equalities legislation, which bans discrimination on the basis of nine characteristics including race, sex and age. Special category data, which is subject to additional protection under the GDPR, includes all health-related data and overlaps with some of the protected characteristics. Regulators recognise the benefits of using these types of data for scientific research and have taken a pragmatic approach, for example by publishing specific guidance on health research. But pragmatism isn't the end of the story. Organisations sharing health data recognise that the rules exist to protect individuals, so they have to grapple with difficult regulatory questions. It may never be possible to reduce the risks associated with data sharing (e.g. re-identification, data misuse) to zero, but the pandemic shows a shared willingness to accept a level of risk in order to make progress. In addition, taking the time to work through issues like international transfers, legitimate interest, accountability and non-discrimination helps to improve data-driven projects. It also builds trust with stakeholders, including the patients whose data is being shared.
Assuming that we’ve collected the data and decided that we can share it, Noa focused on how data scientists can use data on protected characteristics in their research. Achieving the vision of personalised, predictive medicine powered by AI will only be possible if the models can make predictions that are accurate enough to be useful. This raises a model calibration question: not how accurate the model is in general, but how accurate it is at making predictions for specific subpopulations and at accounting for the intersections between subpopulations. For example, a model might be x% accurate overall but only y% accurate for females and z% for Caucasian females aged 50-55. That leaves data scientists with two big questions. First, should I include data on protected characteristics in my model at the risk of perpetuating historical bias? Second, if I do include the data, how do I ensure that my model is properly calibrated to use it effectively? Noa’s forthcoming paper on addressing fairness in prediction models by improving subpopulation calibration answers the second question. The first is much harder. Watch the event recording to hear Noa’s take.
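To make the subpopulation question concrete, here is a minimal sketch (not Noa's method, just an illustration of the general idea) of how a data scientist might compare a model's overall accuracy against its accuracy for intersectional subgroups. The data, subgroup keys and accuracy threshold are all hypothetical.

```python
from collections import defaultdict

def subgroup_accuracy(y_true, y_pred, groups):
    """Return overall accuracy and accuracy per subgroup.

    y_true, y_pred: sequences of 0/1 outcome labels
    groups: sequence of subgroup keys, e.g. (sex, age_band) tuples,
            so intersections are captured by the composite key
    """
    hits = defaultdict(int)    # correct predictions per subgroup
    totals = defaultdict(int)  # patients per subgroup
    for truth, pred, group in zip(y_true, y_pred, groups):
        totals[group] += 1
        hits[group] += int(truth == pred)
    overall = sum(hits.values()) / sum(totals.values())
    per_group = {g: hits[g] / totals[g] for g in totals}
    return overall, per_group

# Hypothetical example: the model looks reasonable overall,
# but underperforms for one intersectional subgroup.
y_true = [1, 0, 1, 1, 0, 1, 0, 0]
y_pred = [1, 0, 1, 0, 1, 1, 0, 0]
groups = [("F", "50-55"), ("M", "50-55"), ("F", "50-55"), ("F", "50-55"),
          ("F", "50-55"), ("M", "50-55"), ("M", "50-55"), ("M", "50-55")]

overall, per_group = subgroup_accuracy(y_true, y_pred, groups)
# overall is 0.75, but accuracy for ("F", "50-55") is only 0.5
```

The point of the exercise is that a single aggregate metric can mask exactly the disparities the panel discussed; checking each composite subgroup key surfaces them before deployment.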
We regularly organise events bringing experts together to discuss topical, challenging data policy questions. Join our Data Policy Network for details of upcoming events and find out how you can join the conversation.