Comparisons of risk prediction algorithms using three clinical research databases (QResearch, CPRD Aurum and CPRD Gold)

Status

Ongoing

Title

What is the aim of the study and why is it important?

QResearch, CPRD Gold and CPRD Aurum databases are three large general practice databases which are widely used for research. The CPRD databases are similar to the QResearch database but contain a different group of practices and are linked to different external data sources.

Our study will compare characteristics of the databases, including how common various diseases are and how well the data are recorded in each. We will develop new prediction algorithms and compare them with existing algorithms. We will then check how well various risk algorithms work in each data source. For example, we will develop algorithms on QResearch and test them on both CPRD databases.

Risk algorithms are tools which work out the chances that a patient has, or might develop in the future, a disease such as diabetes or cancer, based on information about them such as their age, sex, ethnicity, illnesses and treatments. In clinical practice, such tools can be used to help patients understand their risk of different diseases and to identify those who might need help to reduce their risk or a referral to hospital for tests. In this study, we want to see how well these algorithms work on each database and to understand the similarities and differences between the databases, which will help us interpret the results.

How is the research being done?

OBJECTIVE 1

To identify and quantify systematic differences between the three UK research databases (QResearch, CPRD Gold and CPRD Aurum) including geographical spread, diversity of the registered patient population, and clinical coding.

We will analyse linked GP data to assess the completeness of recording of outcomes (e.g. diabetes), and to understand differences and similarities between the databases, by examining the number of cases recorded in each of the following combinations of data sources:
(a) GP record alone
(b) GP record or deaths record
(c) GP or HES record
(d) GP or HES or death record
(e) GP or HES or death or cancer registry
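The comparison across definitions (a) to (e) can be sketched as a series of set unions. This is a minimal illustration only: the patient IDs are made up, and the function name `cases_by_source` and the dictionary keys are hypothetical rather than taken from the study protocol.

```python
# Hypothetical patient IDs with a recorded outcome (e.g. diabetes) in each source.
gp = {1, 2, 3, 4}
hes = {3, 4, 5}
deaths = {5, 6}
cancer_registry = set()  # empty here; the registry only contributes cancer outcomes


def cases_by_source(gp, hes, deaths, registry):
    """Count distinct cases under each cumulative source definition (a)-(e)."""
    return {
        "a_gp": len(gp),
        "b_gp_or_deaths": len(gp | deaths),
        "c_gp_or_hes": len(gp | hes),
        "d_gp_or_hes_or_deaths": len(gp | hes | deaths),
        "e_all_sources": len(gp | hes | deaths | registry),
    }


counts = cases_by_source(gp, hes, deaths, cancer_registry)

# Share of all identified cases that the GP record alone captures.
gp_completeness = counts["a_gp"] / counts["e_all_sources"]
```

Running the same counts on each database gives a like-for-like view of how much the linked sources add to the GP record.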

We will also compare rates with external data sources where available. For example, we will compare mortality rates with published statistics from the Office for National Statistics.

OBJECTIVE 2

To validate the performance of multiple new and existing risk prediction algorithms for identifying patients at risk of different types of outcomes on each of the three databases (QResearch, CPRD Gold and CPRD Aurum). This includes the assessment of discrimination, calibration, decision curve analysis, sensitivity, specificity, and positive and negative predictive values at different thresholds, both with and without accounting for the competing risk of death. It will also include cross-validation, by developing some models in each database and validating them in the other two.
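Several of these measures can be illustrated for a single risk threshold. The sketch below assumes a simple binary outcome with no censoring and no competing risk of death, so it is a simplification of what the study describes; the data and the function names (`threshold_metrics`, `c_statistic`) are hypothetical. The net benefit line uses the standard decision curve formula, true positives per patient minus false positives per patient weighted by the odds of the threshold.

```python
def confusion(y_true, risk, threshold):
    """Return (tp, fp, tn, fn), treating risk >= threshold as 'high risk'."""
    tp = fp = tn = fn = 0
    for y, r in zip(y_true, risk):
        high = r >= threshold
        if high and y:
            tp += 1
        elif high and not y:
            fp += 1
        elif not high and y:
            fn += 1
        else:
            tn += 1
    return tp, fp, tn, fn


def threshold_metrics(y_true, risk, threshold):
    """Classification measures at one threshold (assumes no zero denominators)."""
    tp, fp, tn, fn = confusion(y_true, risk, threshold)
    n = tp + fp + tn + fn
    return {
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
        "ppv": tp / (tp + fp),
        "npv": tn / (tn + fn),
        # Decision curve analysis: net benefit at this threshold.
        "net_benefit": tp / n - (fp / n) * threshold / (1 - threshold),
    }


def c_statistic(y_true, risk):
    """Discrimination: share of case/non-case pairs ranked correctly (ties = 0.5)."""
    cases = [r for y, r in zip(y_true, risk) if y]
    non_cases = [r for y, r in zip(y_true, risk) if not y]
    score = sum(
        1.0 if c > k else 0.5 if c == k else 0.0
        for c in cases
        for k in non_cases
    )
    return score / (len(cases) * len(non_cases))


# Made-up outcomes (1 = developed the disease) and predicted risks.
y_true = [1, 1, 0, 1, 0, 0, 0, 0]
risk = [0.9, 0.7, 0.6, 0.3, 0.4, 0.2, 0.1, 0.05]

metrics = threshold_metrics(y_true, risk, threshold=0.5)
c_stat = c_statistic(y_true, risk)
```

Repeating the calculation over a range of thresholds, and on each database in turn, gives the kind of side-by-side comparison this objective describes.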

Combined with objective 1, we will determine whether the inclusion of linked data materially affects the calibration or discrimination of the algorithms.

Chief Investigator

Professor Julia Hippisley-Cox

Sponsor

Oxford

Location of research

University of Oxford

Date on which research approved

22-May-2023

Project reference ID

OX330

Generic ethics approval reference

18/EM/0400

Are all data accessed in anonymised form?

Yes

Brief summary of the dataset to be released (including any sensitive data)

We will undertake cohort studies in a large open cohort of primary care patients, using data from all three databases: QResearch, CPRD Gold and CPRD Aurum. We will include all practices which have been using their current GP clinical computer system for at least a year. We will use the latest data available from each database at the time of the analysis. We will identify cohorts from each database which will include patients registered with practices on or after 01 Jan 1998 until the latest date for which data are available at the time of the study.
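One way to read these inclusion rules is as a cohort entry date per patient. The sketch below assumes that entry is the latest of the study start date, the patient's registration date, and one year after the practice installed its current computer system. That rule, and the names `cohort_entry` and `add_one_year`, are illustrative assumptions and not the study's stated definition.

```python
from datetime import date

STUDY_START = date(1998, 1, 1)  # from the summary: 01 Jan 1998


def add_one_year(d):
    """Return the same calendar day one year later (29 Feb maps to 28 Feb)."""
    try:
        return d.replace(year=d.year + 1)
    except ValueError:
        return d.replace(year=d.year + 1, day=28)


def cohort_entry(registration_date, system_install_date, data_end):
    """Assumed entry rule; returns None if the patient never becomes eligible."""
    entry = max(
        STUDY_START,
        registration_date,
        add_one_year(system_install_date),
    )
    return entry if entry <= data_end else None


data_end = date(2023, 1, 1)  # hypothetical latest date with available data

# Registered and on the current system well before the study window opens.
early = cohort_entry(date(1995, 6, 1), date(1996, 3, 1), data_end)

# Practice changed system in Sept 2004, so entry waits until Sept 2005.
delayed = cohort_entry(date(2005, 5, 10), date(2004, 9, 1), data_end)

# System installed less than a year before the data end: not eligible.
excluded = cohort_entry(date(2010, 1, 1), date(2022, 6, 1), data_end)
```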

We will use the GP data linked to hospital episode statistics, cancer registry and mortality.

1. GP Data
demographics including age, sex, ethnicity, deprivation, region
clinical diagnoses - major chronic diseases e.g. diabetes, cardiovascular disease, thromboembolism, cancer, fracture, haemorrhage
clinical values e.g. body mass index, smoking, alcohol
laboratory investigations e.g. full blood count, electrolytes, liver function tests, CA125
commonly prescribed medication

2. HES Data
HES data to identify outcomes of interest e.g. diabetes, cardiovascular disease, thromboembolism, cancer, fracture, haemorrhage

3. Mortality Data
mortality data to identify outcomes of interest on the death certificate e.g. diabetes, cardiovascular disease, thromboembolism, cancer, fracture, haemorrhage and cancer treatments.

4. Cancer Registry Data
Cancer registry data to identify characteristics of cancers e.g. type, location, stage, grade, route to diagnosis, treatments (e.g. chemotherapy, radiotherapy, hormonal, surgery)

Funding Source

John Fell Fund

Research Team

Julia Hippisley-Cox, University of Oxford

Carol AC Coupland, University of Oxford

Mona Bafadhel, King’s College London

Richard EK Russell, King’s College London

Aziz Sheikh, University of Edinburgh

Peter Brindle, University of Bristol

Keith M. Channon, University of Oxford

Approval Letter

Download Approval Letter

Publications

Development and validation of a new algorithm for improved cardiovascular risk prediction
Authors: Hippisley-Cox J, Coupland CAC, Bafadhel M, Russell REK, Sheikh A, Brindle P, Channon KM
Ref: https://www.nature.com/articles/s41591-024-02905-y

Press Releases

Access Type

Trusted Research Environment (TRE)
