
Comparisons of risk prediction algorithms using three clinical research databases (QResearch, CPRD Aurum and CPRD Gold)





What is the aim of the study and why is it important?

QResearch, CPRD Gold and CPRD Aurum databases are three large general practice databases which are widely used for research. The CPRD databases are similar to the QResearch database but contain a different group of practices and are linked to different external data sources.

Our study will compare characteristics of the databases, including how common various diseases are and how well the data are recorded in each. We will develop new prediction algorithms and compare them with existing algorithms. We will then check how well various risk algorithms work in each data source. For example, we will develop algorithms on QResearch and test them on both CPRD databases.

Risk algorithms are tools which work out the chances that a patient has got or might develop a disease in the future (such as diabetes or cancer), based on information about them such as their age, sex, ethnicity and illnesses and treatments. In clinical practice, such tools can be used to help patients understand their risk of different diseases and identify those who might need help to reduce their risk or referral to hospital for tests. In this study, we want to see how well these algorithms work on each database and to understand similarities and differences between the databases which will help us interpret the results.

How is the research being done?


To identify and quantify systematic differences between the three UK research databases (QResearch, CPRD Gold and CPRD Aurum) including geographical spread, diversity of the registered patient population, and clinical coding.

We will analyse linked GP data to assess the completeness of recording of outcomes (e.g. diabetes) by counting the cases recorded under each of the data source combinations listed below. These counts will then be analysed to understand differences and similarities between the databases.
(a) GP record alone
(b) GP record or death record
(c) GP or HES record
(d) GP or HES or death record
(e) GP or HES or death or cancer registry
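The comparison of case counts across definitions (a) to (e) can be sketched as follows. This is a minimal illustration only, not the study's actual code: the field names (`gp`, `hes`, `death`, `cancer_registry`), the `count_cases` helper and the toy patient records are all assumptions made for the example.

```python
from typing import Dict, List, Tuple

# Each definition lists the sources in which a recorded diagnosis counts as a case.
# The source field names are hypothetical and chosen for this sketch only.
DEFINITIONS: Dict[str, Tuple[str, ...]] = {
    "(a) GP record alone": ("gp",),
    "(b) GP or death record": ("gp", "death"),
    "(c) GP or HES record": ("gp", "hes"),
    "(d) GP or HES or death record": ("gp", "hes", "death"),
    "(e) GP or HES or death or cancer registry": (
        "gp", "hes", "death", "cancer_registry",
    ),
}


def count_cases(patients: List[Dict[str, bool]]) -> Dict[str, int]:
    """Count patients with the outcome recorded in at least one allowed source."""
    counts = {}
    for label, sources in DEFINITIONS.items():
        counts[label] = sum(
            1 for p in patients if any(p.get(s, False) for s in sources)
        )
    return counts


# Toy data: one flag per source showing where the outcome was recorded.
patients = [
    {"gp": True, "hes": False, "death": False, "cancer_registry": False},
    {"gp": False, "hes": True, "death": False, "cancer_registry": False},
    {"gp": False, "hes": False, "death": True, "cancer_registry": False},
    {"gp": False, "hes": False, "death": False, "cancer_registry": True},
    {"gp": False, "hes": False, "death": False, "cancer_registry": False},
]

counts = count_cases(patients)
for label, n in counts.items():
    print(f"{label}: {n}")
```

In this toy example the count rises as each linked source is added, which is the kind of difference the completeness analysis is designed to quantify for each database.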

We will also compare rates with external data sources where available. For example, we will compare mortality rates with published statistics from the Office for National Statistics.


To validate the performance of multiple new and existing risk prediction algorithms for identifying patients at risk of different types of outcomes on each of the three databases (QResearch, CPRD Gold and CPRD Aurum). This includes the assessment of discrimination, calibration, decision curve analysis, sensitivity, specificity, and positive and negative predictive values at different thresholds, both with and without accounting for the competing risk of death. It will also include cross validation by developing some models in each database and validating them in the other two.
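Several of the measures named above can be illustrated with a short sketch. It assumes a simple binary outcome and a predicted risk per patient, and it deliberately ignores censoring and the competing risk of death, which the study's time-to-event analyses would need to handle. The function names and toy data are hypothetical and are not taken from the study.

```python
from typing import Dict, Sequence


def threshold_metrics(
    risks: Sequence[float], outcomes: Sequence[int], threshold: float
) -> Dict[str, float]:
    """Sensitivity, specificity, PPV, NPV and net benefit at one risk threshold.

    Assumes 0 < threshold < 1 and binary outcomes (1 = event, 0 = no event).
    """
    tp = fp = tn = fn = 0
    for risk, outcome in zip(risks, outcomes):
        flagged = risk >= threshold
        if flagged and outcome:
            tp += 1
        elif flagged:
            fp += 1
        elif outcome:
            fn += 1
        else:
            tn += 1

    def ratio(num: int, den: int) -> float:
        return num / den if den else float("nan")

    n = tp + fp + tn + fn
    # Net benefit (decision curve analysis): true positives minus false
    # positives weighted by the odds of the threshold, per patient.
    odds = threshold / (1 - threshold)
    return {
        "sensitivity": ratio(tp, tp + fn),
        "specificity": ratio(tn, tn + fp),
        "ppv": ratio(tp, tp + fp),
        "npv": ratio(tn, tn + fn),
        "net_benefit": ratio(tp, n) - ratio(fp, n) * odds,
    }


def c_statistic(risks: Sequence[float], outcomes: Sequence[int]) -> float:
    """Discrimination: share of case/non-case pairs where the case has the
    higher predicted risk (ties count as half)."""
    cases = [r for r, y in zip(risks, outcomes) if y]
    non_cases = [r for r, y in zip(risks, outcomes) if not y]
    if not cases or not non_cases:
        return float("nan")
    score = 0.0
    for c in cases:
        for nc in non_cases:
            if c > nc:
                score += 1.0
            elif c == nc:
                score += 0.5
    return score / (len(cases) * len(non_cases))


# Toy predicted risks and observed outcomes for six patients.
risks = [0.05, 0.10, 0.20, 0.30, 0.60, 0.80]
outcomes = [0, 0, 1, 0, 1, 1]

metrics = threshold_metrics(risks, outcomes, threshold=0.25)
c_stat = c_statistic(risks, outcomes)
print(metrics)
print(round(c_stat, 3))
```

Running the same functions on predictions from each database, and at several thresholds, would give a like-for-like comparison of the kind described above.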

Combined with objective 1, we will determine whether the inclusion of linked data materially affects the calibration or discrimination of the algorithms.

Chief Investigator

Professor Julia Hippisley-Cox

Lead Applicant Organisation Name



Location of research

University of Oxford

Date on which research approved


Project reference ID


Generic ethics approval reference


Are all data accessed in anonymised form?


Brief summary of the dataset to be released (including any sensitive data)

We will undertake cohort studies in a large population of primary care patients from an open cohort using data from all three databases: QResearch, CPRD Gold and CPRD Aurum. We will include all practices which have been using their current GP clinical computer system for at least a year. We will use the latest data available from each database at the time of the analysis. We will identify cohorts from each database which will include patients registered with practices on or after 01 Jan 1998 until the latest date for which data are available at the time of the study.

We will use the GP data linked to hospital episode statistics, cancer registry and mortality.

1. GP Data
demographics including age, sex, ethnicity, deprivation, region
clinical diagnoses - major chronic diseases e.g. diabetes, cardiovascular disease, thromboembolism, cancer, fracture, haemorrhage
clinical values e.g. body mass index, smoking, alcohol
laboratory investigations e.g. full blood count, electrolytes, liver function tests, CA125
commonly prescribed medication

2. HES Data
HES data to identify outcomes of interest e.g. diabetes, cardiovascular disease, thromboembolism, cancer, fracture, haemorrhage

3. Mortality Data
Mortality data to identify outcomes of interest on the death certificate e.g. diabetes, cardiovascular disease, thromboembolism, cancer, fracture, haemorrhage and cancer treatments.

4. Cancer Registry Data
Cancer registry data to identify characteristics of cancers e.g. type, location, stage, grade, route to diagnosis, treatments (e.g. chemotherapy, radiotherapy, hormonal, surgery)

Funding Source

John Fell Fund

Public Benefit Statement

Research Team

Julia Hippisley-Cox, University of Oxford

Carol AC Coupland, University of Oxford

Mona Bafadhel, King’s College London

Richard EK Russell, King’s College London

Aziz Sheikh, University of Edinburgh

Peter Brindle, University of Bristol

Keith M. Channon, University of Oxford

Access Type

Trusted Research Environment (TRE)
