Utilizing Big Data for Clinical Trials: Opportunities and Challenges

More and more people believe that big data could have a great impact on future clinical development. A vast amount of health-related data is available, both publicly and privately. In the public domain, CTGOV (ClinicalTrials.gov), SEER (NIH) and Sentinel (FDA) are major sources of clinical trial, cancer registry and drug safety data, respectively. CTGOV contains information on trial design, inclusion/exclusion criteria, safety updates or summaries of study results for nearly 300,000 registered clinical trials. SEER collects cancer incidence data on patient demographics, primary tumor site, tumor morphology, stage at diagnosis, and first course of treatment, and follows up with patients for vital status. It covers approximately 34.6% of the U.S. population. Sentinel is the FDA’s national electronic system for monitoring the safety of FDA-regulated medical products, including drugs, vaccines, biologics, and medical devices. It contains 300 million person-years of high-quality, unduplicated, curated data.

On September 13, 2017, the Duke-Margolis Center for Health Policy held a symposium on Building a Framework for Regulatory Use of Real-World Evidence. At this event, experts and FDA speakers discussed broad topics surrounding big healthcare data and real-world evidence (RWE). On October 1, 2018, the center held a second symposium on this topic.

On September 4, 2018, FDA Commissioner Scott Gottlieb, MD, posted a blog entry expressing his view on using cost-effective strategies and big data to improve clinical trial efficiency and to accelerate medical product development and innovation in artificial intelligence. His view unquestionably reflects the FDA’s position on big data for future clinical trials.

These events clearly indicate the trend and the changes emerging in the regulatory landscape.

Let us return to the question of how to use big data for drug development and clinical trials. To address it, let us review some basic concepts of clinical research and some characteristics of big data.

There are two categories of clinical research: interventional and observational. Most drug trials are interventional: the objective is to evaluate whether a test drug is superior to the standard of care (SOC) in terms of safety and efficacy, and a randomized experimental design is commonly used. This is the so-called Randomized Controlled Trial (RCT). The purpose of this design is to ensure comparability and to reduce assessment bias. Fundamental results in probability theory support the RCT, such as the Central Limit Theorem (CLT), which says that the mean of a sufficiently large number of independent, identically distributed measurements is approximately normally distributed. Thus, one can calculate the probability of a false positive if the test drug were actually the same as the SOC in terms of treatment effect. The purpose of the strict inclusion and exclusion criteria in an RCT is to make the study population as homogeneous as possible so that the CLT can be applied. That is why an RCT is also called a controlled clinical trial.
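The false positive calculation can be illustrated with a small simulation. The following is a minimal sketch, not a procedure taken from any actual trial: both arms are drawn from the same skewed distribution (so the "test drug" is truly identical to the SOC), and a two-sample z-test based on the normal approximation is applied. The arm size, number of simulated trials, exponential outcome distribution and the 1.96 critical value are all assumptions chosen for illustration.

```python
import math
import random
import statistics


def two_sample_z(a, b):
    """Z statistic for a difference in means, using the CLT normal approximation."""
    se = math.sqrt(statistics.variance(a) / len(a) + statistics.variance(b) / len(b))
    return (statistics.fmean(a) - statistics.fmean(b)) / se


def false_positive_rate(n_per_arm=100, n_trials=2000, z_crit=1.96, seed=0):
    """Fraction of simulated null trials that wrongly declare a difference."""
    rng = random.Random(seed)
    rejections = 0
    for _ in range(n_trials):
        # Hypothetical outcomes: both arms share one distribution (no true effect).
        test_arm = [rng.expovariate(1.0) for _ in range(n_per_arm)]
        soc_arm = [rng.expovariate(1.0) for _ in range(n_per_arm)]
        if abs(two_sample_z(test_arm, soc_arm)) > z_crit:
            rejections += 1
    return rejections / n_trials


rate = false_positive_rate()
print(f"Simulated false positive rate: {rate:.3f}")
```

Even though individual outcomes here are far from normal, the simulated rate should land close to the nominal two-sided 5% level, which is the practical consequence of the CLT for arm means.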

Most big data in the health and medical fields are observational, collected routinely from different sources such as Electronic Health Records (EHR), Electronic Medical Records (EMR), insurance claims, and government registries (e.g. CTGOV, SEER). These databases contain rich information on diseases and treatments in real-world settings. Unlike in controlled clinical trials, many classical statistical methods cannot be applied directly to such data.

There are two types of big data. The first is static, such as SEER, which contains cancer incidence data from 1973 to 2015. Static data are relatively invariant over time and may be suitable for use as a historical reference. Several applications can be considered:

Guide clinical strategy
Optimize study design
Identify the right patient population
Identify prognostic factors
Identify drug safety signals
Predict study success rate for current and/or future trials

The other type of big data is dynamic: new data are generated continuously, for example from wearable devices (e.g. continuous glucose monitors) or spontaneous safety reports. This type of data may be useful for trials with prospective objectives. Several applications can be considered:

Determine the better treatment for different patient populations
Generate real-world evidence for marketed drugs
Select an external, non-randomized comparison arm for a study, for example a suitable standard-of-care group
Estimate incidence dynamically (e.g. signal detection in pharmacovigilance, cancer survival rates)
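As an illustration of dynamic signal detection, the sketch below updates a proportional reporting ratio (PRR) as new batches of spontaneous reports arrive. PRR is one common disproportionality measure and is used here only as an example; the article does not prescribe a particular method. The `Counts` class and all report counts are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Counts:
    a: int = 0  # drug of interest, event of interest
    b: int = 0  # drug of interest, all other events
    c: int = 0  # all other drugs, event of interest
    d: int = 0  # all other drugs, all other events

    def add(self, other):
        """Accumulate a new batch of reports into the running totals."""
        self.a += other.a
        self.b += other.b
        self.c += other.c
        self.d += other.d

    def prr(self):
        """Share of the event among the drug's reports, relative to other drugs."""
        return (self.a / (self.a + self.b)) / (self.c / (self.c + self.d))


# Hypothetical report batches arriving over time.
batches = [Counts(5, 495, 100, 9900), Counts(25, 475, 100, 9900)]

running = Counts()
history = []
for batch in batches:
    running.add(batch)
    history.append(running.prr())

print([f"{value:.2f}" for value in history])
```

In this made-up example the ratio rises from about 1 to about 3 after the second batch, the kind of shift that would prompt a closer review rather than a conclusion on its own.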

What are the challenges of using big data in clinical trials? The following are some that may be faced:

Because the data are uncontrolled, confounding factors may make interpretation difficult
Unstructured and/or unformatted data may make analysis difficult
Incomplete data may make statistical modeling difficult
Non-homogeneous patient populations may make inferential testing (such as p-values and statistical power) difficult
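The first challenge, confounding, can be made concrete with a small simulation. This is a sketch under invented assumptions: disease severity affects both who receives the drug and the outcome, the true drug effect is set to +1.0, and all probabilities and effect sizes are hypothetical. A naive comparison of treated and untreated patients is contrasted with a comparison stratified by severity.

```python
import random
import statistics


def simulate(n=20000, seed=1):
    """Generate hypothetical observational records: (severe, treated, outcome)."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        severe = rng.random() < 0.5  # confounder: disease severity
        # No randomization: severe patients are more likely to get the drug.
        treated = rng.random() < (0.8 if severe else 0.2)
        # Assumed true drug effect is +1.0; severity lowers the outcome by 3.0.
        outcome = 1.0 * treated - 3.0 * severe + rng.gauss(0, 1)
        rows.append((severe, treated, outcome))
    return rows


def mean_diff(rows):
    """Mean outcome of treated patients minus mean outcome of untreated patients."""
    treated = [y for _, t, y in rows if t]
    control = [y for _, t, y in rows if not t]
    return statistics.fmean(treated) - statistics.fmean(control)


rows = simulate()
naive = mean_diff(rows)
# Compare within each severity stratum, then average the two stratum estimates.
stratified = statistics.fmean(
    [mean_diff([r for r in rows if r[0] == s]) for s in (True, False)]
)

print(f"Naive estimate:      {naive:+.2f}")
print(f"Stratified estimate: {stratified:+.2f}")
```

Under these assumptions the naive estimate comes out negative, suggesting a beneficial drug is harmful, while the stratified estimate is close to the assumed effect of +1.0. In real data the confounders are often unmeasured or only partly recorded, which is what makes the problem hard.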

Although we believe that big data will have a great impact on future clinical trials, there are still many hurdles along the way.

About the author

Tai Xie, President and CEO, Brightech

Dr. Xie has 23 years of pharmaceutical experience in line management and statistical analysis for Phase I-IV clinical trials, as well as integrated summaries of safety (ISS) and efficacy (ISE) for new drug applications (NDA) in various therapeutic areas, especially oncology.