Reproducibility of Statistically Significant Phase III Oncology Trials: An in Silico Meta-Epidemiological Analysis

Phase III randomized trials are the gold standard for evaluating comparative efficacy and require considerable resources. In other fields of clinical medicine, regulatory approval is commonly granted only after at least two phase III trials have shown comparative efficacy for the new treatment.

In oncology, however, this is rarely the case, especially in light of the substantial number of patient subgroups with heterogeneous biology being examined, as well as the challenge of equipoise. In the setting of a positive trial in oncology, it is often difficult to ask patients as well as physicians to entertain another randomization, especially when outcomes are poor.

Thus, a key unanswered question in late-phase oncology research is whether the statistically significant effects that we observe in phase III trials, and interpret as indicative of therapeutic superiority, would actually be replicated if a similar trial were conducted. This question stems from the foundational scientific notion that experiments should be replicated to increase our certainty in the observed results, and that statistical inferences aim to estimate long-run effects.

Knowing that replication is unlikely to be attempted for many phase III trials in the near future, we applied an existing statistical methodology to estimate replication probabilities for effect sizes observed at different P values. The foundation of this methodology was developed over a series of prior papers, which demonstrated that mixture models, fit to the distribution of observed treatment effects for a collection of similar trials, can be leveraged to understand important properties of the true treatment effects that cannot otherwise be observed for any individual trial.¹,²

Using a dataset of 632 phase III oncology trials published in the modern era and enrolling nearly half a million patients, we fit a model to the distribution of the observed treatment effects to estimate several metrics. For trials where the original P value was 0.05, we estimated that the average probability of successfully replicating those trials was approximately 43%. Replication probability was still only 60% for trials where the original P value was 0.01 and 77% for trials where the original P value was 0.001.

These findings raise two key points: 1) statistically significant P values at the standard 0.05 threshold should not be conflated with high probabilities of successful replication, and 2) attributing the reproducibility crisis in science to poor study design, fraud, or similar causes may overlook the inherently fragile nature of P value dichotomization at the conventional thresholds, which, as shown above, do not correspond to high probabilities of successful replication.

While these replication probabilities may seem concerning, given that statistically significant phase III oncology trials are used to justify standard-of-care treatment, there is a silver lining. We also found that the probability of the correct direction of treatment effects was high for trials with significant P values, even those with a P value of 0.05. This provides some reassurance that our current approaches to significance interpretation follow the principle of “do no harm,” in that it is very unlikely that the true effect lies in the opposite direction when a trial meets current P value significance thresholds.
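To make the two quantities discussed above more concrete, the following is a minimal sketch of how a replication probability and a probability of correct direction could be computed from a fitted mixture model. It rests on several assumptions that are not taken from the study: the signal-to-noise ratio (true effect divided by its standard error) is modeled as a mixture of zero-mean normal distributions, "successful replication" is taken to mean a two-sided significant result in the same direction from an exact replication with the same standard error, and the mixture weights and standard deviations in `ILLUSTRATIVE_MIXTURE` are hypothetical placeholders rather than the values fitted to the 632 trials. The printed numbers therefore will not match the estimates reported above.

```python
from statistics import NormalDist

STD_NORMAL = NormalDist()

# HYPOTHETICAL mixture for the signal-to-noise ratio (true effect / standard error):
# zero-mean normal components given as (weight, standard deviation).
# Placeholder values for illustration only, NOT the fit reported in the article.
ILLUSTRATIVE_MIXTURE = [(0.50, 0.8), (0.35, 2.0), (0.15, 4.0)]


def z_from_p(p_two_sided):
    """Absolute z statistic corresponding to a two-sided P value."""
    return STD_NORMAL.inv_cdf(1.0 - p_two_sided / 2.0)


def posterior_components(z, mixture):
    """Posterior of the signal-to-noise ratio given the observed z statistic.

    Returns (weight, mean, variance) triples, one per mixture component.
    """
    raw = []
    for weight, sd in mixture:
        prior_var = sd * sd
        # Within a component, the observed z is marginally N(0, prior_var + 1).
        marginal = NormalDist(0.0, (prior_var + 1.0) ** 0.5).pdf(z)
        shrink = prior_var / (prior_var + 1.0)
        # Normal-normal conjugacy: posterior mean shrink * z, variance shrink.
        raw.append((weight * marginal, shrink * z, shrink))
    total = sum(w for w, _, _ in raw)
    return [(w / total, m, v) for w, m, v in raw]


def replication_probability(z, mixture, alpha=0.05):
    """P(replicate is significant at two-sided alpha, same direction) given z."""
    z = abs(z)  # the model is symmetric around zero
    z_crit = STD_NORMAL.inv_cdf(1.0 - alpha / 2.0)
    prob = 0.0
    for w, m, v in posterior_components(z, mixture):
        # Replicate z = signal-to-noise ratio + fresh N(0, 1) sampling noise.
        prob += w * (1.0 - NormalDist(m, (v + 1.0) ** 0.5).cdf(z_crit))
    return prob


def correct_sign_probability(z, mixture):
    """P(true effect has the same sign as the observed effect) given z."""
    z = abs(z)
    return sum(
        w * (1.0 - NormalDist(m, v ** 0.5).cdf(0.0))
        for w, m, v in posterior_components(z, mixture)
    )


if __name__ == "__main__":
    for p in (0.05, 0.01, 0.001):
        z = z_from_p(p)
        rep = replication_probability(z, ILLUSTRATIVE_MIXTURE)
        sign = correct_sign_probability(z, ILLUSTRATIVE_MIXTURE)
        print(f"P = {p}: replication {rep:.2f}, correct direction {sign:.2f}")
```

One design point worth noting: in this sketch every posterior mean is shrunk toward zero relative to the observed z statistic, so at an original P value of exactly 0.05 the computed replication probability stays below 50% for any choice of mixture. The probability of the correct direction depends only on the posterior of the true effect, not on fresh sampling noise, which is why it can remain high even when the replication probability is modest.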

Because these concepts are so important for understanding new phase III trial results, we have created a web interface where these quantities can be estimated by anyone for phase III trials of interest: https://alexandersherry.shinyapps.io/shinyapp_replication_probability/.

We encourage oncologists and other stakeholders in the oncology community to use this tool and to reflect on the probabilities of replication and treatment effect directionality when interpreting new studies, to help guide the conversation beyond the P < 0.05 ritual.

Written by: Alexander D. Sherry,¹ Pavlos Msaouel,² Avital M. Miller,³ Timothy A. Lin,³ Joseph Abi Jaoude,⁴ Ramez Kouzy,³ Adina H. Passy,³ Tomer Meirson,⁵ Nikolaos Ignatiadis,⁶ Zachary R. McCaw,⁷ Erik van Zwet,⁸ Ethan B. Ludmir⁹

¹ Department of Radiation Oncology, Division of Radiation Oncology, The University of Texas MD Anderson Cancer Center, Houston, TX, USA; Department of Radiation Oncology, Mayo Clinic, Rochester, MN, USA.
² Department of Genitourinary Medical Oncology, Division of Cancer Medicine, The University of Texas MD Anderson Cancer Center, Houston, TX, USA; Department of Translational Molecular Pathology, The University of Texas MD Anderson Cancer Center, Houston, TX, USA.
³ Department of Radiation Oncology, Division of Radiation Oncology, The University of Texas MD Anderson Cancer Center, Houston, TX, USA.
⁴ Department of Radiation Oncology, Stanford University, Stanford, CA, USA.
⁵ Davidoff Cancer Center, Rabin Medical Center-Beilinson Hospital, Petach Tikva, Israel.
⁶ Department of Statistics and Data Science Institute, University of Chicago, Chicago, IL, USA.
⁷ Insitro, South San Francisco, CA, USA; Department of Biostatistics, University of North Carolina at Chapel Hill, Chapel Hill, NC, USA.
⁸ Department of Biomedical Data Sciences, Leiden University Medical Center, Leiden, the Netherlands.
⁹ Department of Gastrointestinal Radiation Oncology, Division of Radiation Oncology, The University of Texas MD Anderson Cancer Center, Houston, TX, USA; Department of Biostatistics, The University of Texas MD Anderson Cancer Center, Houston, TX, USA.

References:

1. van Zwet E, Gelman A, Greenland S, Imbens G, Schwab S, Goodman SN. A New Look at P Values for Randomized Clinical Trials. NEJM Evidence 2024; 3(1): EVIDoa2300003.
2. Yang Y, van Zwet E, Ignatiadis N, Nakagawa S. A large-scale in silico replication of ecological and evolutionary studies. Nat Ecol Evol 2024; 8(12): 2179-83.
