San Francisco, California (UroToday.com) Professor Sydes gave an overview of when clinical impact outweighs statistical significance.
The P value, which is used to define statistical significance, is a statistical summary of the compatibility between the observed data and what we would expect to see if the entire statistical model were correct. The entire model means not only the null hypothesis but also every other assumption used to compute the P value. These assumptions are usually not well characterized or documented in the manuscripts that we read.
There are some common misinterpretations of single P values that need to be addressed:
- The P value for the null hypothesis is not the probability that chance alone produced the observed association.
- A P value above 0.05 does not mean that no effect was observed, or that the absence of an effect was demonstrated.
- Statistical significance does not indicate that a scientifically or substantively important relation has been detected.
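The last two points can be illustrated with a small calculation. The sketch below is not from the talk: it applies a standard two-sided z-test for a difference in proportions to two hypothetical trials whose event counts are invented purely for illustration. The function name `two_proportion_z_test` is likewise an assumption.

```python
import math

def two_proportion_z_test(events_a, n_a, events_b, n_b):
    """Two-sided z-test for a difference in proportions (pooled standard error)."""
    p_a = events_a / n_a
    p_b = events_b / n_b
    pooled = (events_a + events_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_a - p_b) / se
    # Two-sided tail area of the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return p_a - p_b, z, p_value

# Hypothetical very large trial: 20.0% vs 19.5% event rates, 100,000 patients per arm.
large_diff, large_z, large_p = two_proportion_z_test(20_000, 100_000, 19_500, 100_000)

# Hypothetical small trial: 30% vs 20% event rates, 100 patients per arm.
small_diff, small_z, small_p = two_proportion_z_test(30, 100, 20, 100)

print(f"large trial: difference {large_diff:.3f}, P = {large_p:.4f}")
print(f"small trial: difference {small_diff:.3f}, P = {small_p:.4f}")
```

With these made-up numbers the large trial is statistically significant (P roughly 0.005) for a difference of half a percentage point, while the small trial is not (P roughly 0.10) despite a difference of ten percentage points. Neither P value says anything about whether the difference matters to patients.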
Professor Sydes recommended an excellent paper by Greenland et al. that describes 25 ways to misinterpret the P value¹ and that scientists should read. There has been excessive focus on P values in past and current literature. Instead, we should focus more on clinical relevance than on statistical significance. We should look for a meaningful impact on outcomes for patients, examining both efficacy and adverse effects. Figure 1 is an excellent summary of how we should examine the data with regard to clinical relevance and statistical significance.
Figure 1 - statistical significance and clinical irrelevance
The effect is the estimate of the impact of the treatment being tested. Many primary outcomes in cancer trials use time-to-event data, such as overall survival and progression-free survival. Time-to-event data are challenging because follow-up is incomplete for some patients, whose event times are therefore censored.
It is important to understand the Kaplan-Meier curves that are used in survival analyses (Figures 2 and 3). These curves describe all the patients being analyzed, each of whom eventually either has an event or is censored, for example when lost to follow-up. Each step down in the graph corresponds to one or more events of interest.
Figures 2 and 3 – Understanding Kaplan-Meier curves:
The numbers beneath the X axis describe the number of patients at risk at each time point, meaning the number of patients who got this far without having an event (Figure 4). The numbers next to them, inside the brackets, represent the patients who had an event between the various time points (Figure 5).
Figure 4 – Number of patients at risk in Kaplan-Meier curve:
Figure 5 – Number of patients with events between the time points:
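To make the steps and the numbers at risk concrete, here is a minimal sketch of the Kaplan-Meier product-limit estimate written from scratch. It is for illustration only and was not part of the presentation. The follow-up times and event indicators are invented, and the function name `kaplan_meier` is an assumption. Patients censored at a given time are treated as still at risk at that time, which is the usual convention.

```python
from collections import Counter

def kaplan_meier(times, observed):
    """Return (time, number_at_risk, events, survival) for each distinct event time.

    times:    follow-up time for each patient
    observed: True if the patient had the event, False if censored
    """
    events = Counter(t for t, e in zip(times, observed) if e)
    exits = Counter(times)   # every patient leaves the risk set at their own time
    at_risk = len(times)
    survival = 1.0
    rows = []
    for t in sorted(exits):
        d = events.get(t, 0)
        if d:
            # The curve only steps down at times where an event occurred.
            survival *= 1 - d / at_risk
            rows.append((t, at_risk, d, survival))
        # Events and censored patients both reduce the number at risk afterwards.
        at_risk -= exits[t]
    return rows

# Hypothetical follow-up (months) for ten patients; False marks a censored patient.
times = [2, 3, 3, 5, 6, 8, 9, 12, 12, 15]
observed = [True, True, False, True, False, True, True, False, True, False]

rows = kaplan_meier(times, observed)
for t, n, d, s in rows:
    print(f"t={t:>2}  at risk={n:>2}  events={d}  S(t)={s:.3f}")
```

Censored patients never produce a step, but they shrink the number at risk, so later events produce larger steps. This is why the right-hand tail of a curve, where few patients remain at risk, should be read with caution.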
The log-rank test is a hypothesis test that compares the survival distributions of two samples. The hazard rate is the rate at which events occur over time in each group, and the hazard ratio is the ratio of these hazards. The hazard ratio is usually derived from a Cox model, adjusting for important baseline characteristics. Confidence intervals are calculated with the same information as P values, so we need to read them with the same caution. The absolute effect expresses the absolute difference between the groups at a pre-specified time point. This is better calculated using the hazard ratio than by reading differences off the Kaplan-Meier curves. Lastly, the number needed to treat is calculated as 1 divided by the absolute risk reduction.
Professor Sydes continued by discussing what drives sample size calculations. The sample size depends on the hazard ratio expected between the two groups, the significance level used, and the power, as can be seen in Figure 6.
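As a worked illustration of the last two calculations, the sketch below derives an absolute difference at a fixed time point from a hazard ratio and then the number needed to treat. It assumes proportional hazards, under which the treated-arm survival equals the control-arm survival raised to the power of the hazard ratio. That assumption, the function name `absolute_benefit`, and the input values are ours and were not given in the talk.

```python
import math

def absolute_benefit(control_survival, hazard_ratio):
    """Absolute survival difference and NNT at one time point.

    Assumes proportional hazards, so that S_treated(t) = S_control(t) ** HR.
    """
    treated_survival = control_survival ** hazard_ratio
    # A gain in survival is an equal reduction in the risk of the event.
    arr = treated_survival - control_survival
    if arr <= 0:
        raise ValueError("no absolute benefit, so the NNT is undefined")
    return treated_survival, arr, 1 / arr

# Hypothetical inputs: 50% control-arm survival at the chosen time point, HR 0.75.
treated, arr, nnt = absolute_benefit(control_survival=0.50, hazard_ratio=0.75)
print(f"treated survival {treated:.3f}, absolute difference {arr:.3f}, "
      f"NNT {math.ceil(nnt)}")
```

With these illustrative inputs the treated-arm survival is about 59%, the absolute difference is about 9 percentage points, and the number needed to treat rounds up to 11. The same hazard ratio applied to a control-arm survival of 90% would give a much smaller absolute difference, which is why the relative and absolute effects should be read together.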
Figure 6 – What drives sample size calculation:
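How these three inputs interact can be sketched with Schoenfeld's approximation for the number of events required by a two-arm trial with 1:1 allocation and a two-sided test. The choice of this formula is an assumption made for illustration, since the talk summary does not say which method was shown. The function name `required_events` and the example hazard ratios are also hypothetical.

```python
import math
from statistics import NormalDist

def required_events(hazard_ratio, alpha=0.05, power=0.80):
    """Approximate total number of events needed (Schoenfeld, 1:1 allocation)."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)   # two-sided significance level
    z_beta = NormalDist().inv_cdf(power)
    return math.ceil(4 * (z_alpha + z_beta) ** 2 / math.log(hazard_ratio) ** 2)

# Illustrative targets: a smaller targeted effect or higher power needs more events.
for hr in (0.65, 0.75, 0.85):
    for power in (0.80, 0.90):
        print(f"HR {hr:.2f}, power {power:.0%}: {required_events(hr, power=power)} events")
```

Note that the formula gives the number of events, not the number of patients. The number of patients then depends on the expected event rate, the accrual period, and the length of follow-up.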
Concluding his talk, Professor Sydes emphasized some important questions to ask when reading a manuscript. We need to understand whether the study is a randomized controlled trial or a meta-analysis. It is important to understand how much uncertainty surrounds the estimates, how the observed effect size compares with the targeted effect, and whether the observed effect is plausible. Lastly, we need to ascertain whether the findings are consistent across the outcome measures.
In summary, P values are widely misunderstood. We need to consider the clinical relevance of the treatment effect that is shown. Confidence intervals have limitations too, and this needs to be remembered. Lastly, when subgroup analyses are reported, they are not powered like the main analysis.
Presented by: Matthew R. Sydes, MSc, CStat, CSci, University College London
Written by: Hanan Goldberg, MD, Urology Department, SUNY Upstate Medical University, Syracuse, NY, USA @GoldbergHanan at the 2020 Genitourinary Cancers Symposium, ASCO GU #GU20, February 13-15, 2020, San Francisco, California
1. Greenland S, Senn SJ, Rothman KJ, et al. Statistical tests, P values, confidence intervals, and power: a guide to misinterpretations. European Journal of Epidemiology 2016; 31(4): 337-350.