A Subgroup Analysis of mCRPC Clinical Trials Presentation - Susan Halabi
September 24, 2019
Susan Halabi, Ph.D., Professor and Chief of the Division of Biostatistics and Bioinformatics, School of Medicine, and Member of the Duke Cancer Institute, Duke University Medical Center, Durham, North Carolina, USA.
Written Coverage: APCCC 2019: Subgroup Analysis of mCRPC Trials
Susan Halabi: Thank you. I would like to thank Silke and Aurelius for inviting me again to speak on subgroup analysis. I promise you I'm not going to show any equations today. We all know that the general assumption in a clinical trial is that the hypothesis, the primary hypothesis that we're testing, usually addresses the overall, average treatment effect in the study population. This hypothesis obviously is specified a priori, and normally there is no assumption of homogeneity of effect across subgroups.
What's the challenge here? The challenge in personalized medicine is this. First, there is a danger in using subgroup analysis to decide how you treat patients with a specific treatment. But we are also in conflict, because how do we apply the overall results of large clinical trials to individual patients in the clinic without taking into account their determinants? I don't really have the answer to that challenge, because it's obviously going to depend on the design, but what we are all aware of and recognize is that subgroup analysis is pervasive in clinical trials. We've seen it in positive and in negative trials.
Even though the title of the session is mCRPC, I'm going to draw an example from a recent positive trial, ENZAMET. In this trial, patients who were treated with enza had prolonged survival compared to standard of care. This is the usual forest plot that we see in most Phase III trials. As you can see here, we have the overall effect from the trial and we have several subgroups where we're testing the effect of the treatment. Now on the right-hand side you have two columns: you have the P value for interaction and the adjusted P value. What I want to draw your attention to is the analysis based on early docetaxel plans, yes and no. The question to you is how do we interpret the data? How do we apply this to the clinic? Should we give patients docetaxel? Does docetaxel actually work with enza? I'll come back to the answer later.
This is an example of a negative trial where patients were randomized to PROSTVAC, either PROSTVAC alone or a combination of PROSTVAC and another drug, or to placebo. This trial was negative. Again, a subgroup analysis was done. I want to draw your attention to the P value for the interaction of age by treatment in this trial. The question here is, even though the trial is negative, would you give the drug to patients who are less than 71 years of age? We'll come back to the answer at the end.
I'm going to spend the next several slides talking about statistical sins, or you may think of them as statistical toxicity, although I don't think there is toxicity associated with statistics. I would like to demonstrate these important concepts, and I'm going to focus on only three that are related to subgroup analysis. I'm using an important quote from Sir Richard Peto, who said, "Subgroup analysis is a machine for producing false-negative and false-positive results." You are already cautioned here. Beware of any subgroup analysis.
The first sin, or the first statistical mistake, arises when conducting subgroup analysis: the type I error rate is really a function of the number of mutually exclusive analyses that have been done. So if, for example, you have five subgroups and you are reporting P values on them, the probability that at least one significant result will occur is 23%. When the number of subgroups increases, let's say to 10, that probability goes up to 40%. Let's say you have 20 subgroups that you're comparing in a trial: you will have over 63% probability that at least one comparison is statistically significant.
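The figures quoted above follow from a simple calculation. The sketch below is a minimal illustration, assuming each subgroup test is independent, run at a two-sided alpha of 0.05, and that there is no true effect in any subgroup; the function name `familywise_error` is illustrative and not from the talk.

```python
def familywise_error(k: int, alpha: float = 0.05) -> float:
    """Probability of at least one false-positive result among k independent tests.

    Assumes every test is run at the same alpha and no true effect exists.
    """
    return 1.0 - (1.0 - alpha) ** k


# Number of subgroups mentioned in the talk: 5, 10, and 20
for k in (5, 10, 20):
    print(f"{k:>2} subgroups: {familywise_error(k):.1%}")
```

Under these assumptions the probabilities come out at roughly 23%, 40%, and 64%. Real subgroups overlap and are correlated, so the independence assumption gives an approximation rather than an exact rate.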
Going to the ENZAMET trial, even though the investigators reported P values for interaction and adjusted for multiplicity, which obviously is important, my preference would have been to not report the P values. I think just looking at the hazard ratios and the 95% confidence intervals would have made a stronger statement. Now, some may have thought, well, it may be better to report the P values for the stratification variables, which were volume of disease and early docetaxel, among others. But again, my preference is not to report that.
The second common mistake is that people think it's only a multiplicity issue, but power really is at stake here too, so don't be misled. In this table, what I'm showing you is the power associated with the ratio of the number of events in the subgroup to the total number of events. As a general rule of thumb, I would say if you have at least 75% of your events in that subgroup, you may be okay. You may have adequate power. But still, I would refrain from doing any subgroup analysis for subgroups where you have less than 50% of the total events.
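To see why power drops as a subgroup holds fewer of the events, the following is a rough sketch, not a reproduction of the table shown in the talk. It assumes the full trial was powered at 80% with a two-sided alpha of 0.05, that the subgroup has the same true hazard ratio as the overall population, and that the log-rank test statistic scales with the square root of the number of events (the Schoenfeld approximation). The function name `subgroup_power` and its defaults are illustrative.

```python
from statistics import NormalDist


def subgroup_power(event_fraction: float,
                   overall_power: float = 0.80,
                   alpha: float = 0.05) -> float:
    """Approximate power to detect the trial's design effect within a subgroup.

    event_fraction: share of the trial's total events that fall in the subgroup.
    Assumption: the expected test statistic shrinks by sqrt(event_fraction).
    The negligible probability of rejecting in the wrong direction is ignored.
    """
    norm = NormalDist()
    z_alpha = norm.inv_cdf(1 - alpha / 2)   # critical value, two-sided test
    z_beta = norm.inv_cdf(overall_power)    # quantile for the design power
    return norm.cdf((event_fraction ** 0.5) * (z_alpha + z_beta) - z_alpha)


for fraction in (1.0, 0.75, 0.50, 0.25):
    print(f"{fraction:.0%} of events: power about {subgroup_power(fraction):.0%}")
```

Under these assumptions, a subgroup with half of the events has only about a coin-flip chance of detecting the effect the whole trial was designed for, and the values in any given trial will depend on its own design parameters.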
A third very common mistake is that people, when they do subgroup analysis, often look at the treatment effect within each group as opposed to the overall interaction. This is an analysis from the ALSYMPCA trial, where the purpose was not to report the interaction; the purpose of the investigators was to look at the effect of radium in patients who had prior docetaxel use or no previous docetaxel use. While they may have had reasons from the regulatory perspective for reporting the data in such a manner, my preference would have been to report one graph and one P value for interaction. I think this is a more appropriate way.
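The difference between within-subgroup effects and a test of interaction can be illustrated with a small sketch. It assumes two non-overlapping subgroups, each summarized by a hazard ratio and a 95% confidence interval, and compares them on the log scale with a normal approximation. The numbers are hypothetical and not taken from ALSYMPCA or any other trial; the function names are illustrative.

```python
import math
from statistics import NormalDist


def se_from_ci(lower: float, upper: float, level: float = 0.95) -> float:
    """Standard error of a log hazard ratio, recovered from its confidence interval."""
    z = NormalDist().inv_cdf(0.5 + level / 2)
    return (math.log(upper) - math.log(lower)) / (2 * z)


def interaction_test(hr_a, ci_a, hr_b, ci_b):
    """Compare two independent subgroup hazard ratios on the log scale.

    Returns (ratio of hazard ratios, z statistic, two-sided P value).
    """
    diff = math.log(hr_a) - math.log(hr_b)
    se = math.hypot(se_from_ci(*ci_a), se_from_ci(*ci_b))
    z = diff / se
    p = 2 * (1 - NormalDist().cdf(abs(z)))
    return math.exp(diff), z, p


# Hypothetical subgroups: A looks "significant" (CI excludes 1), B does not
ratio, z, p = interaction_test(0.70, (0.55, 0.89), 0.80, (0.58, 1.10))
print(f"ratio of HRs = {ratio:.2f}, z = {z:.2f}, interaction P = {p:.2f}")
```

With these made-up inputs, one subgroup's interval excludes 1 and the other's does not, yet the interaction P value is far from significant. That is the pattern the single interaction test is meant to expose: two separate within-group results do not by themselves show that the treatment effect differs between groups.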
Having said that, what are some safeguards for you, and how can you prevent misinterpreting or over-interpreting the data? There are several criteria that have been published by Sun et al. and others for looking at the credibility of a subgroup. First of all, we can never underplay the role of chance. Chance and coincidence are more common than we tend to think. We have to know whether the treatment effect is consistent. Is there strong preexisting biological support? Then the last point has to do with evidence supporting the effect based on within- and between-study comparisons. Thank you. Matt Sydes already talked about that this morning and showed us how difficult it is to compare not only within a study but across studies.
Going back to the ENZAMET trial, the question is, do you believe in that docetaxel subgroup? Does enza work in patients who have received docetaxel? Are we overinterpreting the data? Of course this is in the eye of the beholder. For me, I would look at the data and place more emphasis on the overall trial results than on the subgroup itself. Then the same thing for the PROSTVAC trial. I would spend more time and emphasis on the negative results than on the subgroup.
I created this, something that I call, in my opinion, levels of evidence. Obviously, if a subgroup is important, one may consider it as a stratification variable in the randomization. But at the base of the pyramid, the level one evidence, is what I will call an a priori designed trial, where you're looking at treatment-by-subgroup interaction. The next level is a pre-specified subgroup, and then obviously the worst is the post-hoc analysis, because there is no evidence at all for that. In terms of evidence from post-hoc analysis, I would say it's very good for data generation but no more than that.
As I said, there are safeguards in terms of the design and the analysis. You have to have a clear description of the hypothesis, and not only a clear description but a direction. You want to have a statistical test of treatment-by-subgroup interaction. Then you want to place more emphasis on the overall result than on the subgroup itself. Then when you're looking at homogeneity, you want to look at the test of treatment-by-subgroup interaction rather than at the treatment effect within each subgroup. You want to interpret the results of the trial in the context of other trials, based on principles of biological rationale and coherence.
To conclude, I would say the best statistical design obviously answers the primary question, and the primary question is almost always testing one drug versus another in the overall population. You want to make sure it's feasible. Planning is key. I definitely recommend that all of you avoid the statistical sins. Obviously there is a reason why people don't do subgroup trials: these trials are much, much larger than regular trials. Meta-analysis plays a critical role.
A final note: rather than reporting isolated P values, articles should include and emphasize the effect sizes and 95% confidence intervals. I'm sure you're all aware of the New England Journal of Medicine's new guidelines for statistical reporting in the journal. Thank you.