Power and Replication Rates in Randomized Cancer Trials - Pavlos Msaouel

August 12, 2025

Pavlos Msaouel discusses reproducibility of phase III randomized trials in oncology with Ashish Kamat. An earlier analysis of 450 trials found a median actual power of 49%, and the updated dataset of 632 trials enrolling nearly 500,000 patients showed only a 43% replication probability for trials achieving p<0.05. KEYNOTE-564 in adjuvant kidney cancer succeeded by design, with 95% design power and narrow confidence intervals versus 90% or less in the other adjuvant immunotherapy trials. The research team developed free web calculators to estimate replication probabilities from published trial data.

Biographies:

Pavlos Msaouel, MD, PhD, Assistant Professor, Genitourinary Medical Oncology, Translational Molecular Pathology, University of Texas, MD Anderson Cancer Center, Houston, TX

Ashish Kamat, MD, MBBS, Professor of Urology and Wayne B. Duddleston Professor of Cancer Research, University of Texas, MD Anderson Cancer Center, Houston, TX


Read the Full Video Transcript

Ashish Kamat: Hello everybody and welcome to UroToday's Bladder Cancer Center of Excellence. I'm Ashish Kamat, urologic oncologist. And it's a pleasure to welcome to the forum, Dr. Pavlos Msaouel, who is one of my colleagues from MD Anderson right here in Houston. And even though this is the Bladder Cancer Center of Excellence, we've invited Dr. Msaouel to talk about something that really crosses all disease states, not just urologic oncology but for anybody that's doing oncology trials, and to be honest with you, any trials for that matter. So Pavlos, welcome and the stage is yours.

Pavlos Msaouel: Thank you Ashish for having me. Good morning, everyone from Houston. My name is Pavlos Msaouel. I'm an assistant professor at the departments of GU Medical Oncology and Translational Molecular Pathology at MD Anderson in Houston. And I'm pleased to discuss today our work on the reproducibility of Phase III randomized trials in oncology. These are my disclosures and none are pertinent to today's presentation. And our recent work builds on this paper we published in JCO Precision Oncology last year where we curated 450 phase III randomized clinical trials in oncology that enrolled almost 340,000 patients that were published between 2004 and 2020.

And we found that the median actual power of these trials was only 49%. And this actual power was less than 80% in 65% of these trials. Even though most of them were set up in the planning stage to achieve power of at least 80%, they didn't achieve it. They usually achieved much less than that. And as a result, only 53% of phase III trials in oncology claiming superiority had provided enough evidence to demonstrate clinically meaningful benefits with at least 90% probability. So this data can be easily used to calculate the probability of effects from the summary statistics that you get from journal papers of any phase III randomized trial in oncology using our free and easy-to-use web tool found at this link.

And we also compared the actual power and corresponding signal-to-noise ratios, abbreviated as SNR, of oncology phase III randomized trials with the approximately 23,000 randomized trials from the Cochrane database across all of medicine. And you can see here that the signal-to-noise ratio was clearly better in the oncology phase III randomized trials compared with the broader medical randomized trials in the Cochrane dataset. Indeed, the median actual power of the randomized trials in the Cochrane dataset was only 13%, much less than the 49% of the oncology trials. But then the next question was, how does this low power and correspondingly weak signal-to-noise ratio influence the reproducibility of phase III trials in oncology? And we accordingly used, in our just-published paper, our most up-to-date dataset of 632 phase III oncology randomized trials that enrolled nearly half a million patients.
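
The link between signal-to-noise ratio and power mentioned above can be sketched with a textbook two-sided z-test. This is only an illustration, assuming the SNR is the true effect divided by its standard error and a significance level of 0.05; the function names are hypothetical and this is not the estimation method used in the papers discussed.

```python
from statistics import NormalDist

_STD_NORMAL = NormalDist()  # standard normal distribution


def power_from_snr(snr: float, alpha: float = 0.05) -> float:
    """Two-sided power of a z-test when true effect / standard error equals `snr`."""
    z_crit = _STD_NORMAL.inv_cdf(1 - alpha / 2)
    # Probability that the test statistic lands in either rejection region.
    return _STD_NORMAL.cdf(snr - z_crit) + _STD_NORMAL.cdf(-snr - z_crit)


def snr_from_power(power: float, alpha: float = 0.05) -> float:
    """Invert power_from_snr by bisection (power increases with snr for snr >= 0)."""
    lo, hi = 0.0, 10.0
    for _ in range(100):
        mid = (lo + hi) / 2
        if power_from_snr(mid, alpha) < power:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2


if __name__ == "__main__":
    # Power values quoted in the talk, plus the conventional 80% design target.
    for label, power in [
        ("oncology phase III, median", 0.49),
        ("Cochrane trials, median", 0.13),
        ("conventional design target", 0.80),
    ]:
        print(f"{label}: power {power:.0%} corresponds to SNR of about {snr_from_power(power):.2f}")
```

Under these assumptions, a power near 50% corresponds to an SNR close to the critical value of about 1.96, which is why a trial with that power is roughly a coin flip for reaching p<0.05.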

And the bad news is that by estimating the signal-to-noise ratio of this large data set, we found that if an oncology randomized phase III trial shows a p-value of less than 0.05, the usual cutoff for its primary endpoint, then the probability is only 43% that an exact replicate of this same phase III trial would again show P less than 0.05 for its primary endpoint. If the original trial showed P less than 0.01, then the replication probability defined as obtaining P less than 0.05 in a replicate trial increased to 60%. And if P less than 0.001 in the original trial, then the replication probability is 77%. Clearly, we should not always expect that repeating the exact same positive trial will again give us a positive result. And this graph shows the increase in replication probability.

If we increase the sample size by two or three or four times for the replicate trial, you can see that even doubling the sample size will not get the replication probability anywhere near a hundred percent. And these results clearly counter the nihilistic claims that lack of reproducibility in oncology is always due to scientific fraud. No. The mere nature of p-values will make them wobble in exactly this way. The good news though is that we found that the effect direction of oncology phase III randomized trials is highly reproducible. When the primary endpoint reaches P less than 0.05, then the probability is greater than 95% that the trial result has the correct direction. This means that a hazard ratio favoring the new treatment almost never flips toward the other direction, the one that would favor the control group. This respects the principle of first do no harm. It is very unlikely that our new treatments showing P less than 0.05 will have worse efficacy than the control group. They may have other things, like being more toxic, et cetera, but they will not have worse efficacy most likely.
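
To see why p-values wobble even without any fraud, here is a minimal sketch of a naive replication calculation. It assumes, unrealistically, that the true effect equals the original estimate, and it counts a replication as reaching two-sided p<0.05 in the same direction. The function name and the `size_multiplier` parameter are hypothetical. This is not the calculation behind the figures quoted in the talk, which are based on the signal-to-noise ratios estimated from the trial dataset.

```python
from math import sqrt
from statistics import NormalDist

_STD_NORMAL = NormalDist()


def naive_replication_probability(
    p_value: float, size_multiplier: float = 1.0, alpha: float = 0.05
) -> float:
    """Chance that a replicate reaches two-sided p < alpha in the same direction.

    ASSUMPTION: the true effect equals the original estimate (no shrinkage).
    `size_multiplier` is the replicate's sample size relative to the original.
    """
    z_obs = _STD_NORMAL.inv_cdf(1 - p_value / 2)  # |z| implied by the original p-value
    z_crit = _STD_NORMAL.inv_cdf(1 - alpha / 2)
    # The expected z-statistic grows with the square root of the sample size.
    expected_z = z_obs * sqrt(size_multiplier)
    return _STD_NORMAL.cdf(expected_z - z_crit)


if __name__ == "__main__":
    for p in (0.05, 0.01, 0.001):
        for k in (1, 2, 4):
            prob = naive_replication_probability(p, size_multiplier=k)
            print(f"original p={p}, replicate {k}x size: {prob:.0%}")
```

Even this optimistic version gives only a 50% chance of replication for an original result sitting exactly at p=0.05 with the same sample size. The lower figures quoted in the talk are consistent with the expectation that published significant estimates tend to overstate the true effect.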

And we then provide, again, another free web tool shown here to calculate the replication probability using the summary data from any phase III randomized trial in oncology. Now, how do we put this in practice? These results help us understand what happened, for example, with the phase III randomized trials of adjuvant immunotherapy in kidney cancer summarized in this forest plot. What has puzzled GU experts is why only KEYNOTE-564, testing adjuvant pembrolizumab versus placebo, showed positive results for its primary endpoints. Are the differences between nivolumab and pembrolizumab in the adjuvant setting real? That would be weird.

Or did the duration of adjuvant therapy play a role? And what about inclusion criteria? For example, some of these trials did include non-clear cell kidney cancer histologies, whereas KEYNOTE-564 did not. Maybe all of these played some role, but we don't need these elaborate explanations to understand this data. All we need to do is focus on the data at hand. Notice that in aggregate, shown here, most of these trials had low precision as evidenced by the wide confidence intervals and thus low signal-to-noise ratio. In aggregate, they all point towards a weakly positive benefit from adjuvant immunotherapy. The only difference is, as you can see here, that KEYNOTE-564 had the highest precision as evidenced by its narrow confidence intervals.

This was not just luck, it was by design. KEYNOTE-564 had by far the highest design power compared to all the others. And you can see this in this summary table we published with Dan Shapiro: KEYNOTE-564 had a design power of 95%, whereas all the others had 90% or less. This is incredibly low power if we consider that, in the corresponding metastatic setting shown here, we see far more events and thus achieve much higher precision and signal-to-noise ratio than in the corresponding adjuvant settings shown here. That is what happened with these trials. And that is the difference between adjuvant kidney cancer and adjuvant bladder cancer. Kidney cancer, on average, simply has a lower chance of recurrence, a good problem to have. And we can use these same principles and the free web tool to ask ourselves, okay, if we run the exact same trial again with the same sample size and eligibility criteria as KEYNOTE-564, what is the probability that we will again obtain P less than 0.05 for the primary endpoint of disease-free survival? 100%, 95%, 75%, 50%, 25%, or maybe even zero?

Some would think that this question is even impossible to answer, but it shouldn't be impossible to answer. The answer, based on our best available data, is actually around 75%. And anyone can easily estimate this using our web calculator as shown in this screenshot. You can just enter the hazard ratio and the 95% confidence interval for the primary endpoint of disease-free survival from KEYNOTE-564, and the replication probability comes up as 77%. We would need to double or triple the sample size of the replicate trial to get our replication probabilities to greater than 90%. This gives us useful context about what it takes to design actually reproducible trials in this setting. Note again also here that the directionality of the result is pretty robust.
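
The inputs described above, a hazard ratio and its 95% confidence interval, are enough to recover the trial's z-statistic. A minimal sketch of that conversion follows, assuming the interval is symmetric on the log scale. The numbers in the example are hypothetical placeholders, not the KEYNOTE-564 results, and the function name is illustrative; the web calculator's own model is not reproduced here.

```python
from math import log
from statistics import NormalDist

_STD_NORMAL = NormalDist()


def z_from_hazard_ratio(hr: float, ci_lower: float, ci_upper: float, level: float = 0.95):
    """Return (z, standard_error, two_sided_p) from a hazard ratio and its CI.

    ASSUMPTION: the confidence interval is symmetric around log(hr).
    """
    z_level = _STD_NORMAL.inv_cdf(1 - (1 - level) / 2)  # about 1.96 for a 95% CI
    # The CI spans 2 * z_level standard errors on the log hazard ratio scale.
    standard_error = (log(ci_upper) - log(ci_lower)) / (2 * z_level)
    z = log(hr) / standard_error
    p_value = 2 * _STD_NORMAL.cdf(-abs(z))
    return z, standard_error, p_value


if __name__ == "__main__":
    # Hypothetical summary statistics, for illustration only.
    z, se, p = z_from_hazard_ratio(hr=0.70, ci_lower=0.55, ci_upper=0.90)
    print(f"z = {z:.2f}, SE(log HR) = {se:.3f}, two-sided p = {p:.4f}")
```

The absolute value of z is the observed signal-to-noise ratio. A narrower confidence interval means a smaller standard error and therefore a larger z for the same hazard ratio.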

The probability that adjuvant pembrolizumab results in worse disease-free survival than placebo is essentially zero. This is consistent with the results of every single phase III adjuvant immunotherapy trial in kidney cancer, all of which are most compatible with a potential disease-free survival benefit, even if weak, and not with worse disease-free survival, as would happen if their point estimates were on the other side. In summary, rigorous clinical science still delivers reliable signals, far from the claim that nothing ever replicates due to scientific fraud. However, a single so-called statistically significant result is rarely definitive by itself. Even though we generally do better in oncology than other medical trials, we still do need to improve the power and precision of our trials.

In oncology, for example, the confusion caused by the results of adjuvant immunotherapy phase III trials in kidney cancer is almost certainly due to their low power. To improve this, in future trials we can focus on populations with highest recurrence risk, incorporate biomarkers that can help us determine which patients are at highest risk for disease recurrence, and we can also use more high-resolution endpoints that yield higher signal-to-noise ratio. And we can simply even increase the effective sample size by planning to enroll more patients. There is no free lunch. Thank you again for the invitation and for allowing us to present our data. My contact information is shown on this slide. And I'll be happy to answer any questions now and in the future.
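
The trade-off between design power, recurrence risk, and sample size described above can be sketched with Schoenfeld's approximation for the number of events a log-rank comparison needs. This is a rough illustration, assuming a two-sided alpha of 0.05 and 1:1 randomization. The target hazard ratio and the event probabilities are hypothetical, and the patient count ignores accrual and follow-up details.

```python
from math import ceil, log
from statistics import NormalDist


def required_events(
    hazard_ratio: float, power: float, alpha: float = 0.05, allocation: float = 0.5
) -> int:
    """Schoenfeld's approximation: events needed to detect `hazard_ratio`."""
    normal = NormalDist()
    z_alpha = normal.inv_cdf(1 - alpha / 2)
    z_power = normal.inv_cdf(power)
    denominator = allocation * (1 - allocation) * log(hazard_ratio) ** 2
    return ceil((z_alpha + z_power) ** 2 / denominator)


def required_patients(events: int, event_probability: float) -> int:
    """Crude patient count: events divided by the chance a patient has an event."""
    return ceil(events / event_probability)


if __name__ == "__main__":
    target_hr = 0.70  # hypothetical target effect, for illustration only
    for power in (0.80, 0.90, 0.95):
        events = required_events(target_hr, power)
        # Hypothetical event probabilities: a higher-recurrence and a lower-recurrence setting.
        high_risk = required_patients(events, 0.50)
        low_risk = required_patients(events, 0.25)
        print(f"power {power:.0%}: {events} events, "
              f"about {high_risk} patients at 50% event risk, {low_risk} at 25%")
```

Power is driven by events rather than by enrolled patients, so under these assumptions a setting with half the recurrence risk needs roughly twice as many patients for the same design power.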

Ashish Kamat: Thank you so much, Pavlos. And again, we'll have the links that you mentioned during your talk on the UroToday website as well, so people can go there. Just to clarify, this is just open source and anyone can access those links, correct?

Pavlos Msaouel: Exactly, yes.

Ashish Kamat: Great. So, Pavlos, this does raise the question that often many of us wrestle with, which is, we have, for example, you showed in kidney cancer, but in bladder cancer too, I am sure you're aware of the adjuvant studies, the AMBASSADOR, IMvigor, CheckMate, various similar study designs. In fact, some would say almost the exact same study design, just different IO agents and vastly different results with a few overlapping confidence intervals when it comes to overall survival and some for disease-free survival. Do you think that you can explain a lot of what we see in such situations based on your statistical deep dive, or do you think that there's actually biology or you think it's a little bit of both?

Pavlos Msaouel: It's always some combination of both. Of course, biology and other factors always play a role. I can tell you that in kidney cancer at least, which I'm the most familiar with, you can practically explain most of what happened just by simple statistical principles. You don't need to go much deeper than that. It's the simple fact that kidney cancer, even clear-cell kidney cancer, which among the more common ones is typically the most likely to recur, is not as likely to recur as some other cancers. This is the same principle you see in breast cancer.

Okay, when they do adjuvant trials in breast cancer for triple-negative breast cancer, which is much, much more likely to recur than hormone receptor-positive breast cancer, they use smaller sample sizes, because they have higher recurrence rates. But when they do adjuvant trials in hormone receptor-positive breast cancer, which is less likely to recur, their sample sizes are in the thousands. And that's where we need to be with kidney cancer. With bladder cancer, I believe, and correct me if I'm wrong, it does tend to recur more frequently on average, so the signal-to-noise ratio is expected to be better than in the kidney cancer trials. But still, improving that signal-to-noise ratio by biomarker stratification and an understanding of the biology of which patients are at highest risk will be helpful.

Ashish Kamat: Yeah, and I'm glad you said that because a lot of us suspect that some of the disparate results from the studies are actually explainable just based on sample size. And even if you look at the EV Pembro data, I mean a lot of the events haven't occurred yet, but everyone has confidence in the sample size. Whereas, if you look in earlier phases of bladder cancer, you do thousand-patient studies and there's a 2% difference. And how do you explain that? How do you counsel patients? Is that something relevant to them? Is it not relevant? Is it something that you would've seen or don't see based on sample size calculations?

I mean, that's a whole different paradigm and I'm glad there are people like you that are working on refining this. But then that raises another issue. Again, the tool that you provided clearly is a tool that can be used and it's based on hard calculations. But we also hear about development of AI tools. In fact, the FDA has an AI tool, and it's been in the press recently where it's hallucinating clinical trials and references and statistical inferences. What are your thoughts or viewpoints on how we will be able to use AI to overcome some of the things that you talked about and where should we be cautious?

Pavlos Msaouel: Yeah, that's a great question. AI has been surprisingly effective, certainly surprising to me, in some aspects, and you have to be careful about some other aspects. We use AI in our group and in my lab essentially every single day. In fact, a year ago, we spent quite a few sessions every week in my lab just test-driving these tools and learning their strengths and weaknesses. And now we've reached a point where we're so familiar with them that we even allow them to semi-design at least some of our actual experiments. Again, you have to be cautious.

One of the things that I prefer to do whenever I turn to an AI is to use something that's specialized to the topic at hand. For example, OpenEvidence is pretty reliable when you're looking for things like drug toxicity, et cetera. And then for the more general ones, I might actually triage by looking at paid versions of at least two or three of them. And the more you use them, you actually learn that they're kind of like people: each one of them, based on what it has been trained on, has its own personality. GPT-4o, for example, tends to be more like your nerdy scientific brother. Claude tends to be sometimes a little bit more poetic, et cetera. With that in mind, there are many, many, many, many caveats. You should definitely not just rely on AI.

For example, we are content experts. We can tell easily where AI is off. And the problem is that AI can be off and it can be very confidently wrong. It can tell you with confidence that black is white, and that's something to be very careful with. But if used right and correctly, it is transformative for our practice.

Ashish Kamat: Yeah, absolutely. And that's a good segue because like you said, you can be confident that you're right in a clinical trial until you actually look at the probability of reproducing the effects from the summary statistics and then come up with your reproducibility. So on that note, Pavlos, I want to thank you for taking the time. I'm sure our audience will have questions. I'm sure they'll reach out to you. But thank you for taking the time and sharing your insights with us today.

Pavlos Msaouel: Thank you again for having me.