Machine Learning Stemness Signature for Prostate Cancer - Agustina Sabater & Geraldine Gueron

January 28, 2025

Andrea Miyahira speaks with Geraldine Gueron and Agustina Sabater about a paper on neuroendocrine prostate cancer. The discussion focuses on the development of a seven-gene stemness signature that predicts cancer progression. Using machine learning algorithms and multiple data sets, they create a risk-score model that shows significant prognostic value across different stages of prostate cancer, including metastatic disease. The signature performs comparably to existing commercial tests like Decipher but requires fewer genes. The researchers emphasize their goal of making this diagnostic tool more accessible and affordable, particularly in Latin America where prostate cancer mortality rates are high. They plan to implement the signature as a simple qPCR test to support clinical decision-making at the time of biopsy.

Biographies:

Agustina Sabater, MSc, Doctoral Fellow at Laboratory of Cancer and Inflammation (IQUIBICEN), Head of Practical Assignments (UADE), Institute of Biological Chemistry, School of Sciences, University of Buenos Aires, Buenos Aires, Argentina

Geraldine Gueron, PhD, Head at Laboratory of Cancer and Inflammation, Institute of Biological Chemistry, School of Sciences, University of Buenos Aires, Buenos Aires, Argentina

Andrea K. Miyahira, PhD, Director of Global Research & Scientific Communications, The Prostate Cancer Foundation


Read the Full Video Transcript

Andrea Miyahira: Hi, everyone. Welcome to UroToday. I'm Andrea Miyahira here at the Prostate Cancer Foundation. I'd like to welcome Doctors Geraldine Gueron and Agustina Sabater of the University of Buenos Aires and CONICET. They will share their recent publication, Unmasking Neuroendocrine Prostate Cancer with a Machine Learning-Driven Seven-Gene Stemness Signature That Predicts Progression, published in the International Journal of Molecular Sciences. Dr. Gueron and Dr. Sabater, thank you for joining us.

Geraldine Gueron: Thank you, Andrea, for inviting us. I'm going to let Agustina show you what we did. She's the first author, and she is an excellent speaker. So, I'll leave the talk to you, and afterwards I'll join for the questions.

Agustina Sabater: First of all, thank you for having us here to present our recent publication in the International Journal of Molecular Sciences. In this work, we focus on stemness-associated genes as potential prognostic biomarkers of prostate cancer aggressiveness. We focus on stemness because we acknowledge the importance of cancer stem cells in tumor aggressivity and in the metastatic process.

We gather 144 stemness-associated genes from the literature that have been previously linked with prostate cancer stem cells, and we perform transcriptomics and survival analysis using publicly available data sets to define potential prognostic biomarkers that could aid clinical decisions.

So, first of all, we perform transcriptomic analyses for these 144 stemness-associated genes, and we spanned seven data sets, including normal prostate glands, primary, metastatic, and castration-resistant prostate cancer tumors. On the bottom left, you can see a volcano plot that exemplifies the differential expression analysis performed for the TCGA-PRAD data set. And on the heatmap on my right, you can see a summary of all 11 comparisons that we assessed.

Each row of this heatmap represents a different differential expression analysis with a different comparison. There's a subgroup of genes on the left that is consistently upregulated in the comparisons, and another group of genes on the bottom right that is consistently downregulated [INAUDIBLE]. Please note that 139 of the 144 stemness-associated genes presented significant dysregulations in these comparisons.

We then perform a survival analysis to assess the association between these genes and the risk of different survival events. You can have one of three different results when you perform a survival analysis:

Genes where the highest expression is associated with worse prognosis (high risk of the event).

Genes where low expression is associated with worse outcomes.

Genes that have no significant associations with these.

We were assessing 144 genes in five data sets with more than 1,200 patients and seven different comparisons regarding progression-free, relapse-free, metastasis-free, disease-free, and overall survival. So these are more than 1,000 Kaplan-Meier plots. Of course, I'm not going to show them all now, but I'm going to show you this heatmap that summarizes our survival analysis.

This time, each square of the heatmap is a different Kaplan-Meier plot. You have the color code that we defined with the previous results I showed you. We have a group of genes where high expression correlates with worse outcomes and others that have low expression associated with higher risk of the events.

However, when we perform multivariable survival analysis—which takes into account the different clinicopathological characteristics that are available within each data set—we lost almost all significance. This result highlights the importance of creating a signature, of constructing a model that could predict progression, rather than just relying on one specific gene to try to improve the clinical setting and management of cases.

So we performed—actually, we used—a machine learning algorithm called Random Forest, which ranks the genes within each data set according to their importance in predicting the event. We had these different training data sets where we could rank all 144 genes, and then we calculated the mean relative importance so we could define which of these genes were top-ranking for predicting the events.
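To illustrate the ranking step described here, the following is a minimal sketch in Python. It assumes that "relative importance" means each gene's importance divided by the top importance within the same data set, which is one plausible reading and not necessarily the authors' exact definition. The data set names, gene names, and importance values are placeholders standing in for the output of a Random Forest fit.

```python
from statistics import mean

# Hypothetical per-data-set importances, as a Random Forest might report them.
# Data set names, gene names, and numbers are placeholders, not values from the paper.
importances = {
    "dataset_A": {"GENE1": 0.30, "GENE2": 0.10, "GENE3": 0.05},
    "dataset_B": {"GENE1": 0.20, "GENE2": 0.40, "GENE3": 0.10},
}


def relative_importance(scores):
    """Scale importances so the top gene in a data set gets 1.0 (an assumption)."""
    top = max(scores.values())
    return {gene: value / top for gene, value in scores.items()}


def rank_by_mean_relative_importance(per_dataset):
    """Average each gene's relative importance across data sets, highest first."""
    scaled = [relative_importance(scores) for scores in per_dataset.values()]
    genes = scaled[0].keys()
    means = {gene: mean(s[gene] for s in scaled) for gene in genes}
    return sorted(means.items(), key=lambda item: item[1], reverse=True)


ranking = rank_by_mean_relative_importance(importances)
for gene, score in ranking:
    print(f"{gene}: {score:.3f}")
```

Scaling within each data set before averaging keeps a data set with generally larger raw importances from dominating the mean.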

We found 15 genes that are potential prognostic biomarkers and candidates for our gene signature. We then performed LASSO regression, which is a model that allowed us to create this risk-score calculation. With this risk-score calculation—created using the TCGA-PRAD data set—we can now calculate different risk scores for each patient, assign different values, and then order them according to their risk of the progression event.
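A risk score derived from a LASSO-penalized regression is commonly a weighted sum of gene expression values, with the weights being the fitted coefficients. The sketch below assumes that form. The coefficients, gene names, and expression values are placeholders for illustration only and are not the published signature.

```python
# Placeholder coefficients for illustration; NOT the coefficients from the paper.
coefficients = {"GENE_A": 0.5, "GENE_B": -0.25, "GENE_C": 1.0}

# Hypothetical normalized expression values per patient.
patients = {
    "P1": {"GENE_A": 2.0, "GENE_B": 4.0, "GENE_C": 1.0},
    "P2": {"GENE_A": 4.0, "GENE_B": 0.0, "GENE_C": 3.0},
    "P3": {"GENE_A": 0.0, "GENE_B": 8.0, "GENE_C": 0.5},
}


def risk_score(expression, coefs):
    """Weighted sum of expression values (assumed linear form of the score)."""
    return sum(weight * expression[gene] for gene, weight in coefs.items())


scores = {pid: risk_score(expr, coefficients) for pid, expr in patients.items()}

# Order patients from highest to lowest predicted risk.
ordered = sorted(scores, key=scores.get, reverse=True)
print(ordered)
```

Genes with a positive weight raise the score when highly expressed, while genes with a negative weight lower it, so both kinds of genes seen in the survival heatmap can contribute to one number per patient.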

You can see it works perfectly in this shape because it’s the cohort that was used to create this model. But it also works and has a significant association with the event in all the data sets except for one. There are two ways you can assess the performance of a risk-score model. You can either stratify patients by the median risk-score value—this gives you two groups, a high-risk group and a low-risk group—and you can compare and see if this is working. Or you can assess the risk-score as a continuous variable. This means that you order the patients according to the risk-score, and you have a gradient of risk, letting you see exactly which patients are more prone to relapse or progression.
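The median split described here can be sketched as follows. The patient identifiers and scores are hypothetical, and the choice to place patients exactly at the median in the low-risk group is an assumption of this sketch.

```python
from statistics import median

# Hypothetical risk scores per patient (placeholders, not study data).
scores = {"P1": 1.0, "P2": 5.0, "P3": -1.5, "P4": 2.5}


def stratify_by_median(risk_scores):
    """Label each patient 'high' or 'low' relative to the cohort median.

    Patients exactly at the median are labeled 'low' (an arbitrary choice).
    """
    cutoff = median(risk_scores.values())
    return {
        pid: ("high" if score > cutoff else "low")
        for pid, score in risk_scores.items()
    }


groups = stratify_by_median(scores)

# The continuous alternative keeps the full ordering instead of two groups.
gradient = sorted(scores, key=scores.get, reverse=True)
print(groups, gradient)
```

The two-group view is easy to plot as a pair of Kaplan-Meier curves, while the continuous ordering keeps the information that would otherwise be lost at the cutoff.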

On the right, you can see we calculated what's called the concordance index, which is a measure of performance of our risk-score. We got values of accuracy of around 65%. Then we validated our risk-score in external data sets that were not used to model this signature, and again we found significant results—significant associations between our risk-score model and the risk of an event.
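For readers unfamiliar with the concordance index, the sketch below shows one common form, Harrell's C, on toy data. It is an illustration and not necessarily the exact estimator used in the paper. A value of 0.5 corresponds to random ordering and 1.0 to perfect ordering, so a figure of around 65% corresponds to a C-index of about 0.65. The times, event flags, and scores are hypothetical.

```python
def concordance_index(times, events, risk_scores):
    """Harrell's C: fraction of comparable pairs ordered correctly by risk.

    A pair (i, j) is comparable when patient i had the event and did so
    strictly before patient j's last follow-up time. The pair is concordant
    when patient i also has the higher risk score; ties count as half.
    """
    concordant = 0.0
    comparable = 0
    n = len(times)
    for i in range(n):
        if not events[i]:
            continue  # a censored patient cannot be the earlier event
        for j in range(n):
            if times[i] < times[j]:
                comparable += 1
                if risk_scores[i] > risk_scores[j]:
                    concordant += 1.0
                elif risk_scores[i] == risk_scores[j]:
                    concordant += 0.5
    if comparable == 0:
        raise ValueError("no comparable pairs")
    return concordant / comparable


# Toy cohort: follow-up time, event flag (1 = event, 0 = censored), risk score.
times = [2, 4, 6, 8]
events = [1, 1, 0, 1]
risk = [0.9, 0.3, 0.5, 0.1]

c_index = concordance_index(times, events, risk)
print(round(c_index, 2))
```

In this toy cohort, four of the five comparable pairs are ordered correctly, giving a C-index of 0.8.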

Importantly, we are using the Stand Up To Cancer data set, which has metastatic samples from prostate cancer patients. Even in this advanced stage of the disease, we found a significant association between the risk of death and our risk-score. Patients with a high risk-score had nearly a two-fold increase in the risk of death. Another thing I'm not showing you here is that we also performed multivariable survival analysis, like we previously did with the individual genes, but this time, the results remained significant. So this signature, this risk-score model, is adding value to what is already used in the clinic.

Finally, we wanted to assess if there were any associations between the genes in our risk-score and other clinicopathological characteristics of the tumor. We assessed the M.D. Anderson prostate cancer patient-derived xenograft (PDX) cohort in collaboration with Dr. Estefania Labanca from M.D. Anderson. This PDX cohort is really well annotated and has many defined features. So, we could assess what's happening when we perform an unsupervised clustering analysis only using our seven genes from the signature.

We observed that we had a cluster of neuroendocrine PDXs that were grouped together according to the transcriptomics data of only these seven genes. Notably, the neuroendocrine PDXs also had a higher risk score when we calculated it using the risk-score model that I showed you previously. So we then decided to assess a neuroendocrine prostate cancer patient cohort that has a better pathological classification of its neuroendocrine samples.

We observed, only using our seven genes again, that we had a grouping of the [INAUDIBLE] neuroendocrine carcinomas (in light blue), and these are the ones that had the highest risk out of four. When we perform ROC curve classifying performance analysis, we found a result of 0.91 when assessing our gene score as a classifier for large cell neuroendocrine prostate cancer.
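Assuming the 0.91 figure refers to the area under the ROC curve (AUC), the sketch below shows how such a value can be computed from a score and binary labels. It uses the pairwise definition of AUC: the probability that a randomly chosen positive sample scores higher than a randomly chosen negative one. The labels and scores are hypothetical.

```python
def roc_auc(labels, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    Ties between a positive and a negative score count as half a win.
    """
    positives = [s for label, s in zip(labels, scores) if label == 1]
    negatives = [s for label, s in zip(labels, scores) if label == 0]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative sample")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Toy data: 1 = subtype of interest, 0 = other samples (placeholders).
labels = [1, 1, 0, 0, 0]
scores = [0.9, 0.4, 0.5, 0.2, 0.1]

auc = roc_auc(labels, scores)
print(round(auc, 3))
```

An AUC of 0.5 means the score does no better than chance at separating the two groups, and 1.0 means every positive sample scores above every negative one.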

So, the take-home message here is that our seven-gene-associated signature is a significant advancement in prostate cancer prognosis. It doesn’t matter if you are assessing a primary tumor or later stages of the disease, as we have shown that it also works in metastatic tumors. It is also able to classify tumors as neuroendocrine, including the particularly challenging large cell subtype that we know is very aggressive and deadly. This tool may also help identify the most aggressive tumor subtypes that may not yet display neuroendocrine transdifferentiation (and maybe they never will) but still represent a high risk for adverse outcomes. We can do that with only seven genes, which is what adds the most value to our work. In this way, we are offering a robust and cost-effective tool that has potential applications to personalize and make the choice of therapies more effective in the clinical setting.

Here’s the team. Thank you very much for having us. And here you also have the publication. Thank you.

Andrea Miyahira: Thank you so much for presenting this paper. So, have you compared the performance of this model with other gene expression-based biomarkers?

Geraldine Gueron: So yes, we have compared the performance of our signature to Decipher, Oncotype, and Prolaris, and a couple more. I'm just going to show you this. We calculated the concordance index, which Agustina explained before: a metric that measures how well the model can distinguish between individuals or patients with higher and lower risk of an event.

If you check, our signature performs very similarly to the other signatures. And not only that, in some cases for some data sets it outperforms the other signatures. So, this shows that ours is comparable in terms of predictive power for progression, and it highlights the potential of our model to complement or even enhance existing clinical tools. One more thing: just a reminder, these signatures range between 17 and 46 genes. We’re only assessing seven, and now we're trying to narrow that to five.

Andrea Miyahira: OK. Really wonderful to see that. So how would you envision applying this in the clinic?

Geraldine Gueron: Our reality in Latin America is quite different. There's not as much penetration of precision medicine in the region, mainly because these types of tools are unaffordable. The idea—because this is part of a startup company—is that we're trying to develop these diagnostic precision tools that can aid in decision-making. The idea is to be able to use these AI tools, the machine learning tools, to come up with these signatures and be able to apply them with a simple qPCR, real-time PCR.

If you look at the Latin America region in terms of prostate cancer, it seems as if the incidence is quite low compared to the rest of the world. But when you look at the mortality rates, they fire up completely. So this makes us think that there's a problem at the moment of diagnosis. If we could add value by giving these tools at a price that is affordable for the population, maybe we could make a significant change in terms of diagnosis.

So the way we see it is that our signatures are intended to support clinical decision-making at the moment of the biopsy. The same way Decipher works, the clinician would obtain a score and would decide on the aggressiveness of that tumor at that moment, specifically because most of the prostate cancer cases clinicians receive here are Gleason 7 intermediate-risk scores. We are not certain of the tumor’s aggressiveness in the future; there are still unknowns, and we think it would be well suited for most of those cases to decide whether you just do active surveillance or take action in a different way.

Andrea Miyahira: OK. Thank you. And what further validation steps are needed, or what are your next steps?

Geraldine Gueron: So we have more than one signature. We have the technology already. And of course, our next step is to validate this and start our prospective studies. We have agreements with hospitals from both the private and the public sectors, and we're beginning to do the tests. The final idea is to create this kit where we could measure these signatures—five to seven genes—by a simple qPCR, and then obtain a risk-score that could be attached to the diagnosis.

Andrea Miyahira: OK. Thank you. Those are wonderful goals, and I wish you luck in that. Thanks for coming on and sharing this with us today.

Geraldine Gueron: Thank you for having us.