Comparative testing of reliability and audit utility of ordinal objective calculus complexity scores. Can we make an informed choice yet?: Beyond The Abstract

In the last few years, multiple nephrolithometry scoring systems for percutaneous nephrolithotomy (PCNL) have been described. Each grades the complexity of the procedure with the aim of being associated with the stone-free rate (SFR). They broadly fall into two categories –

1) Ordinal scoring systems - described from single institutions, these grade calculi in ordered grades of increasing complexity, with or without additional factors thought to be associated with difficult surgery. They are: Guy’s Stone Score (GSS, which assigns calculi to 4 grades based on number, location, and the presence or absence of abnormality in the renal unit)¹; the S.T.O.N.E. nephrolithometric score [which assigns a score (range 5-13) based on calculus size, distribution and hardness, the renal attribute of hydronephrosis, and the patient attribute of skin-to-stone distance]²; and the Seoul National University Renal Stone Complexity Score [S-RESC, which considers the distributional complexity of calculi and assigns a score (range 1-9) with 1 point for each of 9 stations in the kidney in which calculus is present]³.

2) Continuous predictive scoring system – the CROES nomogram is a data-driven risk estimator built by regression modelling on data from multiple institutions. It gives a continuous score that can ‘predict’ the dependent outcome of SFR from the explanatory variables of stone burden, case volume, prior stone treatment, staghorn stone, stone location and stone count⁴.
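The S-RESC rule in item 1 (one point per kidney station containing calculus, giving a range of 1-9) can be illustrated with a minimal Python sketch. The station labels below are hypothetical placeholders, since this article does not name the nine stations, and the function name is ours rather than anything from the original publication.

```python
# Hypothetical labels: the article only states that there are nine stations.
STATIONS = tuple(f"station_{i}" for i in range(1, 10))


def s_resc_score(involved_stations):
    """Return 1 point for each distinct station that contains calculus (range 1-9)."""
    involved = set(involved_stations)  # a station counts once, however many stones it holds
    unknown = involved - set(STATIONS)
    if unknown:
        raise ValueError(f"unknown stations: {sorted(unknown)}")
    if not involved:
        raise ValueError("at least one station must contain calculus")
    return len(involved)


# A single stone in one station scores 1.
print(s_resc_score(["station_1"]))
# Stones in three distinct stations score 3, even if one station is listed twice.
print(s_resc_score(["station_1", "station_4", "station_4", "station_9"]))
```

Because the score counts distinct stations rather than stones, it rises only with how widely calculi are distributed.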

Yet these scoring systems are still not widely used in clinical practice. One reason could be that clinicians cannot visualise how the ability to predict SFR would improve their clinical practice or outcomes. However, it is easy to see how a meaningful categorisation of calculus complexity could do more than just be associated with SFR, as calculus complexity is an obvious confounder when one intends to study any aspect or outcome of PCNL. A uniformly followed, easy-to-apply and standardized system of grading the complexity of PCNL has multiple advantages, such as –

Accurate prediction of outcomes, prognosis, and necessity of ancillary procedures after PCNL.
Objective assessment of technique modifications which claim superior outcomes.
Audit/benchmarking of clearance rates, which could also aid in referring complex cases to centres with higher benchmarks.

Since data from different scoring systems were not amenable to direct cross comparison, it made sense for us to compare them and see whether any particular scoring system outperformed the others in terms of reliability (inter-rater and test-retest) and association with outcomes, namely SFR, complications, need for multiple PCNL sessions, and auxiliary procedures (clinical experience showed these four to matter most to the treatment decisions made by patients).

We chose to leave out the CROES nomogram because its regression coefficients were derived to ‘predict’ SFR, and as such it cannot give a similar continuous predictive score for any other outcome. For outcomes other than SFR, we could only have shown whether nomogram scores were significantly higher or lower (which does not really aid decision making, as it provides neither a risk estimate nor a clinical cut-off point), or broken the nomogram score into arbitrary categories and treated it like an ordinal score (which reduces its predictive and discriminative power, has doubtful clinical relevance, and risks introducing confounding if categories are uneven, as larger categories may inherently carry larger risk).

Multiple such comparative studies have been carried out previously. However, none of them was designed with statistical sample size considerations, and thus they may have been underpowered to detect differences, especially for outcomes other than SFR. Ours is the first study to pre-specify a sample size estimate to avoid such errors. ‘Discrimination’ is an important performance characteristic of a scoring system and implies the ability to differentiate between those with and without the outcome under study. Receiver operating characteristic (ROC) curve analysis is a known method to assess discriminative ability (with an area under the ROC curve above 0.7 accepted as a fair test and one above 0.6 as a poor test). Designing a study that compares the discriminative ability of scoring systems requires data on the correlation between scores among those with and without the outcome of interest. Such data had never been reported previously, and thus we assumed the scores to be only moderately correlated (correlation coefficient 0.4, which required a larger sample size). Thus, we designed our study to detect a difference in area under the ROC curve of 0.7 vs 0.6 with 80% power, with an ultimate study population of more than 600 cases.
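To make ‘discrimination’ concrete: the area under the ROC curve equals the probability that a randomly chosen case with the outcome has a higher score than a randomly chosen case without it, with ties counted as half. The sketch below computes it that way for an ordinal score. The function name and the toy scores are illustrative assumptions, not data or code from our study.

```python
def area_under_roc(scores_with_outcome, scores_without_outcome):
    """Area under the ROC curve via pairwise comparison (ties count as half)."""
    if not scores_with_outcome or not scores_without_outcome:
        raise ValueError("both groups must contain at least one case")
    wins = 0.0
    for a in scores_with_outcome:
        for b in scores_without_outcome:
            if a > b:
                wins += 1.0
            elif a == b:
                wins += 0.5  # ties are common with ordinal scores
    return wins / (len(scores_with_outcome) * len(scores_without_outcome))


# Made-up ordinal complexity scores, for illustration only.
residual_stones = [3, 4, 4, 2]   # cases with the outcome (not stone free)
stone_free = [1, 2, 2, 3, 1]     # cases without the outcome

print(area_under_roc(residual_stones, stone_free))  # 0.875
```

A value of 0.5 means the score discriminates no better than chance, and 1.0 means every case with the outcome scored higher than every case without it.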

We found all scoring systems to have high reliability and a meaningful association with all outcomes of interest, though on cross comparison the S-RESC score more consistently emerged as the most strongly associated with all outcomes. The S-RESC score considers the distributional complexity of calculi, which on deeper analysis is actually an amalgamation of multiple factors reflecting ‘calculus complexity’, as it may include the influence of stone number, size and anatomic location (a higher strength of each contributor denoting a higher probability of stones being found at multiple locations). Further, it has recently also been validated to predict SFR after retrograde intrarenal surgery, thus raising hope for a common audit tool for all renal stones managed by ‘any’ invasive modality⁵. Also, a clinically useful scoring system should preferably quantify harms and benefits, which could lead to the determination of an optimum decision threshold; decision curve analysis is one recently described method that allows such evaluation of prediction models⁶. On that basis, the developers of the S-RESC score report “positive net benefit in almost all threshold probabilities to predict SFR after single tract PCNL”. However, there are unique advantages to other scoring systems too; for example, GSS can be applied on plain X-ray films, which could suit resource-poor settings. Thus, we believe the decision to adopt a common audit tool for PCNL is likely to be influenced by factors beyond statistical fit.
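For readers unfamiliar with decision curve analysis, its central quantity is the net benefit at a chosen threshold probability: the proportion of true positives minus the proportion of false positives, weighted by the odds of the threshold. The sketch below assumes each case already has a predicted probability of the outcome (for example, from a model fitted on a complexity score). That mapping, the function name and the toy numbers are our assumptions for illustration only.

```python
def net_benefit(predicted_probs, outcomes, threshold):
    """Net benefit = TP/n - (FP/n) * threshold / (1 - threshold)."""
    if not 0.0 < threshold < 1.0:
        raise ValueError("threshold must lie strictly between 0 and 1")
    if not outcomes or len(predicted_probs) != len(outcomes):
        raise ValueError("inputs must be non-empty and of equal length")
    n = len(outcomes)
    # A case is classed as positive when its predicted probability reaches the threshold.
    true_pos = sum(1 for p, y in zip(predicted_probs, outcomes) if p >= threshold and y == 1)
    false_pos = sum(1 for p, y in zip(predicted_probs, outcomes) if p >= threshold and y == 0)
    return true_pos / n - (false_pos / n) * threshold / (1.0 - threshold)


# Made-up predicted probabilities and observed outcomes (1 = outcome occurred).
probs = [0.9, 0.8, 0.6, 0.4, 0.2]
observed = [1, 1, 0, 1, 0]

for pt in (0.3, 0.5, 0.7):
    print(pt, round(net_benefit(probs, observed, pt), 3))
```

Plotting net benefit against a range of thresholds, alongside the ‘treat all’ and ‘treat none’ strategies, gives the decision curve; a model is useful at a given threshold when its net benefit exceeds both.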

Our paper also highlights areas of potential improvement in the reliability of these scores and clarifies important aspects of their practical application, including future roles such as benchmarking, audit, training and objective assessment of surgical technique modifications.

As ours is the first study to comprehensively compare all three ordinal scores together across multiple PCNL outcomes, we recommend more such comparative studies in the future. We also encourage investigators to power their comparative studies correctly so as to avoid statistical type 2 error. To aid similar future comparative studies, we report that the correlation among scoring systems, in both those with and those without the outcome, is at least 0.7 for SFR, while it is at least 0.6 for the other mentioned outcomes. Thus, for example, a similarly designed study comparing the discriminative ability of two scoring systems for SFR would require a minimum sample size in excess of 200 (and 300 if all three scores are compared together, assuming an SFR of 76%).

Written by: Manav Suryavanshi, Jiten Jaipuria

References:

1. Thomas K, Smith NC, Hegarty N, Glass JM. The Guy’s Stone Score—grading The Complexity Of Percutaneous Nephrolithotomy Procedures. Urology 2011;78:277-81
2. Okhunov Z, Friedlander JI, George AK, et al. S.T.O.N.E. Nephrolithometry: Novel Surgical Classification System For Kidney Calculi. Urology 2013;81:1154-9
3. Jeong CW, Jung JW, Cha WH, et al. Seoul National University Renal Stone Complexity Score For Predicting Stone-free Rate After Percutaneous Nephrolithotomy. PLoS One 2013;8:e6588
4. Smith A, Averch TD, Shahrour K, et al. A nephrolithometric nomogram to predict treatment success of percutaneous nephrolithotomy. J Urol 2013;190:149-56
5. Park J, Kang M, Jeong CW, et al. External Validation and Evaluation of Reliability and Validity of the Modified Seoul National University Renal Stone Complexity Scoring System to Predict Stone-Free Status After Retrograde Intrarenal Surgery. J Endourol 2015;29:888-93
6. Vickers AJ, Elkin EB. Decision curve analysis: a novel method for evaluating prediction models. Med Decis Making 2006;26:565-74
