AUA 2019: Population Cancer Databases: Goldmines and Pitfalls for the Researcher, Reviewer, and the Reader

Chicago, IL: There are different types of datasets to consider. These include administrative or claims databases, such as Medicare claims data, which are generated from reimbursement claims and contain data elements such as ICD-9 codes, CPT codes, and discharge diagnoses. Longitudinal clinical registries collect common baseline factors and clinical outcomes repeatedly over time. There are also population-based surveys with a well-defined target population.

When selecting a dataset, it is important to consider the following points:

  1. Dataset complexity (which increases with the sample size), file structure (single vs. multiple records per individual), and complexity of survey design
  2. Cost
  3. Time to acquire access and institutional review board approval
  4. Ability to link to other datasets

Dr. Albertsen emphasized how important it is to know the pitfalls of the chosen dataset thoroughly. It is important to know who collected the data, for what purpose, and how patients were followed. You also need to appreciate how much data is missing and how complete the dataset is. A good comparison between the most commonly used databases (SEER, NCDB, and SEER-Medicare) is shown in Table 1.

Table 1: Comparison of some of the commonly used databases [1]
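As a rough illustration of the points about file structure and missing data, the sketch below counts records per individual and the fraction of missing values per variable. The records, field names, and values are hypothetical and are not drawn from SEER, NCDB, or SEER-Medicare; a real extract would need its own coding of missing values.

```python
from collections import Counter

# Hypothetical records (illustrative only): None marks a missing value.
records = [
    {"patient_id": 1, "stage": "T2", "psa": 6.1, "grade": None},
    {"patient_id": 1, "stage": "T2", "psa": None, "grade": None},
    {"patient_id": 2, "stage": None, "psa": 9.4, "grade": "7"},
    {"patient_id": 3, "stage": "T1", "psa": 4.0, "grade": "6"},
]

def missing_fraction(rows):
    """Return, for each field, the fraction of rows where it is None."""
    fields = rows[0].keys()
    return {f: sum(r[f] is None for r in rows) / len(rows) for f in fields}

def records_per_patient(rows, id_field="patient_id"):
    """Count rows per individual to reveal single vs. multiple records."""
    return Counter(r[id_field] for r in rows)

missing = missing_fraction(records)
counts = records_per_patient(records)

print(missing)  # fraction missing per variable
print(counts)   # number of records per patient
```

A check like this, run before any analysis, shows whether the file has one row per patient or several, and which variables are too incomplete to rely on.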

Once the data has been acquired and analyzed, Dr. Albertsen stressed that it is important to use the data to tell a meaningful story and to always keep the ultimate goal in mind. Sensitivity analyses are always worth performing because they show how robust the findings are and so strengthen the overall story. It is also critical to make sure that the findings are internally consistent. Intelligent use of figures is always recommended, as readers remember these more easily.
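One simple form of sensitivity analysis, sketched here under assumed data, is to recompute an event rate under different assumptions about missing outcomes. The outcome values are hypothetical, and this bounding approach is only one of many possible sensitivity analyses; it is not presented as the method discussed in the talk.

```python
# Hypothetical outcomes: 1 = event, 0 = no event, None = outcome missing.
outcomes = [1, 0, None, 1, 0, 0, None, 1]

def event_rate(values):
    """Proportion of events among fully observed 0/1 values."""
    return sum(values) / len(values)

# Complete-case analysis: drop individuals with a missing outcome.
complete_case = event_rate([v for v in outcomes if v is not None])

# Worst case: assume every missing outcome was an event.
worst_case = event_rate([1 if v is None else v for v in outcomes])

# Best case: assume no missing outcome was an event.
best_case = event_rate([0 if v is None else v for v in outcomes])

print(complete_case, best_case, worst_case)
```

If the conclusion holds across the best-case and worst-case bounds, missing data is unlikely to be driving the result; if it does not, that limitation belongs in the story being told.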

Dr. Albertsen concluded his talk by summarizing the main advantages and disadvantages of these databases. The advantages include:

  1. Ability to assess outcomes that would be impossible to assess with primary data collection
  2. Large sample sizes can be available in relatively short time frames
  3. Costs are frequently a fraction of the cost of primary data collection
  4. Modern computers can perform remarkably sophisticated analyses

The disadvantages include:

  1. Very difficult to prove causality
  2. Populations and data elements included can often lead to selection biases
  3. Key variables that are needed to control for confounding are often absent
  4. Missing data can compromise generalizability and the assessment of long-term outcomes
  5. Data elements often lack validation

Presented by: Peter Albertsen, MD, University of Connecticut Health Center

Written by: Hanan Goldberg, MD, Urologic Oncology Fellow (SUO), University of Toronto, Princess Margaret Cancer Centre, Twitter: @GoldbergHanan, at the American Urological Association's 2019 Annual Meeting (AUA 2019), May 3 – 6, 2019 in Chicago, Illinois

1. Boffa DJ, Rosen JE, Mallin K, et al. Using the National Cancer Database for Outcomes Research: A Review. JAMA Oncol. 2017 Dec 1;3(12):1722-1728. doi: 10.1001/jamaoncol.2016.6905.