Highlighting the General Importance of Using Public Data Repositories for Hypothesis Generation and How It Changed Cancer Basic Research - Beyond the Abstract

Ever-growing amounts of data are being collected digitally, and data repositories are vast and expanding rapidly. The management, analysis, and storage of scientific data are advancing as well. How should these data be used? What are the benefits of public data collections? Can we transform this resource into information and knowledge that advances our research? Large public data repositories have already been used successfully in astronomy, medical research, and political decision-making.

Every research project begins with a sound scientific hypothesis, so gathering enough evidence to develop one is essential. Bioinformatics analysis of public databases is a new option that can support, or even substitute for, the classical pilot study. Cancer is a major threat to human health and places a heavy burden on patients, families, and health care systems. Modern cancer research focuses on two major issues: (i) understanding the development and progression of cancer, and (ii) developing new treatments. In 2015, United States President Barack Obama highlighted 'precision medicine' as one of the major goals of modern medicine. However, many challenges must be met before this goal comes within reach.

One challenge is the identification of suitable cellular targets to improve cancer diagnosis, surveillance, and treatment. This requires the analysis of large amounts of data, which can only be gathered by large studies that include many patients worldwide. Computational techniques and big data will therefore be essential in the design and delivery of precision medicine.

A new road to conquering cancer

Cancer is one of the most complex diseases, involving dynamic changes and aberrations in gene regulation, protein synthesis, and post-translational modifications. Moreover, each type of cancer can be characterized by a different molecular profile. Integrating data from high-volume databases is therefore essential to obtaining more precise and comprehensive information about cancer and to improving our understanding of it.

The Human Protein Atlas, funded by the Knut and Alice Wallenberg Foundation, was initiated in 2003 and integrates various omics technologies. The atlas provides information on the subcellular localization of proteins in different cells, the distribution of proteins across tissues and organs, and the expression levels of proteins in pathology.1 Several specialized exploratory tools have since been integrated, namely the blood atlas, brain atlas, and metabolic atlas.2 To date, the Human Protein Atlas program has contributed to several thousand publications in the field of human biology and disease.

Started in 2005, The Cancer Genome Atlas (TCGA) is a joint program of the US National Cancer Institute and the National Human Genome Research Institute that aims to catalog the genetic mutations responsible for cancer by using genome sequencing, bioinformatics, and high-throughput genome analysis. To date, over 20,000 primary cancer and matched normal samples spanning 33 cancer types have been characterized.3 Thousands of papers based on data from the TCGA database have contributed significantly to our understanding of cancer.

The Human Metabolome Database is a Canadian project, released in 2007 by Wishart and colleagues as a freely available public resource for metabolomics data and analysis tools and supported by various Canadian health and research organizations.4 Currently, the database contains 114,193 metabolite entries, linked to 5,702 protein sequences and hyperlinked to numerous other databases, e.g., KEGG, PubChem, and UniProt.

By providing huge amounts of data acquired with high-throughput technologies, public data repositories enable basic and clinical researchers to investigate cancer-related differences and changes from the DNA/RNA level to the protein level at scale. Several open-source bioinformatics projects have been initiated to assist in the analysis of those data. R is a free and commonly used software environment for statistical computing and graphics. Python is a well-known programming language that lets you work quickly and integrate systems more effectively. In relation to the public data repositories, computational projects such as R, UALCAN, the UCSC Xena Project, and Gene Expression Profiling Interactive Analysis (GEPIA) facilitate tumor subgroup analysis, survival analysis, and standardized visualization of analysis results.
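To make the idea of survival analysis concrete, the following is a minimal sketch of a Kaplan-Meier estimator in plain Python. It is not the implementation used by UALCAN, UCSC Xena, or GEPIA; the function name `kaplan_meier`, the two expression groups, and all follow-up values are made up for illustration only.

```python
from collections import Counter


def kaplan_meier(times, events):
    """Return [(time, survival_probability)] at each distinct event time.

    times  -- follow-up time per patient (here: months)
    events -- 1 if the event (e.g. death) was observed, 0 if censored
    """
    if len(times) != len(events):
        raise ValueError("times and events must have the same length")

    deaths = Counter(t for t, e in zip(times, events) if e)
    leaving = Counter(times)  # every patient leaves the risk set at their own time
    at_risk = len(times)
    survival = 1.0
    curve = []

    for t in sorted(leaving):
        d = deaths.get(t, 0)
        if d:
            # Product-limit step: multiply by the fraction surviving this time point.
            survival *= 1.0 - d / at_risk
            curve.append((t, survival))
        # Patients with an event or censored at t are no longer at risk afterwards.
        at_risk -= leaving[t]
    return curve


# Hypothetical toy cohorts, split by expression of some gene of interest.
high_expr = kaplan_meier([5, 8, 12, 12, 20, 24], [1, 1, 1, 0, 1, 0])
low_expr = kaplan_meier([10, 15, 22, 30, 30, 36], [1, 0, 1, 0, 0, 0])

for label, curve in (("high", high_expr), ("low", low_expr)):
    print(label, [(t, round(s, 3)) for t, s in curve])
```

In a real analysis the two curves would be compared with a statistical test such as the log-rank test and computed on patient data from a repository, which the web tools named above do behind a graphical interface.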

Further integrative analyses aim to identify cancer-related pathways, thus enabling the discovery of novel therapeutic targets. STRING is a public database that focuses on predicting protein-protein interactions. KEGG is a database resource for understanding high-level functions and utilities of the biological system. Both databases predict cancer-related pathways by using statistical scoring and annotation algorithms, providing reproducible data analysis through the standardization of computational workflows. The number of publications based on bioinformatics analysis has exploded in recent years. A PubMed search for "bioinformatics analysis" AND "cancer" retrieved more than 20,000 related publications from the past five years.
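A literature count of this kind can also be requested programmatically. The sketch below builds a query for the NCBI E-utilities `esearch` endpoint using its `db`, `term`, `datetype`, `reldate`, `rettype`, and `retmode` parameters. The helper names `pubmed_count_url` and `fetch_count`, the approximation of five years as 5 × 365 days, and the assumed shape of the JSON response are illustrative assumptions, and the returned count will differ from the figure quoted above depending on when the query is run.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"
QUERY = '"bioinformatics analysis" AND "cancer"'


def pubmed_count_url(term, years=5):
    """Build an esearch URL asking only for the number of matching records."""
    params = {
        "db": "pubmed",
        "term": term,
        "datetype": "pdat",      # filter on publication date
        "reldate": years * 365,  # look back this many days (approximation)
        "rettype": "count",
        "retmode": "json",
    }
    return f"{ESEARCH}?{urlencode(params)}"


def fetch_count(url):
    """Perform the request (needs network access) and return the hit count.

    Assumes the JSON reply carries the count under esearchresult -> count.
    """
    with urlopen(url) as resp:
        return int(json.load(resp)["esearchresult"]["count"])


url = pubmed_count_url(QUERY)
print(url)
# print(fetch_count(url))  # uncomment to query PubMed
```

Keeping the URL construction separate from the network call makes the query easy to inspect and log, which helps when a search needs to be documented and repeated later.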

Public data repositories and innovative analysis technologies are helping us to understand the potential core molecular pathways involved in carcinogenesis, progression, and recurrence of cancer. Furthermore, the results may provide answers to clinical questions and may promote new cancer therapies, diagnostic methods, and preventive strategies. Bioinformatics analysis and public datasets not only boost research but also help to find new roads for cancer treatment and prevention.

In these comments, we highlighted the general importance of using public data repositories for hypothesis generation and how it has changed basic cancer research. Nevertheless, in silico research has its limits. To test a hypothesis generated by bioinformatics analysis, detailed in vitro and in vivo studies are needed to validate the biological functions in cancer. In the future, artificial intelligence (AI) and machine learning will be significant catalysts for dealing with these large data sets, providing more reliable and more rational insights into cancer biology.


Written by: Mandy Berndt-Paetz, PhD, and Jochen Neuhaus, PhD, Department of Urology, University of Leipzig, Leipzig, Germany; Chuan Zhang, MD, Department of Urology, University of Leipzig, Leipzig, Germany, and Department of Urology, Chengdu Fifth People’s Hospital, Chengdu University of T.C.M., Chengdu, China

References:

  1. Uhlén, Mathias, Erik Björling, Charlotta Agaton, Cristina Al-Khalili Szigyarto, Bahram Amini, Elisabet Andersen, Ann-Catrin Andersson et al. "A human protein atlas for normal and cancer tissues based on antibody proteomics." Molecular & cellular proteomics 4, no. 12 (2005): 1920-1932.
  2. Thul, Peter J., Lovisa Åkesson, Mikaela Wiking, Diana Mahdessian, Aikaterini Geladaki, Hammou Ait Blal, Tove Alm et al. "A subcellular map of the human proteome." Science 356, no. 6340 (2017): eaal3321.
  3. Hoadley, Katherine A., Christina Yau, Toshinori Hinoue, Denise M. Wolf, Alexander J. Lazar, Esther Drill, Ronglai Shen et al. "Cell-of-origin patterns dominate the molecular classification of 10,000 tumors from 33 types of cancer." Cell 173, no. 2 (2018): 291-304.
  4. Wishart, David S., Yannick Djoumbou Feunang, Ana Marcu, An Chi Guo, Kevin Liang, Rosa Vázquez-Fresno, Tanvir Sajed et al. "HMDB 4.0: the human metabolome database for 2018." Nucleic acids research 46, no. D1 (2018): D608-D617.