Open Access Journal of Biostatistics and Biometrics

A Study of Statistical and Machine Learning Methods for Cancer Classification Using Cross-Species Genomic Data

*Cuilan Gao
Department of Biostatistics, University of Tennessee at Chattanooga, Chattanooga, United States

*Corresponding Author:
Cuilan Gao
Department of Biostatistics, University of Tennessee at Chattanooga, Chattanooga, United States

Published on: 2018-10-22


Gene expression profiling of an animal model of a disease gives pre-clinical insight into the potential efficacy of novel treatments and drugs. Selecting an animal model that accurately resembles the human disease substantially reduces research cost in resources and time. In this paper, we introduce and compare three methods for classifying sub-types of cancer using cross-species genomic data. A statistical procedure based on analysis of variance (ANOVA) of the similarity of gene expression between human and animal is used to select the animal model that most accurately mimics the human disease. Two other commonly used methods, logistic regression and artificial neural networks, are also examined on the same data sets. The implementation procedure for each algorithm is discussed. The computational cost, advantages, and drawbacks of each algorithm are examined for classification of simulated data and a real example of medulloblastoma (a type of brain cancer).


ANOVA; Logistic Regression; Artificial Neural Networks; Classification; Gene Expression Data; Cancer Sub-type


Current cancer classification includes more than 200 types of cancer. For a patient to receive appropriate therapy, the clinician must identify the cancer type as accurately as possible. Unlike many cancers in adults, childhood cancers are not strongly linked to lifestyle or environmental risk factors. In recent years, scientists have made great progress in understanding how certain changes in DNA can cause cells to become cancerous. Some genes (parts of our DNA) help cells grow, divide, or stay alive, while others slow down cell division or cause cells to die at the right time. Cancers can be caused by DNA changes that switch cell functions on or off. Different types of childhood cancer may be caused by changes in different genes. Molecular diagnostic methods are therefore needed to classify cancer types appropriately. Classical molecular methods look for the DNA, RNA, or protein of a defined marker that is correlated with a specific type of cancer and that may or may not give biological information about cancer generation or progression. Gene expression data [1] from microarrays or next-generation sequencing (NGS) have emerged as an efficient tool for cancer classification, as well as for diagnosis, prognosis, and treatment purposes [2-6].