Open Access Journal of Biostatistics and Biometrics

A Study of Statistical and Machine Learning Methods for Cancer Classification Using Cross-Species Genomic Data

Published on: 2018-10-22


Gene expression profiling of an animal model of a disease gives preclinical insight into the potential efficacy of novel treatments and drugs. Selecting an animal model that accurately resembles the human disease substantially reduces the cost of research in both resources and time. In this paper, we introduce and compare three methods for classifying cancer sub-types using cross-species genomic data. A statistical procedure based on analysis of variance (ANOVA) of the similarity of gene expression between human and animal is used to select the animal model that most accurately mimics the human disease. Two other commonly used methods, logistic regression and artificial neural networks, are also examined on the same data sets. The implementation procedure of each algorithm is discussed. The computational cost, advantages, and drawbacks of each algorithm are examined for the classification of simulated data and a real example of medulloblastoma (a type of brain cancer).


Keywords: ANOVA; Logistic Regression; Artificial Neural Networks; Classification; Gene Expression Data; Cancer Sub-type