A Study of Statistical and Machine Learning Methods for Cancer Classification Using Cross-Species Genomic Data
Corresponding author: Dr. Cuilan Gao, The University of Tennessee at Chattanooga; Email: firstname.lastname@example.org
Current cancer classification includes more than 200 types of cancer. For the patient to receive appropriate therapy, the clinician must identify the cancer type as accurately as possible. Unlike many cancers in adults, childhood cancers are not strongly linked to lifestyle or environmental risk factors. In recent years, scientists have made great progress in understanding how certain changes in DNA can cause cells to become cancerous. Some genes (parts of our DNA) help cells grow, divide or stay alive, while others slow down cell division or cause cells to die at the right time. Cancers can be caused by DNA changes that turn on or turn off functions of cells. Different types of cancer in childhood may be caused by different types of gene changes. To appropriately classify cancer types, therefore, molecular diagnostic methods are needed. The classical molecular methods look for the DNA, RNA or protein of a defined marker that is correlated with a specific type of cancer and may or may not give biological information about cancer generation or progression. Gene expression data by microarray or next-generation sequencing (NGS) has emerged as an efficient technique for cancer classification, as well as for diagnosis, prognosis, and treatment purposes [2-6].
Gene expression profiling measures the expression levels of thousands of genes simultaneously. Most expression profiling studies focus on comparing gene expression among biological conditions within the same species. For example, an experiment may compare the transcriptomes of tumors and normal tissue, or of tumors arising in the same tissue. However, little work has been done on comparing transcriptome data generated from different species. Johnson et al. used a method called AGDEX to validate a novel mouse model of a human brain tumor. A complete demonstration of AGDEX can be found in the paper by Pounds et al. Additional work on cross-species genomic expression patterns can be found in the literature. The rationale behind cross-species comparison is that animals and humans share the majority of patterns of regulation across orthologous genes. Therefore animal gene expression profiling can be used to build a predictive model that may be used in the analysis of human diseases.
Statistical methods for cross-species gene expression analysis could be used for diagnosis of human disease based on the similarity of the genomic profiles of human and animal data. We have proposed a method to find the most accurate animal model of a certain human disease. This method is based on analysis of variance (ANOVA) of similarities between gene expression of human samples and gene expression of animal samples from a set of animal models, to identify the animal model that best models a certain type of cancer. This scheme, which will be analyzed in this paper, defines and computes a chosen metric of similarity between each human sample and each animal sample from each animal model, resulting in multiple groups of similarities. Then a random block ANOVA model is used to compare the group means of similarities among the different animal models. Finally, post hoc multiple comparisons are applied to seek the best animal model of the human disease.
Logistic regression is a common tool in supervised classification problems in both statistics and machine learning. Mount et al. applied this approach to identify gene expression predictive of early death versus long survival in early-stage disease. Beane et al. used LR for lung cancer diagnosis, integrating genomic and clinical features. Stephenson et al. also integrated gene expression and clinical data using logistic regression modeling to predict prostate carcinoma recurrence after radical prostatectomy. One goal of many cancer genome projects is to discover cancer-related genes and sub-types. Zhou et al. proposed a Bayesian approach to gene selection and classification using the logistic regression model. Logistic regression is often used for two-class classification; if input samples fall into more than two categories, multi-class classification is required. Test samples are assigned to the most probable category. An algorithm for multi-class classification of human cancer types based on a trained logistic regression model is illustrated in this paper.
Artificial neural networks (ANN) are computer-based algorithms which are modeled on the structure and behavior of neurons in the human brain and can be trained to recognize and categorize complex patterns. Pattern recognition is achieved by adjusting the parameters of the ANN through a process of error minimization, learning from experience. They can be calibrated using any type of input data, such as gene-expression levels generated by cDNA microarrays or next-generation sequencing technology, and the output can be grouped into any given number of categories. Implementation of artificial neural networks for classification and diagnostic prediction of cancers using gene expression profiling is addressed in [2-16], and an example of this implementation in lung cancer prediction can be found in the literature.
In this paper, gene expression profiles of animal models are used to predict the human cancer type. A mapping procedure between animal and human data is performed to match the orthologous genes between human and animal. Furthermore, human gene expression data are deployed for cross-validation and for testing the trained algorithms. The advantages and disadvantages of the proposed methods, the ANOVA test, logistic regression and the artificial neural networks algorithm, are observed and compared on artificial data and on a real example of pediatric medulloblastoma.
Description of Genomic Data
In microarray experiments, thousands of DNA sequences are aligned in probes and exhibited in a high-density array positioned on a microscope slide. The relative expression of each probe (gene) is measured by comparing the intensities of mRNA from tumor and reference tissue, see Figure 1. Thus the gene expression levels of thousands of genes in each sample are measured simultaneously. Other sources of genomic data, such as next-generation sequencing, are obtained by using technologies to amplify and compare the expression of target DNA sequences with a reference (digital count data).
We generally illustrate the microarray-based human and animal gene expression data as follows: gei,j represents the expression level of the jth gene of the ith animal or human sample, so that each data set is a matrix whose columns are samples and whose rows are genes. We assume the ith animal model has a samples, each including n genes.

Figure 1. Gene expression matrix from multiple microarray experiments. The expression matrix is a representation of data from multiple microarray experiments. Each element is a log ratio of the gene expression level of the testing sample (T) and the gene expression level of the reference sample (R).

Each animal model corresponds to a particular subtype of disease or cancer. Several samples may belong to the same model.

Animal models play a pivotal role in translational biomedical research. The scientific value of an animal model depends on how accurately it mimics the human disease. In principle, microarrays collect the necessary data to evaluate the transcriptomic fidelity of an animal model in terms of the similarity of its gene expression levels (from the microarray gene expression profile) to the human disease. We assess this type of similarity by applying different similarity metrics to each pair consisting of the gene expression vector of a human sample and the gene expression vector of an animal sample from a certain animal model. Thus we end up with a set of similarity measurements.

The ANOVA method is then used to analyze whether the similarity is associated with the type of animal model and the human sample labels. In the ANOVA model described in the following equation, the type of animal model (model label) is considered a fixed effect on similarity, and the human sample is considered a random effect on similarity, since the human samples are randomly selected from a large population of medulloblastoma patients:

Yim = η + αm + εim (3)

where η is the grand mean, αm is the effect of the mth model, which is the similarity of the mth animal model to the human samples, and εim is the associated error. The null hypothesis H0 is that there is no statistically significant difference in similarities among the animal models, i.e. α1 = α2 = … = αM, where αi is the effect of the ith animal model. To test this hypothesis via ANOVA of similarity, we define four different similarity metrics as follows.

- Metrics of similarity

A similarity metric is a measurement of how similar the gene expression levels of a human sample and an animal sample are. There are many ways to measure this kind of similarity; we will use four different metrics.

- Semi-correlation: One way to show the correlation between two matrices is to find the correlation between the columns of the two matrices. Experience shows that this method may result in non-normally distributed data. To remedy this deficiency, a new metric is defined based on the correlation coefficient between the ith human sample and the samples of the mth animal model. To define Yim in (3), the column similarity is defined as

xi,j,m = Σ (hi − h̄i) × (am,j − ām,j)T (4)

where in (4) am,j is the jth sample of the mth animal model, hi is the ith human sample, the bar denotes the mean of a variable and T denotes the transpose. Yim is the vector of all xi,j,m in the mth animal model.

- Cosine: The second metric, like the first, defines a similarity between human and animal samples, which are columns of the human data and of each animal model. This metric has also been used in earlier work and is the cosine of the angle between the two vectors:

xi,j,m = (hi × am,jT) / (√(Σ hi²) × √(Σ am,j²)) (5)

The definition of the parameters in (5) matches their counterparts in (4).

- Euclidean distance: Euclidean distance is a simple yet powerful way to measure the similarity between two vectors. The Euclidean distance between the vectors hi and am,j is defined as

xi,j,m = √( Σ (hi − am,j)² ) (6)

In other words, the Euclidean distance is the square root of the sum of squared differences between corresponding elements of the two vectors hi and am,j.

- Pearson's correlation coefficient: Pearson's correlation coefficient is calculated by dividing the semi-covariance (4) by the product of the standard deviations of the gene expression of a human sample and the gene expression of an animal sample. The formula for Pearson's correlation coefficient is omitted here as it is a simple and standard statistical concept. The advantage of Pearson's correlation coefficient over the Euclidean distance is that it is more robust against data that are not normalized.
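As an illustration, the four metrics can be computed for one pair of expression vectors as follows (a minimal Python/NumPy sketch; the function name and toy vectors are ours, not part of the original analysis):

```python
import numpy as np

def similarity_metrics(h, a):
    """Four similarities between a human sample h and an animal sample a,
    both vectors of expression levels over the same n orthologous genes."""
    hc, ac = h - h.mean(), a - a.mean()                   # centered vectors
    semi_corr = hc @ ac                                   # semi-correlation, eq. (4)
    cosine = (h @ a) / (np.sqrt(h @ h) * np.sqrt(a @ a))  # cosine, eq. (5)
    euclid = np.sqrt(((h - a) ** 2).sum())                # Euclidean distance, eq. (6)
    pearson = (hc @ ac) / (np.sqrt(hc @ hc) * np.sqrt(ac @ ac))  # Pearson
    return semi_corr, cosine, euclid, pearson
```

Note that for the semi-correlation, cosine and Pearson metrics a larger value means more similar, while for the Euclidean distance a smaller value does.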
- Post hoc analysis

A simple way to look at the difference between animal models is to check the group means of similarities. Typically, a larger mean value indicates greater similarity between the animal model and the human disease or cancer. The significance of the differences between animal models can be evaluated by the mixed ANOVA method. Finally, a multiple comparisons procedure (Tukey's test or Hsu's multiple comparisons with the best) is conducted to identify the animal model most similar to the human disease or cancer.
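To make the group comparison concrete, the one-way F statistic underlying it can be computed directly. This is a simplified sketch that treats the similarity groups as independent and omits the random human-sample block effect of the full mixed model:

```python
import numpy as np

def one_way_anova_f(groups):
    """One-way ANOVA F statistic for k groups of similarity values.
    groups: list of 1-D numpy arrays, one array per animal model."""
    all_vals = np.concatenate(groups)
    grand = all_vals.mean()
    k, n = len(groups), all_vals.size
    # between-group and within-group sums of squares
    ss_between = sum(len(g) * (g.mean() - grand) ** 2 for g in groups)
    ss_within = sum(((g - g.mean()) ** 2).sum() for g in groups)
    return (ss_between / (k - 1)) / (ss_within / (n - k))
```

A large F statistic together with a clearly higher group mean points at a candidate best model; in practice a mixed-model routine followed by Tukey's procedure would be used instead.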
Logistic Regression Algorithm
Logistic regression is a standard method for building prediction models for a binary outcome and has been extended for disease classification with microarray data by many authors [19-23]. The idea of logistic regression is to use linear regression to predict probabilities of class labels. Mathematically, the goal is to minimize the regularized logistic regression cost function C(Φ) defined in (7):

C(Φ) = −(1/m) Σi [ yi log(hΦ(xi)) + (1 − yi) log(1 − hΦ(xi)) ] + (ω/2m) Σ Φ² (7)

where Φ is a matrix of weight vectors and xi is the feature vector of each sample; yi is a binary label, either 0 (non-existence) or 1 (existence of a criterion) for each sample; ω is the regularization factor used to prevent over-fitting the algorithm; and hΦ(x) is the Sigmoid function defined in (8):

hΦ(x) = 1 / (1 + e^(−Φ·x)) (8)

In equation (8), Φ·x denotes the dot product of the weight vector Φ and the feature vector x, and e is the exponential function. The gradient descent method is typically utilized to minimize the cost function (7). To update the weight vectors in each step, the derivative of the cost function with respect to the weight vectors is computed:

Φ := Φ − β ∂C(Φ)/∂Φ (9)

where β is the step size of the gradient descent algorithm. It can be shown that the derivative of the cost function with respect to the weight vectors is

∂C(Φ)/∂Φ = (1/m) Σi (hΦ(xi) − yi) xi + (ω/m) Φ (10)

Simultaneously updating the weight vectors results in the reduction of the cost function.

Artificial Neural Networks Algorithm

Artificial neural networks are a method of training an algorithm for classification of binary input data. For example, in the case of a four-class classification problem, the binary classification vectors can be defined as

y(1) = (1 0 0 0)T, y(2) = (0 1 0 0)T, … (11)

In this paper, animal samples are used to build an artificial neural network, connecting each input sample to a type of cancer. The artificial neural networks algorithm is typically a two-step perceptron, with forward and back propagation steps. In practice, these networks consist of L layers, two of which are the input and output layers while the remaining layers are hidden layers. Each layer has sl units, a number that can vary between layers, and, as defined in (11), each neural network typically has K ≥ 3 output units. In forward propagation, the perceptron is obtained by applying the Sigmoid (logistic) function defined in (8) in the different layers of the neural network. The back propagation step is used to calculate the error and consequently optimize the weight vectors. By having multiple nodes in each layer, K separable problems can be classified. Figure 2 shows a neural network with three hidden layers; in each input and hidden layer one bias unit is added. Each unit is connected to every unit of the next layer except its bias unit.

Figure 2. An example of a neural network with three hidden layers designed for four-class classification.

- Training the algorithm

In the neural networks algorithm, the activation vector a(l) for the lth network layer is defined as

a(l+1) = hΦ(Φ(l) a(l)) (12)

where hΦ is the Sigmoid function and a(1) = x is the input layer. By this definition, a(L) is the activation of the last layer, i.e. the response of the algorithm at the output layer. The aim of the algorithm is to obtain the weight vectors Φ(l) of each layer for better prediction of the output layer. The gradient descent method is used for solving this nonlinear problem.

So, training the neural networks algorithm consists of the following steps:

- Random initialization of the weight vectors for each network layer.
- Minimizing the cost function by updating the weight vector of each layer until reaching the predefined criteria:
- Forward propagation to compute the cost function.
- Back propagation to compute the derivative of the cost function with respect to the weight vector in each layer.
- Updating the weight vectors.

In the above algorithm, the cost function is defined as

C(Φ) = −(1/m) Σi Σk [ yk(i) log((hΦ(x(i)))k) + (1 − yk(i)) log(1 − (hΦ(x(i)))k) ] + (ω/2m) Σl Σj Σi (Φji(l))² (13)

where m is the number of samples, hΦ is the Sigmoid function and ω is the regularization factor. As discussed, the weight vectors should be updated by calculating the derivative of the cost function with respect to the weight vectors. The back propagation process can be summarized as follows:

- Perform forward propagation to compute the activations a(l) for each layer.
- Calculate δ(L) = a(L) − y(i) for the output layer.
- Compute δ(L−1), δ(L−2), …, δ(2) by propagating the error backward through the layers.
- It can be shown that, ignoring the regularization term, the derivative is

∂C(Φ)/∂Φ(l) = δ(l+1) (a(l))T (14)

- Cross-validation and testing the network

There are some factors in the neural networks algorithm that must be decided during network architecture selection. One is the number of hidden layers; the second is the number of units in the hidden layers, supposing that all hidden layers have the same number of units; and the third is the regularization factor. Besides, the accuracy of the trained network should be examined. For these reasons, human samples are randomly divided into two groups, a cross-validation set and a test set. For the real data example discussed in section (VII-B), the cross-validation and test groups include 40 and 66 samples respectively. The algorithm with different numbers of hidden layers and units is examined on the cross-validation samples, and the most cost-efficient accurate network is selected and evaluated on the test samples.

Feature selection is the process of selecting a subset of features, or in this context genes, for training the classification algorithms. Genomic data typically have a much smaller number of examples than features (genes), which makes the algorithms prone to over-fit the data, i.e. to include the random noise of the training data in the model. To mitigate this over-fitting, feature selection is essential. Feature selection also often increases classification accuracy by reducing the noise pollution that might have infiltrated the data. Therefore, a t-test algorithm is implemented for dominant feature selection.

To compare the performance of the statistical and machine learning algorithms, two examples are examined here. In the first example, a set of simulated gene expression data for human and animal is generated and the processes described in the previous sections are applied to it. In the second example, the algorithms are implemented on pediatric medulloblastoma data.

- Simulated data

In the simulated data, we generated h human samples and k animal models, each model with a samples. Each sample contains n genes. These data, resembling gene expression levels, are generated as random normal data N(0, 1). First, the gene expression levels of the first animal model are generated from the human data as

A1 = ρH + E (15)

where A1 is the gene expression matrix of the first animal model, H is the gene expression matrix of the human samples, ρ is a scaling factor between 0 and 1 (0 < ρ < 1) and E is a random error generated from N(0, 1). An arbitrary number of animal models can be generated with the same number of genes; however, for illustration purposes, the number of models is restricted to three and specified in Table I.
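The simulation scheme above can be sketched as follows. The sizes and the helper `mean_corr` are ours, chosen only to check that the first model comes out most similar; for simplicity each animal sample is paired with one human sample:

```python
import numpy as np

rng = np.random.default_rng(0)
n_genes, n_samples = 200, 5
rho = 0.8                                     # scaling factor, 0 < rho < 1

H = rng.standard_normal((n_samples, n_genes))             # human data, N(0, 1)
A1 = rho * H + rng.standard_normal((n_samples, n_genes))  # model 1 mimics H
A2 = rng.standard_normal((n_samples, n_genes))            # models 2 and 3
A3 = rng.standard_normal((n_samples, n_genes))            # are pure noise

def mean_corr(H, A):
    """Mean Pearson correlation over paired human/animal samples."""
    return float(np.mean([np.corrcoef(h, a)[0, 1] for h, a in zip(H, A)]))
```

With these settings the mean correlation of model 1 with the human data is well above that of the noise-only models, mirroring the ANOVA result reported below.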
- ANOVA results of the simulated data: For all of these examples, the assumptions of ANOVA, i.e. normality and homogeneity of variance, are tested by the Shapiro test and the Bartlett test, resulting in satisfaction of the ANOVA assumptions. Group means of the proposed examples are tabulated in Tables II, III and IV, in which S-C and Cos represent the results of the semi-correlation and cosine similarity schemes. These tables show that the mean similarity of the first model is significantly higher than the means of the other two models. F tests of ANOVA show that the similarities (by all metrics) between these three models and the human samples are significantly different (p < 0.0001). Finally, the results of the Tukey test for the sample in Table V show that model1 is significantly different from model2 and model3; thus we conclude that model1 is the most similar model to these human samples.
- Logistic regression and artificial neural networks results of the artificial data: After training the LR and ANN algorithms, each human sample is multiplied by the weight vector of each class and the most probable class is regarded as the type of cancer. For solving this problem with the neural networks scheme, the regularization factor and step size are defined as

ω = 0, β = 0.1 (16)

Table VI shows the percentage accuracy of the logistic regression and of the artificial neural networks with one hidden layer and with 50, 100 and 150 units. The accuracy in Table VI shows that these algorithms are capable of solving problems with a small sample size and a large number of features, provided feature selection is done.
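The one-vs-all training and prediction rule used above can be sketched as follows (a toy Python/NumPy version with hypothetical data; real use would add a bias term and feature selection first):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_one_vs_all(X, y, n_classes, beta=0.1, omega=0.0, n_iter=2000):
    """One-vs-all logistic regression by gradient descent.
    X: (m, n) feature matrix; y: integer labels in 0..n_classes-1;
    beta is the step size and omega the regularization factor, as in the text."""
    m, n = X.shape
    Phi = np.zeros((n_classes, n))
    for k in range(n_classes):
        yk = (y == k).astype(float)          # binary target for class k
        for _ in range(n_iter):
            grad = X.T @ (sigmoid(X @ Phi[k]) - yk) / m + omega * Phi[k] / m
            Phi[k] -= beta * grad            # gradient descent update
    return Phi

def predict(Phi, X):
    """Assign each sample to its most probable class."""
    return np.argmax(sigmoid(X @ Phi.T), axis=1)
```

Each class gets its own weight vector; a test sample is multiplied by every weight vector and assigned to the class with the largest probability, as described above.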
- Real data
We use a real example to examine the performance of the three methods described in the previous sections. The details of the data used for this analysis are described in sections (VII-C) and (VII-D).
- Human and animal data
In our previously published paper in the journal Cancer Cell, four different mouse models were generated to mimic subtypes of medulloblastoma (a type of brain cancer). However, that paper could not answer the questions of which animal model is the most accurate given a set of mouse models, and which subtype of medulloblastoma each human sample belongs to. In this paper, we use the same data to develop methods that classify human cancer types using cross-species genomic data. The animal data are mouse gene expression measured with 430V2 and HT430PM chips, available in the NCBI database under accession numbers GSE33199 and GSE33200, respectively. The mouse data consist of four subtypes of medulloblastoma: 5 samples of normal, 5 samples of stem, 5 samples of prog and 5 samples of ptch (one sample was damaged in the experiment); each sample contains 45101 probe sets. The human gene expression data are the same as described in , consisting of 106 samples and 54675 probe sets.
- Mapping cross-species genome data
A mapping procedure is introduced to find the orthologous genes between the human data and the mouse data. For this purpose, the Affymetrix best-match data set available at www.affymetrix.com is utilized to define around 75000 pairs of ortholog-matched genes (probe sets). Furthermore, a filtering procedure is utilized to eliminate the repeated features (genes) in the human and animal samples.
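The mapping and filtering step can be sketched in a few lines of pure Python. Here the pair list stands in for the Affymetrix best-match table, and all names are hypothetical:

```python
def map_orthologs(best_match_pairs, human_expr, mouse_expr):
    """Align human and mouse expression values on ortholog-matched probe
    sets; repeated probe sets on either side are kept only once."""
    seen_h, seen_m = set(), set()
    human_aligned, mouse_aligned = [], []
    for h_probe, m_probe in best_match_pairs:
        if h_probe in seen_h or m_probe in seen_m:
            continue                      # filter repeated features
        if h_probe in human_expr and m_probe in mouse_expr:
            seen_h.add(h_probe)
            seen_m.add(m_probe)
            human_aligned.append(human_expr[h_probe])
            mouse_aligned.append(mouse_expr[m_probe])
    return human_aligned, mouse_aligned
```

The two returned lists are index-aligned, so position j in each holds one ortholog-matched pair of expression values ready for the similarity metrics above.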
- ANOVA results of the real data: By ANOVA of the four different similarity metrics, all metrics except the Euclidean distance show that ptch medulloblastoma of the mouse is significantly closer to the human samples (p < 0.0001, see Table VII). By Euclidean distance, ptch is not quite the closest model, but its distance is still significantly (p < 0.0001) lower than those of norm and prog, which may be because the Euclidean distance is less robust to non-normal data.
- Logistic regression results of the real data: The algorithm for multi-class logistic regression is described above. After training the algorithm, each human sample is multiplied by the weight vector of each class and the most probable class is taken as the type of cancer. By doing so, this algorithm is able to find the type of cancer correctly in all of the human cases.
- Artificial neural networks results of the real data: For solving this problem with the artificial neural networks scheme, the regularization factor and step size are defined as

ω = 0, β = 0.1 (17)

Updating of the weight vectors is repeated until the cost function is reduced by three orders of magnitude. Table VIII shows the results of the neural networks with one and two hidden layers, each with 50, 100 and 150 units. Accuracy in Table VIII is defined as the percentage of correct results in the cross-validation and test sets. For this problem, one hidden layer with 150 units suffices for the efficiency and accuracy of the neural network.
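A single forward/back-propagation update of the kind used here can be sketched as follows (one hidden layer, one sample, bias units omitted for brevity; the function and sizes are illustrative, not the exact network of Table VIII):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def ann_step(x, y, Phi1, Phi2, beta=0.1):
    """One forward/back-propagation update for a single sample,
    modifying Phi1 and Phi2 in place. Returns the network output."""
    # forward propagation
    a1 = x                          # input layer, a(1)
    a2 = sigmoid(Phi1 @ a1)         # hidden-layer activations
    a3 = sigmoid(Phi2 @ a2)         # output layer, K units
    # back propagation
    d3 = a3 - y                                  # output-layer error
    d2 = (Phi2.T @ d3) * a2 * (1.0 - a2)         # hidden-layer error
    Phi2 -= beta * np.outer(d3, a2)              # gradient descent updates
    Phi1 -= beta * np.outer(d2, a1)
    return a3
```

Repeating this update drives the output toward the binary class vector y, which is exactly the stopping criterion (cost reduction) used above.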
Comparison of the Algorithms
In the proposed statistical scheme, post hoc analysis is inevitable: the statistical analysis is not complete until a multi-step procedure is followed. In the machine learning algorithms, meanwhile, there is a need for parameter selection, i.e. choosing the number of hidden layers, the feature selection, and the regularization factor, to avoid over- or under-fitting the data. Since genomic data typically have a small number of samples and a large number of features (genes), the algorithms are more prone to over-fit the data; for this reason, in the provided examples the regularization factor is chosen as zero. After these selections are done, the multi-class classification is complete once the algorithm is trained.

The ANOVA-based method requires the assumptions of normality of the data and homogeneity of variance, whereas the machine learning algorithms do not necessitate these requirements.

The ANOVA scheme can only assign one type of cancer to the human model, while in the LR and ANN algorithms human samples can be categorized into different types of cancer.

In contrast, the ANOVA scheme is cost-efficient and does not require minimization of cost functions, which reduces the computational complexity.

In both the LR and ANN methods, feature filtering and selection are needed to avoid data noise and to find the global minimum. Finding the global minimum requires optimization software, while the procedure in the ANOVA scheme is straightforward. Moreover, avoiding local minima and finding the global minimum can be challenging when the ratio of the number of features to the number of training samples is high.

Even though logistic regression is easy to implement, multi-class classification is costly if there are more than two classes. The ANN algorithm is the most expensive and the most difficult algorithm to implement, but because its nature allows increasing the number of hidden layers and hidden nodes, it tends to perform better than the other two schemes.
Three different methods for classification of cancer types via analysis of cross-species gene expression data, the proposed ANOVA-based analysis of similarity, logistic regression and artificial neural networks, are investigated in this paper. The implementation procedure of each method is described, and the benefits and drawbacks of the methods are examined and discussed with artificial data and a real example of a pediatric brain tumor. Among these three methods, the proposed ANOVA-based method yields a comparable result with less computational complexity. ANN is a powerful tool in the domain of data analysis. In practice, samples of human tumors are very limited, as cancers are rare diseases, which limits the power of parametric statistical methods. The power of ANN implementations has been enhanced here by training on the animal data to obtain the predictive model. As next-generation genomic data become cheaper and more available than they are today, these classification methods via analytics of genomic data will continue to play an important role in diagnostic, prognostic and predictive software applications in the field of cancer genomics.
The authors declare that they have no competing interests.
- Zheng CH, Huang DS, Kong XZ, Zhao XM. Gene Expression Data Classification Using Consensus Independent Component Analysis. Genomics Proteomics Bioinformatics. 2008, 6(2): 74-82.
- Khan J, Wei JS, Ringner M, Saal LH, Ladanyi M et al. Classification and diagnostic prediction of cancers using gene expression profiling and artificial neural networks. Nature Medicine. 2001, 7: 673-679.
- Khan J, Simon R, Bittner M, Chen Y, Leighton SB et al. Gene expression profiling of alveolar rhabdomyosarcoma with cDNA microarrays. Cancer Res. 1998, 58(22): 5009-5013.
- Golub TR, Slonim DK, Tamayo P, Huard C, Gaasenbeek M et al. Molecular classification of cancer: class discovery and class prediction by gene expression monitoring. Science. 1999, 286(5439): 531-537.
- Alizadeh AA, Eisen MB, Davis RE, Ma C, Lossos IS et al. Distinct types of diffuse large B-cell lymphoma identified by gene expression profiling. Nature. 2000, 403(6769): 503-511.
- Bishop CM. Neural Networks for Pattern Recognition. Clarendon Press, Oxford. 1995.
- Ji W, Zhou W, Gregg K, Yu N, Davis S et al. A method for cross-species gene expression analysis with high-density oligonucleotide arrays. Nucleic Acids Research. 2004, 32(11): e93.
- Johnson RA, Wright KD, Poppleton H, Mohankumar KM, Finkelstein D et al. Cross-species genomics matches driver mutations and cell compartments to model ependymoma. Nature. 2010, 466(7306): 632-636.
- Pounds S, Gao CL, Johnson RA, Wright KD, Poppleton H et al. A procedure to statistically evaluate the agreement of differential expression for cross-species genomics. Bioinformatics. 2011, 27(15): 2098-2103.
- Shamsaei B, Gao C. On the Evaluation of the Most Accurate Pediatric Medulloblastoma Animal Model. JSM Proceedings, Biometrics Section, Seattle, Washington, American Statistical Association. 2015, 3098-3106.
- Mount DW, Putnam CW, Centouri SM, Manziello AM, Pandey R et al. Using logistic regression to improve the prognostic value of microarray gene expression data sets: application to early-stage squamous cell carcinoma of the lung and triple negative breast carcinoma. BMC Med Genomics. 2014, 7: 33-39.
- Beane J, Sebastiani P, Steiling K, Dumas YM, Lenburg ME et al. A prediction model for lung cancer diagnosis that integrates genomic and clinical features. Cancer Prev Res. 2008, 1(1): 56-64.
- Stephenson AJ, Smith A, Kattan MW, Satagopan J, Reuter VE et al. Integration of gene expression profiling and clinical variables to predict prostate carcinoma recurrence after radical prostatectomy. Cancer. 2005, 104(2): 290-298.
- Zhou X, Liu KY, Wong STC. Cancer classification and prediction using logistic regression with Bayesian gene selection. J Biomed Inform. 2004, 37(4): 249-259.
- Kim Y, Kwon S, Song SH. Multiclass sparse logistic regression for classification of multiple cancer types using gene expression data. Computational Statistics and Data Analysis. 2006, 51(3): 1643-1655.
- Oustimov A, Vu V. Artificial neural networks in the cancer genomics frontier. Transl Cancer Research. 2014, 3(3): 191-201.
- Adetiba E, Olugbara OO. Lung Cancer Prediction Using Neural Network Ensemble with Histogram of Oriented Gradient Genomic Features. The Scientific World Journal. 2015, 2015.
- Kawauchi D, Robinson G, Uziel T, Gibson P, Rehg J et al. A mouse model of the most aggressive subgroup of human medulloblastoma. Cancer Cell. 2012, 21(2): 168-180.
- Eilers PHC, Boer JM, Ommen GV, Houwelingen HCV. Classification of microarray data with penalized logistic regression. Proceedings of SPIE. 2001, 4266: 187-198.
- Fort G, Lambert-Lacroix S. Classification using partial least squares with penalized logistic regression. Bioinformatics. 2005, 21(7): 1104-1111.
- Nguyen DV, Rocke DM. Tumor classification by partial least squares using microarray gene expression data. Bioinformatics. 2002, 18(1): 39-50.
- Shen L, Tan EC. Dimension reduction-based penalized logistic regression for cancer classification using microarray data. IEEE/ACM Trans Comput Biol Bioinform. 2005, 2(2): 166-175.
- Zhu J, Hastie T. Classification of gene microarrays by penalized logistic regression. Biostatistics. 2004, 5(3): 427-443.
- Zhou N, Wang L. A Modified T-test Feature Selection Method and Its Application to the HapMap Genotype Data. Genomics Proteomics Bioinformatics. 2007, 5(3-4): 242-249.