A Five-Gene Signature Inferred from Transcriptome Profiling of Homologous Recombination-Mediated DNA Repair Predicts Clinical Outcome of Patients with Cancer
The clinical outcomes of patients with cancer are influenced by both underlying tumor biology and responses to anticancer treatment. With the advance of next-generation sequencing technologies, a large number of cancer genes have been identified across human cancers . However these cancer genes are involved in a limited number of functional pathways that maintain cell survival and determine cell fate, suggesting that a molecular assessment of the functionality of these pathways might provide new opportunities to develop prognostic markers . Given the important role of HR repair in cancer biology and treatment, an intriguing question is whether molecular markers of the HR repair pathway might serve as predictors of clinical outcomes in cancer patients. However, the enormous complexity of the HR repair network makes it difficult to identify individual biomarkers reflecting the functionality of the HR repair network that have the power to stratify patients according to outcome. The HR repair process involves a variety of proteins, which function in a hierarchically ordered and mutually coordinated manner to detect, signal, and repair DSBs [4, 10-12]. Dysfunction of any proteins in the network may directly or indirectly cause HR repair deficiency. Thus, it would be virtually impossible to test for defects in every HR gene in cancer patients by conventional single-gene methods. To overcome this hurdle, we previously conducted transcriptome profiling in a well-defined cell model with depletion of key HR repair genes including BRCA1, RAD51, and BRIT1 (MCPH1) . We found that the expression levels of 230 genes were commonly altered in cells with different causes of functional impairment of HR repair . However, the gene signature containing 230 genes is too large to be an efficient biomarker for clinical applications. Moreover, the genes in this 230- gene signature may not have equivalent clinical relevance.
The goal of this study was to develop a clinically applicable gene signature, based on the gene expression changes we previously identified in cells deficient in HR repair, able to robustly predict clinical outcomes of cancer patients. To this end, we used mathematical and statistical models to refine our previously identified gene signature and identified a core five-gene signature that was associated with the functionality of the HR repair pathway. We used multiple cancer patient cohorts, including cohorts of patients with breast cancer, ovarian cancer, and lung cancer, to systematically validate this five-gene signature as a prognostic marker for patient survival.
Materials and Methods
Publicly Available Expression Data
Two independent sets of data on breast cancer patients (Netherlands Cancer Institute [NKI]  and University of North Carolina [UNC] ź3R / cohorts), one set of data on lung cancer patients , and one set of data on ovarian cancer patients  containing both genome-wide expression data and patient survival data were used for survival analyses. The NKI breast cancer dataset was used as the training set (discovery set), and a Cox proportional hazards model was used to identify subsets of genes that were significantly associated with overall survival. The UNC breast cancer dataset and the lung and ovarian cancer datasets were used as the testing set (validation sets). The NKI dataset consisted of 106 gene expression measurements across 295 breast cancer patients. During follow-up, 79 patients died, while 216 (73.2%) were censored. The UNC dataset consisted of 162 gene expression measurements from 158 patients. During follow-up, 43 patients died, and 115 (72.8%) were censored.
Identification of Core Gene Set
All survival analyses were performed using SAS 9.3 software (SAS Institute Inc, Cary, NC). Cox proportional hazards models are used in survival analyses to assess the significance of various covariates in terms of the survival times of individuals through the hazard function. We performed geneby- gene Cox proportional hazards modeling with the 230 genes in our previously identified HR repair signature  for patients in the NKI dataset using the procedure “PROC PHREG” with “selection=score.” The selection=score parameter specifies the use of the branch-and-bound algorithm of Furnival and Wilson as an alternative to stepwise variable model building to find a specified number of models with the highest likelihood score (chi-square) statistic for all possible model sizes. From this analysis, 13 genes were selected as our “target set 1.”
We also used an alternative approach in which we iteratively applied “selection=score” to the NKI dataset to further refine the core genes associated with patient survival times. In each iteration, the previously selected genes were used as the pool and “selection=score” was applied to generate the next pool. First, from the 106 genes with chi-square and p values indicating significant association with overall survival, we selected the top 90 genes. Then, from these 90 genes we selected 70, from these 70 genes we selected 50, from these 50 genes we selected 30, and from these 30 genes we selected 15. From these 15 genes, we selected the five genes that appeared both in this set and in our target set 1 to obtain our “target set 2”.
Gene Expression Heat Maps and Clustering-Based Survival Analysis
We used the R statistical software (http://cran.r-project.org/) “gplots” package with command “heatmap.2” to generate heat maps from both target sets. To normalize the magnitude of differences in gene expression, we transformed log2- reported expression values into percentiles. For example, if is the value corresponding to the patient and gene, and is the data array corresponding to the gene, then is used in place of in our analysis.
In all cases, hierarchical clustering of gene expression data revealed a major split into two patient subgroups. We used the SAS code “PROC LIFETEST” (a log-rank test) to determine the significance of differences in survival times between the patient subsets.
Ingenuity Systems’ IPA software (Qiagen) was used to perform pathways analysis for each of the gene target sets identified.
Because dysfunctional HR repair due to BRCA1 and/or BRCA2 mutations increases the risk of breast cancer, we utilized breast cancer as our disease model to systematically identify which genes in our previously defined 230-gene signature associated with HR repair deficiency  were prognostic for overall survival. The well-characterized NKI breast cancer patient dataset was used as the discovery dataset to identify core HR repair genes significantly associated with overall survival among breast cancer patients [14, 15].
For each gene in our 230-gene signature for which information was available in the NKI dataset (a total of 106 genes), correlations with overall survival time of breast cancer patients (Figure 1A) were ranked by statistical significance. The 13 genes with a chi-square statistic above 29 were selected as our “target set 1” (Figure 1B). Pathways analysis (Figure 1C and D) showed that these genes are involved in multiple cellular processes, including DNA replication, recombination, and repair; nucleic acid metabolism; and cell cycle regulation. In addition, a large proportion of the genes were in canonical pathways involved in mismatch repair in eukaryotes, mitotic roles of polo-like kinase, and salvage pathways of pyrimidine deoxyribonucleotides, suggesting that these 13 genes recapitulated the biological functional spectrum of the original 230 genes .
Figure 1. Cox proportional hazards regression analysis identifies 13 genes most significantly associated with breast cancer survival. (A) Plot of chi-square scores for significance of association with breast cancer survival time for the 106 genes in our 230- gene signature for which information was also available in the NKI dataset (n=295 patients). (B) Names and chi-square scores of the 13 genes in panel A with chi-square scores above 29. (C) Top 15 canonical pathways of the 13 genes in panel B using Ingenuity Systems IPA software. The –log(p value) is obtained by the Ingenuity program using Fisher’s exact test. The threshold for significance is P=0.05.Different mathematical algorithms may lead to the selection of divergent sets of genes from the same discovery dataset. To take advantage of the power of multiple statistical analysis methods and, more importantly, to reduce biases that might result from use of only one mathematical algorithm or statistical model, we also applied recursive analysis. Using an iterative process of stepwise reduction in the number of genes included in the Cox proportional hazards model, we again analyzed the NKI breast cancer dataset to define a 15- gene signature (Fig. 2A). An IPA analysis of this gene set (S1 Figure.) demonstrated that these 15 genes are involved in biological pathways similar to those in which the target set 1 genes are involved, suggesting similar biological functions. One notable difference was that genes involved in canonical pathways of telomere extension by telomerase and telomerase signaling were identified in this iterative analysis but were not present in target set 1.
Figure 2. Recursion analysis identifies five genes most significantly associated with breast cancer survival. (A) Recursion analysis was used to identify the 15 genes most significantly associated with breast cancer survival time in the NKI dataset (n=295 patients). (B) Venn diagram indicating the five genes identified as being significantly associated with breast cancer survival in both Cox proportional hazards and recursion analysis. (C) Top 3 canonical pathways of the five genes using Ingenuity Systems’ IPA software. The –log(p value) is obtained by the Ingenuity program using Fisher’s exact test. The threshold for significance is P=0.05.
We then selected the five genes identified by both Cox proportional hazards modeling and iterative analysis (Fig. 2B). Pathways analysis revealed that these five genes are involved in pathways similar to those in which genes from the previously identified 13-gene and 15-gene signatures are involved (Fig. 2C and D). Of these five genes, three are known to be involved in DNA repair, HR, and cell cycle regulation. EXO1 is an evolutionarily conserved gene that encodes a protein with 5’ to 3’ exonuclease activity as well as RNase H activity . It plays an important role in regulating mismatch repair and HR repair . TRIP13 (thyroid hormone receptor interactor 13) is a hormone-dependent transcription factor that interacts with thyroid hormone receptors. It has a key function in regulating chromosome recombination and chromosome structure development during meiosis [18, 19]. PRC1 (protein regulating cytokinesis) plays an important role in regulating cytokinesis and is present at high levels during the S and G2/M phases, where DSBs are dominantly repaired by the HR repair pathway [20, 21]. The other two genes in our five-gene signature, ADM and LRP8, have no reported function directly involved in maintaining genomic stability. ADM (adrenomedullin) functions as a hormone in circulation control, including in vasodilation, regulation of hormone secretion, promotion of angiogenesis, and antimicrobial activity [22-24]. LRP8 (low-density lipoprotein receptor-related protein 8) encodes a protein of the low-density lipoprotein receptor family, which functions as a signaling mediator of the Reelin pathway during neuron migration and development and also as a receptor for the cholesterol transport protein apolipoprotein E .
We next sought to validate this five-gene signature using the independent UNC breast cancer patient cohort. To this end, we used the original 230-gene signature, the 13-gene signature (target set 1), and the five-gene signature to cluster patients from both the NKI (discovery set) and UNC (validation set) cohorts on the basis of the expression patterns of the genes in these various signatures.
Figure 3. The gene signatures associated with the HR repair pathway stratify breast cancer patients into two groups on the basis of mRNA expression levels of genes in the signature. In each heat map, each column represents the gene expression profile of 1 patient, and each row represents the expression level of 1 gene. (A and B) Our previously reported 230-gene signature  was used to stratify patients from the NKI breast cancer cohort (A) and the UNC breast cancer cohort (B). (C and D) The 13-gene signature was used to stratify patients from the NKI breast cancer cohort (C) and the UNC breast cancer cohort (D). (E and F) The five-gene signature was used to stratify patients from the NKI breast cancer cohort (E) and the UNC breast cancer cohort (F).
Supervised clustering analysis using the 230-gene signature stratified patients in the NKI and UNC cohorts into two distinct groups (Fig. 3A and B). However, the gene expression patterns for individual patients in both groups were rather heterogeneous, i.e., individual patients had both upregulated and downregulated genes. Supervised clustering analysis using the 13-gene signature (Figure 3C and D) and five-gene signature (Fig. 3E and F) also stratified patients into two distinct groups, but for both gene signatures, many individual patients had clearly dominant gene overexpression or clearly dominant gene downregulation. The number of patients with clearly dominant gene overexpression or gene downregulation was even greater for the five-gene signature than for the 13-gene signature.
Figure 4. The 13-gene and five-gene signatures predict overall survival in independent breast patient cohorts.(A and B) Patients from the NKI cohort (A) and the UNC cohort (B) were stratified by the 13-gene signature into two groups. Kaplan-Meier overall survival curves are shown. P values are from log-rank test. (C and D) Patients from the NKI cohort (C) and the UNC cohort (D) were stratified by five-gene signature into two groups. Kaplan-Meier overall survival curves are shown. P values are from log-rank test.
We next tested, for both the 13-gene signature and the fivegene signature, whether overall survival time differed between the groups with high and low gene expression identified by supervised clustering analysis. (For the 230-gene signature, we had previously reported that cancer patients with higher DNA repair capacity exhibited a poorer prognosis than cancer patients with lower DNA repair capacity .) We selected from the NKI dataset the 36 patients with highest gene expression levels and the 26 patients with the lowest gene expression levels (Figure 3C), and we selected from the UNC dataset 21 patients with high gene expression levels and 33 patients with low gene expression levels (Figure 3D). As shown in Fig. 4A and 4B, for the 13-gene signature, in both datasets, patients with higher expression levels of these genes had significantly worse overall survival than patients with lower expression levels of these genes. Interestingly, for the five-gene signature, we also observed that patients with higher gene expression levels had significantly worse overall survival than patients with lower gene expression levels, suggesting that the five-gene signature was as effective as the 13-gene signature in terms of stratifying patients with respect to prognosis. Notably, the five-gene signature had stronger prognostic value than the 13-gene signature in the UNC cohort, in which the median survival time was less than 40 months.
To test whether these classifications were generalizable to other cancers, we repeated our heat map and survival analyses using microarray datasets from an 87-patient ovarian cancer cohort (Fig. 5) and a 500-patient lung cancer cohort (Fig. 6). Consistent with our observations in breast cancer, patients with high expression of genes in the signatures had significantly worse overall survival than patients with low gene expression; furthermore, the five-gene signature again outperformed the 13-gene signature as a prognostic indicator in both ovarian and lung cancer (Fig. 5D vs 5B and Fig. 6D vs 6B). These results suggest that the five-gene signature associated with defective HR repair predicts the overall survival of cancer patients.
Figure 5. The 13-gene and five-gene signatures predict overall survival in an ovarian cancer (OVC) patient cohort. (A and B) Patients were stratified by 13-gene signature into two groups (A). Kaplan-Meier overall survival curves are shown (B). P values are from log-rank test. (C and D) Patients were stratified by five-gene signature into two groups (C). Kaplan-Meier overall survival curves are shown (D). P values are from log-rank test.
Figure 6. The 13-gene and five-gene signatures predict overall survival in a lung cancer patient cohort. (A and B) Patients were stratified by 13-gene signature into two groups (A). Kaplan-Meier overall survival curves are shown (B). P values are from log-rank test. (C and D) Patients were stratified by five-gene signature into two groups (C). Kaplan-Meier overall survival curves are shown (D). P values are from log-rank test.
In this study, we used a systems biology approach to identify a gene expression signature that is associated with a functionally impaired HR repair pathway and predicts reduced overall survival among patients with breast, lung, or ovarian cancer. The novelty of our approach lies in the development of a critical DNA-repair gene expression signature from a well-defined cell model with impaired HR repair; subsequent refinement of that core gene signature to a simplified five-gene signature using complementary statistical methods; and, finally, validation of this five-gene signature using independent patient cohorts across multiple cancer lineages.
The HR repair pathway represents a fundamental DNA repair mechanism used by cells to maintain genomic integrity and prevent mutagenesis and tumorigenesis. The key HR repair genes BRCA1 and BRCA2 are among the most prominent cancer genes, and mutations of BRCA1 and BRCA2 are associated with significantly increased risks of breast and ovarian cancer and worse prognosis in patients with these diseases . Mutations in other genes involved in the HR repair network might also play a significant role in cancer risk . Because of the complexity of the HR repair pathway, in this study, rather than searching for individual genes involved in HR repair, we attempted to identify a gene signature that correlates with functional deficiency of the HR repair pathway that could be used as a marker of prognosis. The expression of the genes in our five-gene signature appears to provide a surrogate measure to identify a subgroup of patients with reduced HR repair function, which corresponds to a worsened patient outcome. Notably, we validated our results identified from breast cancer cohorts in two independent lung cancer and ovarian cancer patient cohorts. Our findings suggest that our five-gene signature is not specific to a certain type of human cancer, which is consistent with the fundamental role of HR repair in cellular responses to DNA damage. However it remains to be further explored whether we can identify cancer-type-specific gene signatures that correlate with clinical outcomes in different cancers. The existence of such gene signatures is possible since different tissue origins may preferentially utilize certain DNA repair pathways to counteract genomic damage.
Triple-negative breast cancers, which are associated with mutations in BRCA1, progress more rapidly than other subtypes of breast cancer, suggesting that HR deficiency promotes the development of genomic instability, which provides a genetic driving force for tumorigenesis. Thus, the five-gene signature associated with the function of the HR repair pathway may have the potential to be used as a prognostic marker for tumor progression. Therefore, we plan to test whether the five-gene signature can be used as a riskprediction marker for progression of ductal carcinoma in situ to invasive breast cancer.
A recent study showed that PARP inhibitors can be used as effective chemopreventive agents to delay mammary tumor development in BRCA1-deficient mice . This finding suggests a new targeted chemoprevention approach in the high-risk population with well-defined genetic alterations such as BRCA1 and BRCA2 mutations. However, the chemopreventive benefits of PARP inhibitors may not be limited to patients with BRCA1/BRCA2 mutations; patients with a functionally “impaired” HR pathway, as estimated using our five-gene signature, might also benefit from PARP inhibitors. However, more studies are required to test this concept.
mRNA-based prognostic molecular profiles have been successfully used in clinical research. One example is the 21- gene Oncotype DX quantitative PCR assay, which was generated from a 25,000-gene microarray study [37, 38]. Oncotype DX is used to estimate the risk of recurrence in patients with estrogen receptor-positive and lymph node-negative breast tumors. The 70-gene MammaPrint profile was tested as a prognostic marker in a prospective clinical trial in combination with clinicopathological features . The Rotterdam profile contains a 76-gene signature; most of these genes are involved in proliferation-associated signaling pathways . A 186-gene signature identified from CD44 high/CD24 low tumor cells shows an enrichment of genes involved in invasiveness and metastasis . The PAM50 marker set consisting of 50 genes has been shown to be the most powerful predictor yet available. These studies demonstrate that use of mRNA-based signatures is a clinically applicable approach to predict the clinical outcomes of cancer patients. We believe that the simplified five-gene signature identified in our current study would be ideally suited for development into a clinical assay. However, although the method has been optimized in a cohort of 295 breast cancer patients, further extensive validation is necessary prior to clinical investigations.
In summary, this study used a systems biology approach to identify the core genes involved in a defective HR DNA repair network. We report a five-gene signature that was associated with overall survival among patients with breast, ovarian, and lung cancers. This unified classifier has the potential to guide targeted chemoprevention strategies and to alter the choice of treatments across the field of oncology.
9.Jones S, Zhang X, Parsons DW, Lin JC, Leary RJ et al. Core signaling pathways in human pancreatic cancers revealed by global genomic analyses. Science. 2008, 321(5897):1801- 1806.
12.Peng G, Lin SY. Exploiting the homologous recombination DNA repair network for targeted cancer therapy. World J Clin Oncol. 2011, 2(2): 73-79.
23.Kirisci M, Oktar GL, Ozogul C, Oyar EO, Akyol SN et al. Effects of adrenomedullin and vascular endothelial growth factor on ischemia/reperfusion injury in skeletal muscle in rats. J Surg Res. 2013,185(1): 56-63.
30.Escribano-Diaz C, Orthwein A, Fradet-Turcotte A, Xing M, Young JT et al. A cell cycle-dependent regulatory circuit composed of 53BP1-RIF1 and BRCA1-CtIP controls DNA repair pathway choice. Mol Cell. 2013, 49(5): 872-883.
31.Chapman JR, Barral P, Vannier JB, Borel V, Steger M et al. RIF1 is essential for 53BP1-dependent nonhomologous end joining and suppression of DNA double-strand break resection. Mol Cell. 2013, 49(5): 858-871.
34.Bouwman P, Aly A, Escandell JM, Pieterse M, Bartkova J et al. 53BP1 loss rescues BRCA1 deficiency and is associated with triple-negative and BRCA-mutated breast cancers. Nat Struct Mol Biol. 2010,17(6): 688-695.
36.To C, Kim EH, Royce DB, Williams CR, Collins RM et al. The PARP inhibitors, veliparib and olaparib, are effective chemopreventive agents for delaying mammary tumor development in BRCA1-deficient mice. Cancer Prev Res. 2014, 7(7): 698-707.
38.Goldstein LJ, Gray R, Badve S, Childs BH, Yoshizawa C et al. Prognostic utility of the 21-gene assay in hormone receptor- positive operable breast cancer compared with classical clinicopathologic features. J Clin Oncol. 2008, 26(25): 4063- 4071.
41.Dowsett M, Sestak I, Lopez-Knowles E, Sidhu K, Dunbier AK et al. Comparison of PAM50 risk of recurrence score with oncotype DX and IHC4 for predicating risk of distant recurrence after endocrine therapy. J Clin Oncol. 2013, 31(22): 2783-2790.