1 Radiation Biology and Environmental Toxicology Group, Department of Cell and Molecular Biology, Life Sciences Division, Lawrence Berkeley National Laboratory, Berkeley, California 94720
2 Chiron Corporation, Emeryville, California 94608
3 Gene Logic Incorporated, Berkeley, California 94704
| ABSTRACT |
|---|
marker genes; mixture of experts; support vector machines; adipsin; cystatin C; azurocidin
| INTRODUCTION |
|---|
A variety of techniques have been employed to address three statistical tasks associated with analysis of profile data (3, 9, 11, 13, 16-18, 20-22). The first, unsupervised learning, involves discovering and characterizing the classes present in unlabeled profile vectors. This clustering procedure can suggest previously unrecognized cancer (sub)types. The second task, supervised learning, involves discriminating between profile vectors with different labels and assigning the label of a new profile vector. Given profiling data for a sample of unknown origin, this classification and prediction procedure can indicate the origin of the sample, for example, whether it is from tumor or nontumor tissue. The third task and subject of this work is feature relevance, ranking, and selection. This involves defining a feature relevance expert which 1) implements an algorithm that quantitates the degree to which a gene distinguishes samples, 2) reorders genes according to this relevance value, 3) selects nested subsets of ranked genes and uses them to train a supervised learning system, and 4) identifies highly informative or marker genes based on the ability of subsets to assign accurately the label for samples not used for training, i.e., the generalization performance of the subset. Thus, gene subsets corresponding to marker genes can be identified by varying a single parameter, the number of ranked features used to train and evaluate the supervised learning system. For a given data set, different feature relevance experts can be compared via their generalization performance on the same number of ranked genes.
Recently, two independent studies employed different techniques to address the three aforementioned tasks. The first study applied naive Bayes models, support vector machines (SVMs) and naive Bayes global relevance (NBGR) (16) to sixty-two 1,988-feature experiment profile vectors derived from colon adenocarcinoma samples labeled as tumor or nontumor (2). The NBGR requires unlabeled profile vectors as input, since it is computed from the probability parameters of profile vector classes discovered by a naive Bayes model. The second study applied self-organizing maps (SOMs), neighborhood analysis, and weighted voting and gene/class correlation to seventy-two 7,070-feature experiment profile vectors derived from bone marrow (BM) and peripheral blood (PB) samples labeled as acute lymphoblastic or myeloid leukemia (ALL, AML) (11). The relevance measure, referred to here as the mean aggregate relevance (MAR), requires labeled profile vectors, since it is computed from the mean and standard deviation of the expression levels of genes in samples labeled ALL and AML. For the {Tumor, Nontumor} and {ALL, AML} binary supervised learning problems, each study identified 50 markers that had the same generalization performance as the full repertoire of, respectively, 1,988 or 7,070 genes (11, 16).
This work considers three distinct but interrelated feature relevance-, ranking-, and selection-related problems. Currently, the number of training examples, N sample profile vectors, is considerably smaller than their dimensionality, L measured gene expression levels (N << L). The first problem is identifying P marker genes for development of a robust decision support system to assign the cancer (sub)type for a new sample as accurately as or better than the original L genes (P << L). The second problem involves reducing the dimensionality even further by defining the Q marker genes best-suited for subsequent experimental investigations (Q < P << L). The third problem concerns multiply-labeled profile vectors and increasing the utility of profiling studies beyond their original purpose. Apart from the primary ALL and AML labels, each leukemia sample had 1-3 additional labels: {PB, BM}, {T cell, B cell}, and {Male, Female} (11). Since it is unlikely that all 7,070 genes are involved in differentiating ALL from AML, it is possible that some (or all) could provide a readout on other aspects of the samples. The question becomes whether the L genes analyzed to address a primary supervised learning problem can be employed to identify markers for secondary problems defined by additional sample labels. Here, a mixture of feature relevance experts is used to address the first and second problems. The validity of the premise underlying the third problem is demonstrated using data from the leukemia samples. Since submission of this work, a variety of approaches for identifying marker genes have been proposed (see, for example, Refs. 7, 8, 10, 12, and 23).
| METHODS AND APPROACH |
|---|
Although not examined in this work, a variety of other supervised learning problems can be derived from the leukemia data. For example, a binary problem might involve distinguishing leukemia subtypes on the basis of tissue of origin ({ALL+PB, ALL+BM} or {AML+PB, AML+BM}). A multiclass problem could include discriminating samples on the basis of tissue origin and subtype ({ALL+PB, ALL+BM, AML+PB, AML+BM}). For convenience, each functionally defined nucleic acid sequence probe whose expression level is monitored will be termed a "gene," irrespective of whether it is actually a gene, an expressed sequence tag, or DNA from another source.
Feature relevance experts.
The three feature relevance experts evaluated here implement relevance measures that are based upon labeled (MAR, MVR) or unlabeled (NBGR) sample profile vector training examples. These measures are designed to be illustrative rather than comprehensive, because, for example, all treat genes as independent of one another, whereas the transcription levels of some genes are likely to be correlated. In general, each measure generates a ranking of features and defines nested gene subsets Top1 ⊂ Top2 ⊂ ... ⊂ TopL, where L is the number of genes monitored in the profiling study (here L = 1,988 or 7,070). Top1 denotes the top-ranked or most distinctive gene according to the relevance measure, Top2 denotes the top 2, and so on. Evaluating all possible gene subsets in terms of how well they perform on a particular classification and prediction problem using a supervised learning method (here SVMs) is a computationally demanding task. Hence, the focus is on a small number of selected gene subsets, for example, Top4 ⊂ Top5 ⊂ Top11 ⊂ Top25 ⊂ Top50 ⊂ Top100 ⊂ TopL, as well as the bottom and middle 50 ranked genes.
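The ranking and subset construction described above can be sketched as follows. This is a minimal illustration rather than the authors' code: the function names, the toy relevance values, and the choice to rank by absolute relevance are all assumptions made for the example.

```python
def rank_genes(relevance):
    """Return gene names ordered from most to least relevant.

    Assumption: a larger absolute relevance value means a more distinctive gene.
    """
    return sorted(relevance, key=lambda gene: abs(relevance[gene]), reverse=True)


def representative_subsets(ranked, top_sizes, block=50):
    """Build nested top-k subsets plus middle and bottom blocks of `block` genes."""
    n = len(ranked)
    subsets = {f"Top{k}": ranked[:k] for k in top_sizes if k <= n}
    subsets[f"Top{n}"] = ranked[:]  # TopL: every monitored gene
    block = min(block, n)
    start = (n - block) // 2
    subsets[f"Middle{block}"] = ranked[start:start + block]
    subsets[f"Bottom{block}"] = ranked[n - block:]
    return subsets


# Toy relevance values for six hypothetical genes (not data from the study).
toy_relevance = {"g1": 0.2, "g2": -3.0, "g3": 1.5, "g4": 0.0, "g5": -0.7, "g6": 2.1}
toy_ranked = rank_genes(toy_relevance)
toy_subsets = representative_subsets(toy_ranked, top_sizes=[1, 2, 4], block=2)
```

For the study's data the call would use `top_sizes=[4, 5, 11, 25, 50, 100]` and `block=50`, which yields the nine subsets per ranking mentioned later in the text.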
It remains to be determined whether the degrees of difficulty of the supervised learning and feature selection problems posed by the leukemia and adenocarcinoma data sets are typical of cancer profiling studies. The strategy deployed here is sufficiently general that other feature relevance measures, ranking and selection techniques, supervised learning methods, training and evaluation procedures, and methods for combining predictions from experts could be utilized.
Median vote relevance.
For gene Fl, let xln be its expression level in sample n, and consider the median values of Fl over the samples belonging to classes i (positive training examples) and j (negative examples). Each sample casts a vote V(n, l) according to whether its expression level is closer to the median value of class i or j. The median vote relevance (MVR) is the sum of these votes over all N samples

$$\mathrm{MVR}(F_l) = \sum_{n=1}^{N} V(n, l)$$
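A small sketch of a median-vote computation may help make the idea concrete. The exact vote values are not recoverable from the text, so the convention used here is an assumption: a sample votes +1 when it lies closer to the median of its own class, -1 when it lies closer to the median of the other class, and abstains on a tie. The function name and toy values are hypothetical.

```python
from statistics import median


def median_vote_relevance(values, labels):
    """Sum of per-sample votes for one gene.

    values: expression levels x_ln of the gene across the N samples.
    labels: True for class i (positive), False for class j (negative).
    Assumed vote: +1 if the sample is closer to its own class median,
    -1 if closer to the other class median, 0 on a tie.
    """
    med_i = median(v for v, y in zip(values, labels) if y)
    med_j = median(v for v, y in zip(values, labels) if not y)
    total = 0
    for v, y in zip(values, labels):
        d_i, d_j = abs(v - med_i), abs(v - med_j)
        if d_i == d_j:
            continue  # tie: no vote (assumption)
        closer_to_i = d_i < d_j
        total += 1 if closer_to_i == y else -1
    return total


toy_labels = [True, True, True, False, False, False]
separating_gene = [1.0, 1.2, 0.9, 5.0, 5.5, 4.8]
mixed_gene = [1.0, 5.0, 1.0, 5.0, 1.0, 5.0]
```

Under this convention a gene that separates the classes cleanly scores N, while a gene whose values are interleaved across classes scores near zero.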
Naive Bayes global relevance.
Given the K classes identified and characterized by a naive Bayes model estimated from N unlabeled L-feature profile vectors, the NBGR (16) is the sum of the relevance over pairwise combinations of classes
[Equation for NBGR(Fl) not available in this copy.]
P(xln|ck,l) is the probability of the expression level given class k. The greater the absolute magnitude of the NBGR, the better the gene distinguishes all K classes. A naive Bayes model was estimated using AutoClass C version 3.3 (5) and the 72 unlabeled 7,070-feature leukemia sample profile vectors (the reported expression values were not shifted or scaled in any way). An expectation maximization algorithm finds a mixture of Gaussian probability distributions, and a Bayesian approach finds the maximum posterior probability classification and optimum number of classes K. Thus,

$$P(x_{ln} \mid c_{k,l}) = \left[2\pi\sigma_{k,l}^{2}\right]^{-1/2} \exp\left[-\tfrac{1}{2}\left\{\frac{x_{ln}-\mu_{k,l}}{\sigma_{k,l}}\right\}^{2}\right]$$

where [µk,l, σk,l] is the [mean, standard deviation] of the Gaussian modeling class k. For each feature, gene l, a lower bound for σk,l was set to 1/10 of the standard deviation of all N expression levels, {xl1, ... , xlN}.
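The class-conditional Gaussian density and the lower bound on its standard deviation can be illustrated with a few lines of code. This sketch covers only the density, not the NBGR itself or the AutoClass estimation. The function names are hypothetical, and using the population standard deviation for the floor is an assumption, since the text does not say which estimator was used.

```python
import math
from statistics import pstdev


def sigma_floor(all_values):
    """Lower bound for sigma_{k,l}: one tenth of the std. dev. of all N values.

    Assumption: population standard deviation (the source does not specify).
    """
    return pstdev(all_values) / 10.0


def gaussian_density(x, mu, sigma, floor=0.0):
    """Gaussian density of x for a class with mean mu and std. dev. sigma.

    sigma is clipped from below at `floor` before the density is evaluated.
    """
    s = max(sigma, floor)
    return math.exp(-0.5 * ((x - mu) / s) ** 2) / math.sqrt(2.0 * math.pi * s * s)
```

The floor guards against a class whose estimated spread for a gene is nearly zero, which would otherwise produce extremely peaked densities.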
A naive Bayes model of the adenocarcinoma experiment profile vectors identified four underlying classes (16) rather than the two indicated by the tumor and nontumor labels (2). NBGR values were calculated using Gaussian parameters determined directly from the values of gene Fl in the tumor and nontumor samples, i.e., K = 2. The generalization performance of the top 50 genes from this "supervised" NBGR expert was considerably worse than that of the top 50 from an "unsupervised" NBGR expert that employed the K = 4 classes estimated from data.
Mean aggregate relevance.
This is the correlation between a gene and the ALL/AML classes (11). Unlike the MVR, the MAR utilizes both the location and spread of samples in classes i and j
$$\mathrm{MAR}(F_l) = \frac{\mu_{i,l} - \mu_{j,l}}{\sigma_{i,l} + \sigma_{j,l}}$$

where [µi,l, σi,l] and [µj,l, σj,l] are the mean and standard deviation of the log of the expression level of gene Fl in classes i and j. A large absolute magnitude signifies a strong correlation. A positive (negative) sign indicates that the gene is more highly expressed in class i (j). MAR(Fl) is related to the Fisher criterion score |(µi,l - µj,l)/(σi,l² + σj,l²)|.
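As a rough illustration, the relevance of a single gene can be computed from class-wise means and standard deviations of log expression. The sketch assumes the MAR takes the signal-to-noise form (µi - µj)/(σi + σj) used in the weighted-voting study (11). The natural logarithm, the population standard deviation, and the function name are further assumptions, and expression values must be positive for the logarithm to be defined.

```python
import math
from statistics import mean, pstdev


def mean_aggregate_relevance(values, labels):
    """(mu_i - mu_j) / (sigma_i + sigma_j) on log expression levels.

    values: positive expression levels of one gene across samples.
    labels: True for class i, False for class j.
    Assumptions: natural log and population standard deviation.
    """
    logs_i = [math.log(v) for v, y in zip(values, labels) if y]
    logs_j = [math.log(v) for v, y in zip(values, labels) if not y]
    return (mean(logs_i) - mean(logs_j)) / (pstdev(logs_i) + pstdev(logs_j))


toy_values = [math.exp(2.0), math.exp(4.0), math.exp(0.0), math.exp(2.0)]
toy_labels = [True, True, False, False]
```

Swapping the two class labels flips the sign of the result, which matches the interpretation of the sign given in the text.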
Leukemia and adenocarcinoma genes: feature ranking and selection.
The 7,070 genes in the leukemia data were ranked separately according to their NBGR value and their MVR value for the labels {ALL, AML}, {PB, BM}, {T cell, B cell}, and {Male, Female} (one NBGR ranking and four MVR rankings, five in total). The 1,988 genes in the adenocarcinoma data were ranked separately according to their NBGR value and their MVR value for the label {Tumor, Nontumor} (two different rankings). For each of these seven rankings, nine representative gene subsets were created by selecting different numbers of top-, middle-, and bottom-ranked genes. Two additional gene subsets based on the {ALL, AML} labels were defined. The first, taken from figure 3A of Ref. 11 and referred to as the MAR 50, represents the 25 genes with the highest positive values and the 25 genes with the highest negative values. The second subset consists of genes common to the MAR 50, the NBGR top 50, and the MVR top 50. For the multiply-labeled leukemia data, the NBGR ranking reflects the importance of genes in distinguishing ALL from AML, so it may be uninformative in terms of the other labels.
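The two additional subsets can be expressed compactly. The sketch below shows one way to pick the k most positive and k most negative genes from a signed relevance score and to intersect several subsets. The function names and the toy gene names are hypothetical, and nothing here reproduces the published gene lists.

```python
def top_positive_and_negative(relevance, k):
    """Return the k genes with the largest positive scores followed by
    the k genes with the most negative scores (k = 25 gives a 50-gene set)."""
    ascending = sorted(relevance, key=relevance.get)
    return ascending[-k:][::-1] + ascending[:k]


def common_genes(*subsets):
    """Genes present in every one of the supplied subsets."""
    return set.intersection(*(set(s) for s in subsets))


# Toy signed scores and toy subsets for hypothetical genes.
toy_scores = {"a": 3.0, "b": -2.0, "c": 0.5, "d": -4.0, "e": 1.0}
toy_selection = top_positive_and_negative(toy_scores, k=1)
toy_common = common_genes(["a", "b", "c"], ["b", "c", "d"], ["c", "b", "e"])
```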
SVMs: training and evaluation.
Because of the limited number of training examples, a leave-one-out cross validation strategy was utilized. A pool of N known positive and negative training examples was partitioned into two disjoint sets (here N = 62, 72). The estimation set, N - 1 examples, was used to determine the parameters of an SVM, and the test set, 1 example, was used to assess its generalization performance. The label assigned by a trained SVM to a test example can be a true positive (known positive test example, assigned positive label), true negative (negative example, negative label), false positive (negative example, positive label), or false negative (positive example, negative label). This procedure was repeated for each training example in turn. The generalization performance of these leave-one-out studies is the total number of SVMs that make true positive or true negative assignments (the maximum possible generalization performance is N). Elsewhere (11), the 72 leukemia training examples were partitioned into estimation and test sets containing 38 and 34 examples, respectively. The generalization performance of this "38 estimation, 34 test" partitioning is how many of the 34 test examples were assigned to be true positives or true negatives. The roles of the two sets were then reversed, and the generalization performance of a "34 estimation, 38 test" partitioning was determined in a similar manner.
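The leave-one-out procedure and the way assignments are tallied can be sketched as below. To keep the example self-contained, a nearest-centroid rule stands in for the SVM. It is explicitly not the learner used in the study, and all function names and toy data are assumptions.

```python
def fit_centroids(X, y):
    """Stand-in learner (a nearest-centroid rule, NOT an SVM)."""
    def centroid(rows):
        return [sum(col) / len(rows) for col in zip(*rows)]
    positives = [x for x, label in zip(X, y) if label]
    negatives = [x for x, label in zip(X, y) if not label]
    return centroid(positives), centroid(negatives)


def predict_centroids(model, x):
    """Assign the positive label if x is at least as close to the positive centroid."""
    pos_c, neg_c = model
    d_pos = sum((a - b) ** 2 for a, b in zip(x, pos_c))
    d_neg = sum((a - b) ** 2 for a, b in zip(x, neg_c))
    return d_pos <= d_neg


def leave_one_out(X, y, fit, predict):
    """Train on N - 1 examples, test on the withheld one, repeat for each example."""
    counts = {"TP": 0, "TN": 0, "FP": 0, "FN": 0}
    for i in range(len(X)):
        model = fit(X[:i] + X[i + 1:], y[:i] + y[i + 1:])
        predicted = predict(model, X[i])
        if predicted:
            counts["TP" if y[i] else "FP"] += 1
        else:
            counts["FN" if y[i] else "TN"] += 1
    return counts


toy_X = [[0.0, 0.1], [0.2, 0.0], [0.1, 0.2], [1.0, 0.9], [0.9, 1.1], [1.1, 1.0]]
toy_y = [True, True, True, False, False, False]
toy_counts = leave_one_out(toy_X, toy_y, fit_centroids, predict_centroids)
toy_performance = toy_counts["TP"] + toy_counts["TN"]  # maximum possible is N
```

Passing `fit` and `predict` as arguments means any supervised learner with the same two-step interface could be substituted for the stand-in.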
In addition to training examples, estimating an SVM requires specifying an inner-product kernel function, a measure of similarity between two profile vectors XLi = {x1i, ... , xLi} and XLj = {x1j, ... , xLj}. Since there is no general theory for determining the most appropriate kernel for a particular learning problem, two kernels were employed. The first was the dot product kernel

$$K(X_{Li}, X_{Lj}) = \sum_{l=1}^{L} x_{li}\, x_{lj}$$

The second was a radial basis kernel function

$$K(X_{Li}, X_{Lj}) = \exp\left(-\frac{\lVert X_{Li} - X_{Lj}\rVert^{2}}{2\sigma^{2}}\right)$$

where γ = 1/2σ² is a user-defined width parameter. Two different width parameters were used: 1) γf = 0.01, a data-independent value employed in earlier work (16); and 2) γd, a data-dependent value in which σ is set equal to the median of the Euclidean distances from each positive training example to the nearest negative training example (3).
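The two kernels and the data-dependent width can be written out directly. This is a sketch under the reading that the radial basis kernel is exp(-γ‖Xi - Xj‖²) with γ = 1/(2σ²), and that σ for the data-dependent case is the median nearest-negative distance. The function names and toy points are assumptions.

```python
import math
from statistics import median


def dot_kernel(a, b):
    """Dot product kernel: sum over genes of x_li * x_lj."""
    return sum(x * y for x, y in zip(a, b))


def rbf_kernel(a, b, gamma):
    """Radial basis kernel exp(-gamma * ||a - b||^2), with gamma = 1 / (2 sigma^2)."""
    squared_distance = sum((x - y) ** 2 for x, y in zip(a, b))
    return math.exp(-gamma * squared_distance)


def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def data_dependent_gamma(positives, negatives):
    """gamma from sigma = median distance of each positive to its nearest negative."""
    nearest = [min(euclidean(p, n) for n in negatives) for p in positives]
    sigma = median(nearest)
    return 1.0 / (2.0 * sigma ** 2)


toy_positives = [[0.0, 0.0], [0.0, 1.0]]
toy_negatives = [[3.0, 4.0], [0.0, 3.0]]
toy_gamma = data_dependent_gamma(toy_positives, toy_negatives)
```

In the toy data the nearest-negative distances are 3 and 2, so σ is 2.5 and γ is 1/(2 × 6.25).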
SVMs were trained and evaluated using SVMlight version 3.02 (15). Each gene subset was employed to create training examples in which the input profile vectors contained only the selected genes. Rather than working directly with the reported expression levels, xln, each value was normalized using $x_{ln} / [\sum_{l \in S} (x_{ln})^{2}]^{1/2}$, where S is the subset of interest. For simplicity and to illustrate the basic approach, genes were ranked once using all N training examples and not reranked for each estimation set. To account for unequal numbers of positive and negative examples, each estimation set was balanced by duplicating as many randomly chosen examples as necessary from the smaller set to yield the same number of examples as the larger set. Elsewhere (3), imbalanced data sets were handled by adding a diagonal to the kernel matrix (different values for positive and negative examples).
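The normalization and balancing steps described above might look like the following. This is an illustrative sketch only: the function names are hypothetical, and drawing duplicates with replacement from the original smaller set is an assumption about how "randomly chosen examples" were selected.

```python
import math
import random


def normalize_profile(profile, subset):
    """Keep only the genes in `subset` and scale them to unit Euclidean length,
    i.e. x_l / sqrt(sum over l in S of x_l^2)."""
    selected = [profile[l] for l in subset]
    norm = math.sqrt(sum(v * v for v in selected))
    return [v / norm for v in selected] if norm > 0 else selected


def balance_by_duplication(positives, negatives, rng):
    """Duplicate randomly chosen examples from the smaller class until both
    classes have the same number of examples (sampling with replacement)."""
    pos, neg = list(positives), list(negatives)
    while len(pos) < len(neg):
        pos.append(rng.choice(positives))
    while len(neg) < len(pos):
        neg.append(rng.choice(negatives))
    return pos, neg


toy_profile = [3.0, 100.0, 4.0]
toy_normalized = normalize_profile(toy_profile, subset=[0, 2])
toy_pos, toy_neg = balance_by_duplication(["p1", "p2"], ["n1", "n2", "n3", "n4", "n5"],
                                          random.Random(0))
```

Because the norm is taken over the selected genes only, the same sample is scaled differently for each gene subset.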
| RESULTS |
|---|
The data-dependent width parameter γd gives superior results compared to the data-independent parameter γf for the {ALL, AML} problem. The reverse is true for the {Tumor, Nontumor} problem. The poorer performance of a data-dependent width parameter for the {Tumor, Nontumor} problem may be due to the larger number of potentially misclassified examples in the adenocarcinoma versus the leukemia data set. In previous analysis of the adenocarcinoma data (16), training examples that constituted support vectors in each of the 62 leave-one-out SVMs were used to pinpoint potentially mislabeled samples (support vectors are training examples that define the location of the decision surface). Similarly, it may be instructive to examine how the nature and number of such invariant support vector training examples vary according to feature subset, kernel function, and kernel parameters.
SVM training and evaluation.
Table 3 indicates that performance is influenced by how the training examples are partitioned (compare the false positives and false negatives in the "38 estimation, 34 test" and "34 estimation, 38 test" experiments). The MAR 50 subset and "38 estimation, 34 test" partitioning allow a direct comparison between the performance of SVMs and the published weighted vote predictor (11). In the latter, the estimation set was used to compute the MAR for each feature in the subset. This 50-feature predictor assigned the label for each of the 34 test examples as follows. Each gene Fl casts a weighted vote according to whether the expression level xl is closer to the value of the gene in class i (ALL) or j (AML) of the estimation set, v(Fl) = MAR(Fl)(xl - [µi,l + µj,l]/2). If the sum of the absolute values of the positive votes in the 50 genes is greater than the sum of the absolute values of the negative votes, then the test example is assigned to the positive class i. The weighted vote predictor made strong predictions for 29 of the 34 test examples, and in all instances, the assignments were true positives or true negatives. In contrast, an SVM makes true positive or true negative assignments for 33 of the 34 test examples.
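The weighted vote rule summarized above is simple enough to sketch. The code follows the vote formula v(Fl) = MAR(Fl)(xl - [µi,l + µj,l]/2) as stated in the text. The function name, the toy numbers, and the handling of ties (assigned to class j) are assumptions, and the threshold that separates "strong" from uncertain predictions is not given in the text, so it is omitted.

```python
def weighted_vote_predict(x, mar, mu_i, mu_j):
    """Assign a test example to class 'i' or 'j' by weighted voting.

    x: expression levels of the selected genes in the test example
       (assumed to be on the same scale as mu_i and mu_j).
    mar, mu_i, mu_j: per-gene relevance and class means from the estimation set.
    Returns the label and the total positive and negative vote magnitudes.
    """
    votes = [m * (xl - (a + b) / 2.0) for xl, m, a, b in zip(x, mar, mu_i, mu_j)]
    v_pos = sum(v for v in votes if v > 0)
    v_neg = sum(-v for v in votes if v < 0)
    label = "i" if v_pos > v_neg else "j"  # ties go to class j (assumption)
    return label, v_pos, v_neg


# Two hypothetical genes: one higher in class i, one higher in class j.
toy_mar = [1.0, -2.0]
toy_mu_i = [2.0, 0.0]
toy_mu_j = [0.0, 2.0]
```

A gene with negative relevance votes for class i when the test value falls below the midpoint of the two class means, so both genes in the toy example can support the same class.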
T cell/B cell, PB/BM, and male/female markers for experimental studies.
The MVR expert defines 25 markers for each of the additional leukemia problems that generalize as well as all 7,070 genes (Table 9). Comparing the maximum performance achieved and the maximum possible performance indicates that the data contain sufficient information for the {PB, BM} (68 vs. 72) and {T cell, B cell} (46 vs. 47) problems, but not for the {Male, Female} (31 vs. 49) problem. Furthermore, there is little difference in performance between the {Male, Female} top 50, middle 50, and bottom 50 gene sets. This suggests little association between these sample labels and transcription profiling data. Possible explanations for the poorer {Male, Female} results include 1) transcription profile data are poor indicators of sex, 2) the 7,070 probe set did not include those that can distinguish males from females, and 3) the patients (mostly children) had not achieved sexual maturity and thus not manifested any differences.
The three sets of MVR rankings appear to be biologically interesting (Tables 10-12). It should be noted, however, that they are valid only within the context of tissue samples derived from patients with ALL/AML. Bearing this in mind, the {T cell, B cell} top 50 contains many known T cell related genes. Genes that have no obvious annotation linking them to this cell type, such as protein disulfide isomerase, selenoprotein W, and Ras-related protein Rab-32, may be novel markers that can discriminate between T cells and B cells. Selenoprotein W is an intracellular protein that may be involved in protection against oxidative damage and muscle metabolism (4, 14). Overexpression of Lrp, the top ranked {PB, BM} gene, often predicts a poor response to chemotherapy in leukemia because it is one of the mechanisms by which cancer cells develop resistance to cytotoxic agents (reviewed in Ref. 19).
| DISCUSSION |
|---|
Although reducing the original 7,070 leukemia genes to 125 is appropriate in terms of a decision support system, this is still too many for in-depth experimental studies. Hence, the most informative experimental markers may be genes at the intersection of the top ranked genes: adipsin, azurocidin, and cystatin C. However, they are unlikely to be the sole determinants of the difference between ALL and AML because the generalization performance of these three genes is poorer than some of the larger gene subsets. The same is true for the four closely linked genes on chromosome 19p13.3 (azurocidin-proteinase 3-neutrophil elastase-adipsin). Nonetheless, the strategy proposed here provides a protocol for pinpointing experimentally informative marker genes and thus prioritizing subsequent investigations.
In transcription profiling studies, more genes are monitored than are probably required to understand the main problem. This "overdetermined" property suggests that broader questions could be answered if additional information were available for each sample. For the leukemia {T cell, B cell} and {PB, BM} secondary problems, the 7,070 genes are sufficiently informative that 25 markers can be defined that generalize as well as all 7,070 genes. It remains to be determined whether these markers are universal or are restricted to samples originating from ALL and AML patients.
Both the leukemia and adenocarcinoma data sets contain potentially misclassified samples, samples for which the original label (the "gold standard") may be incorrect (1/72 and 6/62 respectively). In a previous study of the latter data set (16), the subset of training examples that constituted support vectors across the entire series of leave-one-out SVMs was suggested to be indicative of samples most likely to have been misclassified (the set of support vectors does appear to depend upon which training example is withheld when estimating an SVM). Misclassification may be due to simple human error during sample handling, RNA preparation, data acquisition, data analysis, and so on. Standardized protocols stipulating rigorous procedures at each step of the process should reduce this type of problem and improve the chances of creating a coherent data set. The possibility of misclassification cannot be eliminated entirely because although a sample might appear to be visually and/or histologically of one type, it might be a member of the other class in reality. By training SVMs with hard margins, assuming no a priori labeling errors, potentially mislabeled samples can be pinpointed and subjected to additional investigation to verify their label. Given the nature of the underlying biology and technical issues surrounding generation of transcription profiling data, it is conceivable that many, if not all, cancer profiling experiments will contain noisy data and misclassified samples. Soft margin SVMs do take into consideration misclassified training examples, but it is difficult to estimate the underlying error rate at the present time. To improve the reliability of downstream analyses, it may be preferable to incorporate a preprocessing step that identifies, and subsequently corrects if necessary, any misclassified samples. Once achieved, the distance of a sample to the optimal hyperplane can be used to assess confidence in an assignment.
The results from this and previous (16) work highlight a need for theoretical research in several areas. As illustrated here, the generalization performance of SVMs depends not only on the precise learning problem, but also on the training and testing procedure employed. Although leave-one-out cross-validation is costly and time-consuming, it provides a reasonable estimate of the expected generalization error. In view of uncertainties in the labels assigned to samples and the small, imbalanced sample set, a relatively simple assessment of the overall performance of SVMs was utilized: the cost function used to judge accuracy was the total number of true positive and true negative assignments. Principled, sophisticated methods need to be developed for areas such as 1) selecting features in the presence of an unknown number of misclassified training examples, 2) choosing the appropriate class of kernel function and determining (near) optimal kernel parameters automatically, 3) training and evaluating a learning system that is both computationally efficient and yields biologically meaningful results, and 4) generating an integrated prediction from a set of feature relevance experts that vary in how well they perform on the classification and prediction task at hand (boosting and bagging).
Despite the aforementioned limitations, utilizing a mixture of feature relevance experts that incorporate SVMs for supervised learning problems appears to be a promising method for identifying marker genes in cancer profiling studies. This approach can be applied directly to identifying markers in transcription profiling studies addressing other discrimination problems such as those encountered in aging and responses to different doses and dose rates of xenobiotic agents such as radiation. Similarly, the technique could be used to identify marker experiments as opposed to marker genes. These ideas can be extended to molecular profiling studies in which the features monitored are not genes, but are molecules such as proteins, metabolites, and so on.
| ACKNOWLEDGMENTS |
|---|
| FOOTNOTES |
|---|
Address for reprint requests and other correspondence: I. S. Mian, Dept. of Cell and Mol. Biol., MS 74-197, Life Sciences Division, Lawrence Berkeley National Laboratory, 1 Cyclotron Rd., Berkeley, CA 94720 (E-mail: SMian{at}lbl.gov).
| REFERENCES |
|---|