Physiol. Genomics 32: 154-159, 2007.
First published October 9, 2007; doi:10.1152/physiolgenomics.00259.2006
1094-8341/07 $8.00
Received 22 November 2006;
accepted in final form 8 October 2007.
Toolbox
PADGE: analysis of heterogeneous patterns of differential gene expression
Li Li1,
Amitabha Chaudhuri2,
John Chant3 and
Zhijun Tang1
1 Department of Bioinformatics, Genentech Incorporated, South San Francisco, California
2 Department of Molecular Oncology, Genentech Incorporated, South San Francisco, California
3 Department of Molecular Biology, Genentech Incorporated, South San Francisco, California
ABSTRACT
We have devised a novel analysis approach, percentile analysis for differential gene expression (PADGE), for identifying genes differentially expressed between two groups of heterogeneous samples. PADGE was designed to compare expression profiles of sample subgroups at a series of percentile cutoffs and to examine the trend of relative expression between sample groups as expression level increases. Simulation studies showed that PADGE has more statistical power than t-statistics, cancer outlier profile analysis (COPA) (Tomlins SA, Rhodes DR, Perner S, Dhanasekaran SM, Mehra R, Sun XW, Varambally S, Cao X, Tchinda J, Kuefer R, Lee C, Montie JE, Shah RB, Pienta KJ, Rubin MA, Chinnaiyan AM. Science 310: 644–648, 2005), and kurtosis (Teschendorff AE, Naderi A, Barbosa-Morais NL, Caldas C. Bioinformatics 22: 2269–2275, 2006). Application of PADGE to microarray data sets in tumor tissues demonstrated its utility in prioritizing cancer genes encoding potential therapeutic targets or diagnostic markers. A web application was developed for researchers to analyze a large gene expression data set from heterogeneous biological samples and identify differentially expressed genes between subsets of sample classes using PADGE and other available approaches. Availability: http://www.cgl.ucsf.edu/Research/genentech/padge/.
microarray; cancer; tumor heterogeneity
INTRODUCTION
THE DEVELOPMENT OF microarray technologies allows biologists to study genome-wide patterns of gene expression (3, 4). One of the most common applications of these technologies is to identify genes differentially expressed between two sample groups (such as diseased vs. normal population). Widely used analytical methods, such as significance analysis of microarrays (SAM) (13) and CyberT (1), are based on t-statistics and are designed to compare the mean between two distributions of expression values. However, it is well known that biological samples are heterogeneous because of factors such as molecular subtypes or genetic background that are often unknown to the experimenter. As a result, important genes differentially expressed in a subset of samples can be missed by gene selection criteria based on difference in sample means.
In cancer research, a common approach for prioritizing cancer-related genes is to compare gene expression profiles between cancer and normal samples and to select genes with consistently higher expression levels in cancer samples. Such an approach ignores tumor heterogeneity and is not suitable for finding cancer genes that are overexpressed in only a subgroup of a patient population. For example, the oncogene ERBB2 (HER2) is overexpressed in 15–20% of breast tumors compared with normal breast tissues within a population (7). For a population of breast tumors, the median expression level of ERBB2 shows a modest 1.3-fold elevation in tumors compared with normal breast tissue. At the 90th percentile, however, ERBB2 expression is elevated sixfold (Fig. 1). The expression level of ERBB2 is used to define breast cancer subtypes and to predict drug response to targeted therapy (5, 9).

Fig. 1. ERBB2 expression in normal and breast tumor tissues. A: histogram and box plot of expression values of ERBB2 in normal breast tissues. Vertical blue line defines expression intensity at the 90th percentile. B: histogram and box plot of expression values of ERBB2 in breast tumor tissues. Vertical blue line defines expression intensity at the 90th percentile. C: expression values of ERBB2 using Affymetrix HGU133 chip at various percentile cutoffs and fold change between cancer and normal tissues. D: percentile plot for ERBB2 comparing expression levels between normal and breast cancer tissues at a series of percentiles (data from GeneLogic, Gaithersburg, MD, and preprocessed using Microarray Suite 5).
Recently, two methods, cancer outlier profile analysis (COPA) (12) and profile analysis using clustering and kurtosis (PACK) (11), were developed to identify cancer markers within tumor subclasses by measuring the median absolute deviation or the kurtosis of gene expression profiles. Both approaches consider a single profile of expression values from all samples of a given data set. Such approaches do not distinguish sample heterogeneity present in the whole population from heterogeneity specific to cancer patients.
To facilitate identification of cancer genes and to address the issue of sample heterogeneity in gene expression analysis, we introduce "percentile analysis for differential gene expression" (PADGE), a statistical tool that compares both the magnitude and the variability between two sample groups and identifies differential expression between subsets of sample groups. PADGE can be used in any research area where sample heterogeneity is a concern. A web application was developed to allow researchers to analyze a large gene expression data set from heterogeneous biological samples using PADGE as well as other available approaches. Unlike alternative approaches used for analyzing cancer gene expression, PADGE distinguishes genes that are overexpressed in a subset of a cancer group from those that have general overexpression across all cancer samples. In contrast to other recently developed methods (COPA, PACK), the pairwise comparison framework of PADGE also takes into consideration variability in normal samples and explicitly searches for heterogeneous patterns of cancer gene activation specific to cancer samples.
METHODS
For illustration purposes, the following discussion focuses on analyzing cancer microarray data sets. PADGE analyzes expression values from microarray experiments in which samples are classified into two groups, such as "normal" and "cancer." PADGE, like the other statistical methods discussed in this work, assumes that preprocessing steps, including normalization, have been conducted on the microarray data set being analyzed. In the example of ERBB2 in breast cancer shown in Fig. 1, ERBB2 is overexpressed in a subset of cancer samples, with a 90th percentile expression value of 15,053 in cancer samples vs. 2,520 in normal samples. It is notable that both normal and cancer populations contain outlier samples. To detect overexpression in a subset of cancer samples, rather than variability of expression in the whole population, we stratify cancer and normal samples separately by expression levels at various percentile cutoffs. To measure the magnitude of overexpression for each pair of sample subsets, we use the significance level derived from statistical tests, which allows the flexibility to adopt different test statistics into the subset comparison framework. The example also shows that cancer samples are more heterogeneous than normal samples: the distribution of gene expression in cancer samples has higher variability than that in normal samples (Fig. 1, A and B). To capture the relative variability of expression between two sample groups, we calculate the fold change at different percentile cutoffs and display the values in a percentile plot (Fig. 1, C and D). For prioritizing genes from large-scale data sets, a summary score is designed to reflect both the magnitude of overexpression in cancer sample subsets and the relative variability of expression in cancer samples compared with normal samples. The procedure consists of three components, as follows.
Subset statistical tests to detect overexpression in a subset of cancer samples.
For each probe set, expression values from normal samples and cancer samples are stratified by a series of user-defined percentile cutoffs, c1, c2, ..., cn. Pairs of cancer and normal sample subsets are constructed with samples having expression values above each percentile cutoff. Statistical tests (e.g., t-test, Wilcoxon test) are selected by users to compare expression values between all samples from normal and cancer tissues as well as between each pair of sample subsets.
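The subset construction described above can be sketched in a few lines of Python. This is an illustrative sketch only, not the authors' R implementation: the function names, the linear-interpolation percentile convention, and the one-sided Wilcoxon rank-sum test with a normal approximation (no tie correction) are all assumptions made for the example. The minimum subset size of 5 follows the requirement stated for the web tool.

```python
import math
from statistics import NormalDist


def percentile(values, pct):
    """Linear-interpolation percentile (pct in 0..100); an assumed convention."""
    xs = sorted(values)
    pos = (len(xs) - 1) * pct / 100.0
    lo, hi = math.floor(pos), math.ceil(pos)
    return xs[lo] + (xs[hi] - xs[lo]) * (pos - lo)


def rank_sum_p(cancer, normal):
    """One-sided Wilcoxon rank-sum P value (normal approximation, midranks for
    ties, no tie correction) for the alternative that cancer exceeds normal."""
    pooled = sorted([(v, 1) for v in cancer] + [(v, 0) for v in normal])
    ranks = [0.0] * len(pooled)
    i = 0
    while i < len(pooled):
        j = i
        while j + 1 < len(pooled) and pooled[j + 1][0] == pooled[i][0]:
            j += 1
        for idx in range(i, j + 1):
            ranks[idx] = (i + j) / 2 + 1  # average rank of the tied block
        i = j + 1
    n1, n2 = len(cancer), len(normal)
    w = sum(r for r, (_, is_cancer) in zip(ranks, pooled) if is_cancer)
    u = w - n1 * (n1 + 1) / 2
    mean = n1 * n2 / 2
    sd = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12)
    return 1.0 - NormalDist().cdf((u - mean) / sd)


def subset_tests(cancer, normal, cutoffs, min_size=5):
    """Map each percentile cutoff to the P value comparing the cancer and
    normal samples that lie above that cutoff within their own group."""
    results = {}
    for c in cutoffs:
        cancer_subset = [v for v in cancer if v > percentile(cancer, c)]
        normal_subset = [v for v in normal if v > percentile(normal, c)]
        if len(cancer_subset) >= min_size and len(normal_subset) >= min_size:
            results[c] = rank_sum_p(cancer_subset, normal_subset)
    return results


# Toy data (hypothetical): five of twenty cancer samples are strongly elevated.
normal = [float(i) for i in range(1, 21)]
cancer = [float(i) for i in range(1, 16)] + [100.0 + i for i in range(5)]
results = subset_tests(cancer, normal, [50, 75, 90])
```

In this toy example the 90th percentile cutoff is skipped because fewer than five samples remain above it, and the P value at the 75th percentile is smaller than at the 50th, since the elevated samples dominate the upper subset.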
Percentile plot to visualize differential expression.
Ratios of expression values between cancer and normal samples at the corresponding percentile cutoff are used to generate a percentile plot that shows the magnitude and trend of relative expression as expression level increases in both cancer and normal samples (Fig. 1, C and D). Analogous to quantile-quantile (Q-Q) plots, percentile plots compare the distributions of normal and cancer expression, but they are more intuitive and easily interpretable by biologists.
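As a rough illustration of the numbers behind a percentile plot, the sketch below computes the cancer-to-normal expression ratio at each cutoff. The function names and the interpolation convention are assumptions, and the toy data are hypothetical; only the ERBB2 values at the 90th percentile (15,053 and 2,520) come from the text.

```python
import math


def percentile(values, pct):
    """Linear-interpolation percentile (pct in 0..100); an assumed convention."""
    xs = sorted(values)
    pos = (len(xs) - 1) * pct / 100.0
    lo, hi = math.floor(pos), math.ceil(pos)
    return xs[lo] + (xs[hi] - xs[lo]) * (pos - lo)


def percentile_ratios(cancer, normal, cutoffs):
    """Return (cutoff, cancer value, normal value, ratio) rows, one per cutoff."""
    rows = []
    for c in cutoffs:
        cancer_value = percentile(cancer, c)
        normal_value = percentile(normal, c)
        rows.append((c, cancer_value, normal_value, cancer_value / normal_value))
    return rows


# Toy data (hypothetical): five of twenty cancer samples are strongly elevated.
normal = [float(i) for i in range(1, 21)]
cancer = [float(i) for i in range(1, 16)] + [100.0 + i for i in range(5)]

rows = percentile_ratios(cancer, normal, [50, 75, 90])
for cutoff, c_val, n_val, ratio in rows:
    print(f"{cutoff}th percentile: cancer={c_val:.2f} normal={n_val:.2f} ratio={ratio:.2f}")

# Values quoted in the text for ERBB2 at the 90th percentile.
erbb2_ratio_90 = 15053 / 2520  # roughly 6, in line with the sixfold elevation
```

Plotting the ratio column against the cutoffs gives the rising trend that a percentile plot is meant to reveal when only a subset of cancer samples is overexpressed.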
Summary score to prioritize candidates.
To prioritize candidate cancer genes, we designed a summary score S to measure both the significance of overexpression in sample subgroups and the increase of relative expression between cancer and normal samples across percentiles. We define
$$S = \max_{n}\left[\frac{r_n}{r_1}\times\left(-\log P_n\right)\right]$$
where rn is the expression ratio between cancer and normal samples and Pn is the P value for the subset comparison at percentile cn after Bonferroni correction for multiple percentiles. The summary score is the maximum of the product of two terms across percentile cutoffs. The term (rn/r1) reflects the relative variability of overexpression in cancer samples compared with normal samples, and the term –log(Pn) reflects the significance of the overexpression magnitude. P values derived from statistical tests are corrected for multiple hypothesis testing across all genes using the false discovery rate, denoted as q values (10). Besides ranking genes by summary scores, PADGE allows users to specify thresholds for q values, ratios, and fold increase of ratios across percentiles.
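The summary score described above, the maximum over percentile cutoffs of (rn/r1) multiplied by –log(Pn), can be sketched as follows. The base of the logarithm is not stated in the text; base 10 is assumed here. The function name and the floor applied to very small P values are likewise illustrative choices, not part of the published method.

```python
import math


def summary_score(ratios, p_values):
    """Summary score S from per-cutoff ratios and raw subset P values.

    Both lists are ordered by percentile cutoff c1..cn. P values are
    Bonferroni-corrected for the number of cutoffs before taking the log.
    """
    n_cutoffs = len(p_values)
    r1 = ratios[0]
    best = -math.inf
    for r, p in zip(ratios, p_values):
        p_adj = min(1.0, p * n_cutoffs)   # Bonferroni across percentile cutoffs
        p_adj = max(p_adj, 1e-300)        # guard against log(0); an assumption
        best = max(best, (r / r1) * -math.log10(p_adj))  # base 10 assumed
    return best


# Hypothetical ratios and P values at three cutoffs.
score = summary_score([1.0, 2.0, 6.0], [0.5, 0.01, 0.0001])
```

With these hypothetical inputs the maximum is reached at the last cutoff, where a sixfold increase in ratio relative to the first cutoff coincides with the smallest corrected P value.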
SIMULATION
To systematically assess the statistical power of PADGE and compare PADGE with alternative methods, we conducted simulation studies and used receiver operating characteristic (ROC) curves to compare the sensitivity and specificity of different methods. Expression values of 1,000 genes from 100 normal and 100 cancer samples were simulated from a normal distribution with a mean of 1. To allow different levels of variability among genes, the standard deviation was randomly drawn for each gene from the pool of 0.2, 0.4, 0.6, 0.8, and 1. We calculated PADGE scores for the 1,000 simulated genes and used the scores as the null distribution to estimate the type I error rate. The cutoff score is 4.2 for a P value of 0.01 and 2.5 for a P value of 0.05 (Fig. 2). We then added 2 units to the expression values of a subset of k cancer samples for one gene and used this gene as the true positive. Sensitivity and specificity of a given method were estimated using 1,000 simulations. In each simulation, the P value was computed as the proportion of genes with a score greater than that of the true positive. Combining all simulations, the true positive rate corresponding to a given false positive threshold was estimated as the proportion of simulations that identified the true positive gene at that threshold (i.e., simulations with P values no greater than the false positive rate). ROC curves were constructed using the true and false positive rates for each method being analyzed. To assess performance when subsets of different sizes show overexpression, a series of values (10, 20, 50, and 100) was chosen for k (Fig. 3 and Supplemental Table S1; supplemental data are available at the online version of this article).
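The simulation design can be sketched as below. This is a scaled-down illustration under stated assumptions: the score used here is a simple placeholder (the difference between the 90th-percentile values of the two groups), not the PADGE score, and the function names, default sizes, and random seed are hypothetical. The distribution parameters, the 2-unit shift, and the definitions of the P value and true positive rate follow the description above.

```python
import random

SD_POOL = [0.2, 0.4, 0.6, 0.8, 1.0]  # per-gene standard deviations from the text


def simulate_gene(rng, n_normal=100, n_cancer=100, k=0, shift=2.0):
    """Simulate one gene from N(1, sd); add `shift` to k cancer samples."""
    sd = rng.choice(SD_POOL)
    normal = [rng.gauss(1.0, sd) for _ in range(n_normal)]
    cancer = [rng.gauss(1.0, sd) for _ in range(n_cancer)]
    for i in range(k):
        cancer[i] += shift
    return cancer, normal


def placeholder_score(cancer, normal):
    """Stand-in score (NOT the PADGE score): difference of 90th-percentile values."""
    def q90(xs):
        return sorted(xs)[int(0.9 * (len(xs) - 1))]
    return q90(cancer) - q90(normal)


def empirical_p(scores, true_score):
    """Proportion of genes whose score exceeds that of the true positive."""
    return sum(s > true_score for s in scores) / len(scores)


def true_positive_rate(p_values, fpr_threshold):
    """Proportion of simulations with P value no greater than the threshold."""
    return sum(p <= fpr_threshold for p in p_values) / len(p_values)


def run_simulations(n_sims=20, n_genes=100, k=20, seed=0):
    """Return one empirical P value per simulation; gene 0 is the true positive."""
    rng = random.Random(seed)
    p_values = []
    for _ in range(n_sims):
        scores = [placeholder_score(*simulate_gene(rng, k=k))]
        scores += [placeholder_score(*simulate_gene(rng)) for _ in range(n_genes - 1)]
        p_values.append(empirical_p(scores, scores[0]))
    return p_values


p_values = run_simulations(k=100)
tpr_at_5_percent = true_positive_rate(p_values, 0.05)
```

Evaluating `true_positive_rate` over a grid of thresholds between 0 and 1 yields the points of an ROC curve for the chosen score; swapping in another score function allows methods to be compared on the same simulated data.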

Fig. 2. Null distribution and type I error rate of percentile analysis for differential gene expression (PADGE) score. Expression values of 1,000 genes are simulated from normal distribution. Cutoff scores corresponding to type I error rates of 1 and 5% are marked.

Fig. 3. Receiver operating characteristic (ROC) curves from simulation studies comparing PADGE with t-statistics, Kolmogorov-Smirnov (KS) test, CyberT, significance analysis of microarrays (SAM), cancer outlier profile analysis (COPA), and kurtosis (Kurt). Expression values of 1,000 genes are simulated from normal distribution. One cancer gene is overexpressed by 2 units in k of 100 cancer samples compared with 100 normal samples. ROC curves are plotted based on 1,000 simulations.
We compared the performance of PADGE with t-statistics, Kolmogorov-Smirnov (KS) statistics, CyberT (1), SAM (13), COPA, and kurtosis. Percentile cutoffs of 50, 60, 70, 80, and 90 were used in PADGE with the subset t-test. COPA was implemented as the 90th percentile of expression values after transformation of all data points using the overall median and median absolute deviation for a given gene (12). Kurtosis was calculated using the formula for unbiased estimation (11). The SAM statistic was calculated using the Bioconductor package "siggenes" (http://www.bioconductor.org/). As shown in Fig. 3, PADGE performs best when a small subset of cancer samples shows overexpression (k equals 10 and 20). When a large proportion of cancer samples (k equals 50) or all cancer samples (k equals 100) show overexpression, PADGE ties with the statistics that compare all samples (t-statistics, KS statistics, CyberT, and SAM), all performing perfectly. COPA performs second to PADGE when k equals 10 and is comparable with t-statistics, CyberT, and SAM when k equals 20. However, as k increases to 100, COPA has no statistical power. This pattern is even more pronounced for kurtosis, which loses statistical power completely when k increases to 50 and 100. The main reason is that COPA and kurtosis are designed to detect "outlier" profiles in which a small subset of cancer samples show overexpression. Kurtosis and t-statistics are least similar because t-statistics assume a normal distribution, whereas kurtosis measures how much a distribution deviates from the normal distribution with regard to the peak of the curve. CyberT and SAM perform similarly to t-statistics, as these procedures compare the mean between two distributions without explicitly considering overexpression in subsets of samples. KS statistics perform worst when k equals 10 and have a performance comparable with that of t-statistics in other cases.
KS statistics compare overall differences between two groups without assumptions about the underlying distributions. However, KS statistics do not incorporate the magnitude of differences in expression and are therefore less powerful. In summary, PADGE is more powerful than other representative test statistics in detecting genes overexpressed in a small subset of cancer samples, possibly because of genetic alterations of low prevalence. PADGE performs just as well as existing methods for comparing sample means. This trend was also observed when a different standard deviation (1.5) was used during simulation (Supplemental Fig. S1). PADGE is able to detect both outlier profiles and profiles with a large portion of overexpressed samples, therefore integrating the merits of COPA and kurtosis with those of t-statistics.
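For readers who want to reproduce the two outlier-oriented comparison statistics, the sketch below follows the descriptions given above. It assumes that the COPA statistic is the 90th percentile taken over all pooled samples after median and median absolute deviation (MAD) scaling, and that "the formula for unbiased estimation" refers to the common bias-corrected sample excess kurtosis; neither detail is confirmed by the text, and the function names are hypothetical.

```python
import math
from statistics import median


def percentile(values, pct):
    """Linear-interpolation percentile (pct in 0..100); an assumed convention."""
    xs = sorted(values)
    pos = (len(xs) - 1) * pct / 100.0
    lo, hi = math.floor(pos), math.ceil(pos)
    return xs[lo] + (xs[hi] - xs[lo]) * (pos - lo)


def copa_statistic(values, pct=90):
    """Percentile of the values after centering on the overall median and
    scaling by the median absolute deviation (fails if the MAD is zero)."""
    med = median(values)
    mad = median([abs(v - med) for v in values])
    return percentile([(v - med) / mad for v in values], pct)


def excess_kurtosis(values):
    """Bias-corrected sample excess kurtosis (requires at least 4 values)."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m4 = sum((v - mean) ** 4 for v in values) / n
    g2 = m4 / m2 ** 2 - 3.0
    return (n - 1) / ((n - 2) * (n - 3)) * ((n + 1) * g2 + 6.0)


# Hypothetical pooled profile (cancer and normal together) with one outlier.
profile = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 100.0]
copa = copa_statistic(profile)
kurt = excess_kurtosis(profile)
```

Both statistics are computed on a single pooled profile, which is why they cannot tell whether the outlying samples come from the cancer group or the normal group.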
APPLICATION
To demonstrate the utility of PADGE, we applied the method to microarray data sets from lung adenocarcinomas (2). Raw expression data were preprocessed and scale normalized using Microarray Suite 5 (MAS5) as implemented in the Bioconductor package Affy (http://www.bioconductor.org/). We chose a series of cutoffs between the 50th and 95th percentiles and used the Wilcoxon rank sum test. Pathway analysis using the Ingenuity tool (Ingenuity Systems, Mountain View, CA), a comprehensive curated human pathway database, showed that the top 500 selected probe sets (of 12,600 in total) are most significantly enriched with genes involved in cancer among all functional categories in the database (143 genes; Supplemental Table S2). All these cancer genes have significant q values in subset comparisons (<9 × 10⁻⁶), whereas only 49 have a q value <0.05 when all samples are included in the comparison (Supplemental Table S3). This is attributable to the fact that traditional statistical approaches for comparing two sample groups ignore sample heterogeneity. Many previously identified signature genes of adenocarcinoma subclasses were also selected, such as serine protease KLK11 and ornithine decarboxylase-1 (2). Among the top-ranking genes, the trefoil factor genes TFF1 and TFF3 (ranks 16 and 14, respectively) are associated with bone metastasis of breast cancer (8). These genes may increase the metastatic potential of lung adenocarcinoma cells as well and merit further investigation.
To compare the results with alternative approaches, we applied COPA using the same percentile cutoffs (the best rank was used for comparison) and calculated kurtosis and t-statistics of the expression profiles. From the top 500 probe sets ranked by PADGE, 498 have positive kurtosis, and the median kurtosis of these probe sets is 10, showing that the distribution of expression for PADGE-selected probe sets is non-Gaussian and has lengthened tails (Supplemental Table S3). The top 500 probe sets selected by each method were queried against the Ingenuity Pathway Database for cancer genes. PADGE has the most cancer genes among different methods (143), followed by kurtosis and t-statistics (117 and 112, respectively). The number of cancer genes selected by COPA is about one-half of that selected by PADGE according to Ingenuity Pathway Tool (76 vs. 143, respectively; Supplemental Table S4). The difference is mainly due to the fact that PADGE compares both the magnitude and the variability between cancer and normal expression, whereas COPA only measures the variability of all samples as a whole. As a result, PADGE identifies heterogeneous expression patterns specific to the cancer sample group instead of those common to the whole sample population. For example, epidermal growth factor receptor, a drug target in lung cancer whose overexpression is associated with poor prognosis (6), was ranked much higher by PADGE than by COPA (352 vs. 1,944, respectively). On the other hand, olfactory receptor OR2H2 was ranked toward the bottom by PADGE (11,529) but has a very high COPA rank (263) because of the heterogeneity in both normal and cancer samples (Fig. 4).

Fig. 4. A: percentile plots for epidermal growth factor receptor (EGFR) and OR2H2 in lung adenocarcinomas (2). Sorted expression levels of EGFR and OR2H2 in normal and cancer samples are shown in B and C, respectively.
WEB TOOL
PADGE is implemented in the statistical language R, and a web application was developed using the Perl Common Gateway Interface (CGI). The PADGE application and its source code are freely accessible at http://www.cgl.ucsf.edu/Research/genentech/padge/, allowing researchers to perform interactive analysis of their own data sets. For comparative analysis, all-sample statistical analyses (t-test or Wilcoxon rank sum test), the COPA approach (12), and kurtosis (11) are also implemented, and their results are displayed side by side with those of PADGE, providing a resource of web-based tools for analysis of heterogeneous expression patterns (Fig. 5).

Fig. 5. Screen shots of PADGE web application. Result pages from the example data set (2) are shown. A set of parameters can be chosen by users to execute PADGE. Results from PADGE, COPA, kurtosis, and all sample statistical analyses (t-test or Wilcoxon rank sum test) are displayed for comparison.
Data upload.
PADGE requires users to upload two tab-delimited text files. One is the data file that contains expression values. The first column of the file lists gene names, and the first row lists sample names. The other is the sample file that contains sample classification. The first column lists sample names that match exactly with those in the expression file, and the second column has class labels, such as cancer and normal. PADGE identifies overexpression in subsets of cancer samples compared with normal samples.
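The two input files can be read with a few lines of Python, which may help when preparing or checking data before upload. This is a sketch under assumptions: the data file is assumed to have a header row whose first cell sits above the gene-name column, the sample file is assumed to have no header row, and the function names and example contents are hypothetical.

```python
import csv
import io


def read_expression(handle):
    """Read the tab-delimited data file: first row holds sample names,
    first column holds gene names, remaining cells hold expression values."""
    reader = csv.reader(handle, delimiter="\t")
    header = next(reader)
    samples = header[1:]  # assumes a leading cell above the gene-name column
    data = {}
    for row in reader:
        if row:
            data[row[0]] = [float(x) for x in row[1:]]
    return samples, data


def read_classes(handle):
    """Read the tab-delimited sample file: sample name, then class label."""
    return {row[0]: row[1] for row in csv.reader(handle, delimiter="\t") if row}


def split_by_class(samples, values, classes, label):
    """Select the expression values of the samples carrying a class label."""
    return [v for s, v in zip(samples, values) if classes[s] == label]


# Hypothetical file contents, held in memory for illustration.
expression_text = "gene\ts1\ts2\ts3\nERBB2\t1.0\t2.0\t9.0\n"
class_text = "s1\tnormal\ns2\tnormal\ns3\tcancer\n"

samples, data = read_expression(io.StringIO(expression_text))
classes = read_classes(io.StringIO(class_text))
cancer_values = split_by_class(samples, data["ERBB2"], classes, "cancer")
normal_values = split_by_class(samples, data["ERBB2"], classes, "normal")
```

Because `split_by_class` looks up every sample name in the class table, a name that does not match exactly between the two files raises a `KeyError`, which mirrors the exact-match requirement stated above.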
Analysis methods and parameters.
A set of parameters can be chosen by users to execute PADGE (Fig. 5). Users can define a series of percentile cutoffs to stratify sample groups for subset statistical tests. A minimum sample size of 5 is required for subset statistical tests. These percentile cutoffs are used as the x-axis intervals in the PADGE plots. Users can choose either the t-test or the Wilcoxon rank sum test for sample subset comparisons. To filter gene candidates, users can set a threshold on the minimum q value obtained across subset statistical tests and/or a threshold on the maximum ratio between sample groups across percentile cutoffs. For visualization, users can specify the number of top-ranking genes (ranked by the summary score, defined in METHODS) for which a PADGE plot will be generated. Besides PADGE, users can choose to apply other methods, such as the t-test, Wilcoxon test, COPA, and kurtosis, to analyze their data.
Analysis results.
Depending on the size of the data set, the analysis may take a while to finish. We provide automatic e-mail service to inform users about the analysis status. After the analysis is finished, users will be notified by e-mail with a unique URL linking to the result page. A link for "View result" will also appear in the interactive web session. Analysis results are presented in a tabular form with each entry linked to its PADGE plot when applicable. Users may sort the result table by a given attribute by clicking on the corresponding column name. Each gene is linked to the National Center for Biotechnology Information (NCBI) Entrez Gene page for detailed annotation. The result table is segmented into consecutive pages for interactive navigation and is also available for bulk download. Results of an example data set are available at the website (Fig. 5).
Conclusion.
To address heterogeneous expression patterns of biological samples, we devised the PADGE tool. PADGE examines overexpression in subsets of one sample group compared with a reference sample group. From simulation studies, PADGE outperforms multiple representative statistics designed to compare sample means when a small subset of samples show differential expression. PADGE also outperforms alternative methods developed to identify heterogeneous differential expression, such as COPA and kurtosis. We examined the utility of PADGE in analyzing cancer gene expression. In contrast to other approaches that seek to address tumor heterogeneity, PADGE uses a pairwise comparison approach that takes into consideration the variability present in both cancer and normal samples. The results can be visualized using percentile plots to examine how gene expression in cancer samples changes compared with that of normal samples at different percentiles of expression intensity. A web application is provided to researchers to access PADGE and alternative approaches for analyzing their own data sets. PADGE serves as a useful addition to the tools available for mining gene expression data sets and for identifying candidates for cancer therapeutics and diagnostics.
ACKNOWLEDGMENTS
We thank Tom Wu for insightful discussion, Xiaolong Yu for suggestions and help with statistical analysis, Kenneth Jung for implementation of CyberT, and William Wood and Zemin Zhang for critical review of the manuscript.
FOOTNOTES
Address for reprint requests and other correspondence: Z. Tang, Dept. of Bioinformatics, Genentech Inc., 1 DNA Way, South San Francisco, CA 94080 (e-mail: jtang{at}gene.com).
Article published online before print. See web site for date of publication (http://physiolgenomics.physiology.org).
REFERENCES
1. Baldi P, Long AD. A Bayesian framework for the analysis of microarray expression data: regularized t-test and statistical inferences of gene changes. Bioinformatics 17: 509–519, 2001.
2. Bhattacharjee A, Richards WG, Staunton J, Li C, Monti S, Vasa P, Ladd C, Beheshti J, Bueno R, Gillette M, Loda M, Weber G, Mark EJ, Lander ES, Wong W, Johnson BE, Golub TR, Sugarbaker DJ, Meyerson M. Classification of human lung carcinomas by mRNA expression profiling reveals distinct adenocarcinoma subclasses. Proc Natl Acad Sci USA 98: 13790–13795, 2001.
3. DeRisi JL, Iyer VR, Brown PO. Exploring the metabolic and genetic control of gene expression on a genomic scale. Science 278: 680–686, 1997.
4. Lockhart DJ, Dong H, Byrne MC, Follettie MT, Gallo MV, Chee MS, Mittmann M, Wang C, Kobayashi M, Horton H, Brown EL. Expression monitoring by hybridization to high-density oligonucleotide arrays. Nat Biotechnol 14: 1675–1680, 1996.
5. Pegram MD, Lipton A, Hayes DF, Weber BL, Baselga JM, Tripathy D, Baly D, Baughman SA, Twaddell T, Glaspy JA, Slamon DJ. Phase II study of receptor-enhanced chemosensitivity using recombinant humanized anti-p185HER2/neu monoclonal antibody plus cisplatin in patients with HER2/neu-overexpressing metastatic breast cancer refractory to chemotherapy treatment. J Clin Oncol 16: 2659–2671, 1998.
6. Selvaggi G, Novello S, Torri V, Leonardo E, De Giuli P, Borasio P, Mossetti C, Ardissone F, Lausi P, Scagliotti GV. Epidermal growth factor receptor overexpression correlates with a poor prognosis in completely resected non-small-cell lung cancer. Ann Oncol 15: 28–32, 2004.
7. Slamon DJ, Clark GM, Wong SG, Levin WJ, Ullrich A, McGuire WL. Human breast cancer: correlation of relapse and survival with amplification of the HER-2/neu oncogene. Science 235: 177–182, 1987.
8. Smid M, Wang Y, Klijn JG, Sieuwerts AM, Zhang Y, Atkins D, Martens JW, Foekens JA. Genes associated with breast cancer metastatic to bone. J Clin Oncol 24: 2261–2267, 2006.
9. Sorlie T, Tibshirani R, Parker J, Hastie T, Marron JS, Nobel A, Deng S, Johnsen H, Pesich R, Geisler S, Demeter J, Perou CM, Lonning PE, Brown PO, Borresen-Dale AL, Botstein D. Repeated observation of breast tumor subtypes in independent gene expression data sets. Proc Natl Acad Sci USA 100: 8418–8423, 2003.
10. Storey JD, Tibshirani R. Statistical significance for genomewide studies. Proc Natl Acad Sci USA 100: 9440–9445, 2003.
11. Teschendorff AE, Naderi A, Barbosa-Morais NL, Caldas C. PACK: Profile Analysis using Clustering and Kurtosis to find molecular classifiers in cancer. Bioinformatics 22: 2269–2275, 2006.
12. Tomlins SA, Rhodes DR, Perner S, Dhanasekaran SM, Mehra R, Sun XW, Varambally S, Cao X, Tchinda J, Kuefer R, Lee C, Montie JE, Shah RB, Pienta KJ, Rubin MA, Chinnaiyan AM. Recurrent fusion of TMPRSS2 and ETS transcription factor genes in prostate cancer. Science 310: 644–648, 2005.
13. Tusher VG, Tibshirani R, Chu G. Significance analysis of microarrays applied to the ionizing radiation response. Proc Natl Acad Sci USA 98: 5116–5121, 2001.
Copyright © 2007 by the American Physiological Society.