## Abstract

We have devised a novel analysis approach, percentile analysis for differential gene expression (PADGE), for identifying genes differentially expressed between two groups of heterogeneous samples. PADGE was designed to compare expression profiles of sample subgroups at a series of percentile cutoffs and to examine the trend of relative expression between sample groups as expression level increases. Simulation studies showed that PADGE has more statistical power than *t*-statistics, cancer outlier profile analysis (COPA) (Tomlins SA, Rhodes DR, Perner S, Dhanasekaran SM, Mehra R, Sun XW, Varambally S, Cao X, Tchinda J, Kuefer R, Lee C, Montie JE, Shah RB, Pienta KJ, Rubin MA, Chinnaiyan AM. *Science* 310: 644–648, 2005), and kurtosis (Teschendorff AE, Naderi A, Barbosa-Morais NL, Caldas C. *Bioinformatics* 22: 2269–2275, 2006). Application of PADGE to microarray data sets in tumor tissues demonstrated its utility in prioritizing cancer genes encoding potential therapeutic targets or diagnostic markers. A web application was developed for researchers to analyze a large gene expression data set from heterogeneous biological samples and identify differentially expressed genes between subsets of sample classes using PADGE and other available approaches. Availability: http://www.cgl.ucsf.edu/Research/genentech/padge/.

- microarray
- cancer
- tumor heterogeneity

The development of microarray technologies allows biologists to study genome-wide patterns of gene expression (3, 4). One of the most common applications of these technologies is to identify genes differentially expressed between two sample groups (such as diseased vs. normal populations). Widely used analytical methods, such as significance analysis of microarrays (SAM) (13) and CyberT (1), are based on *t*-statistics and are designed to compare the mean between two distributions of expression values. However, it is well known that biological samples are heterogeneous because of factors such as molecular subtypes or genetic background that are often unknown to the experimenter. As a result, important genes differentially expressed in a subset of samples can be missed by gene selection criteria based on differences in sample means.

In cancer research, a common approach for prioritizing cancer-related genes is to compare gene expression profiles between cancer and normal samples and to select genes with consistently higher expression levels in cancer samples. Such an approach ignores tumor heterogeneity and is not suitable for finding cancer genes that are overexpressed in only a subgroup of a patient population. For example, oncogene ERBB2 (HER2) is overexpressed in 15–20% of breast tumors compared with normal breast tissues within a population (7). For a population of breast tumors, the median expression level of ERBB2 shows a modest 1.3-fold elevation in tumors compared with breast tissue. However, at the 90th percentile, ERBB2 is elevated sixfold in its expression (Fig. 1). Expression level of ERBB2 is used to define breast cancer subtypes and predict drug response to targeted therapy (5, 9).

Recently, two methods, cancer outlier profile analysis (COPA) (12) and profile analysis using clustering and kurtosis (PACK) (11), were developed to identify cancer markers within tumor subclasses by measuring median absolute deviation or kurtosis of gene expression profiles. Both approaches consider a single profile of expression values from all samples of a given data set. Such approaches do not distinguish sample heterogeneity present in the whole population from heterogeneity specific to cancer patients.

To facilitate identification of cancer genes and to address the issue of sample heterogeneity in gene expression analysis, we introduce “percentile analysis for differential gene expression” (PADGE), a statistical tool that compares both the magnitude and the variability between two sample groups and identifies differential expression between subsets of sample groups. PADGE can be used in any research area where sample heterogeneity is a concern. A web application was developed to allow researchers to analyze a large gene expression data set from heterogeneous biological samples using PADGE as well as other available approaches. Unlike alternative approaches used for analyzing cancer gene expression, PADGE distinguishes genes that are overexpressed in a subset of a cancer group from those that have general overexpression across all cancer samples. Unlike other recently developed methods (COPA, PACK), the pairwise comparison framework of PADGE also takes into consideration variability in normal samples and explicitly searches for heterogeneous patterns of cancer gene activation specific to cancer samples.

## METHODS

For illustration purposes, the following discussion will focus on analyzing cancer microarray data sets. PADGE analyzes expression values from microarray experiments in which samples are classified into two groups, such as “normal” or “cancer.” PADGE, as well as other statistical methods discussed in this work, assumes that preprocessing steps including normalization have been conducted on the microarray data set being analyzed. In the example of ERBB2 in breast cancer shown in Fig. 1, a subset of cancer samples overexpresses ERBB2, with 90th percentile expression at 15,053 vs. 2,520 in normal samples. It is notable that both normal and cancer populations contain outlier samples. To detect overexpression in a subset of cancer samples, instead of variability of expression in the whole population, we stratify cancer and normal samples separately by expression levels at various percentile cutoffs. To measure the magnitude of overexpression comparing each pair of sample subsets, we use the significance level derived from statistical tests, which allows the flexibility to adopt different test statistics into the subset comparison framework. The example also shows that cancer samples are more heterogeneous than normal samples, as shown by the higher variability of the gene expression distribution in cancer than in normal samples (Fig. 1, *A* and *B*). To capture the relative variability of expression between two sample groups, we calculate the fold change at different percentile cutoffs and display the values in a percentile plot (Fig. 1, *C* and *D*). For prioritizing genes from large-scale data sets, a summary score is designed to reflect both the magnitude of overexpression in cancer sample subsets and the relative variability of expression in cancer samples compared with normal samples. The procedure consists of three components, as follows.

#### Subset statistical tests to detect overexpression in a subset of cancer samples.

For each probe set, expression values from normal samples and cancer samples are stratified by a series of user-defined percentile cutoffs, *c*_{1}, *c*_{2}, …, *c*_{n}. Pairs of cancer and normal sample subsets are constructed with samples having expression values above each percentile cutoff. Statistical tests (e.g., *t*-test, Wilcoxon test) are selected by users to compare expression values between all samples from normal and cancer tissues as well as between each pair of sample subsets.
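The subset construction can be illustrated with a minimal Python sketch. It is not the PADGE implementation (which is written in R); the function names, the linear-interpolation percentile, the strict "greater than" rule for subset membership, and the normal approximation to the Wilcoxon rank-sum test are all assumptions made here to keep the example self-contained.

```python
import math

def percentile(values, q):
    """Return the q-th percentile (0-100), interpolating linearly between order statistics."""
    xs = sorted(values)
    pos = (len(xs) - 1) * q / 100.0
    lo, hi = math.floor(pos), math.ceil(pos)
    return xs[lo] + (xs[hi] - xs[lo]) * (pos - lo)

def subset_above(values, q):
    """Keep the samples whose expression exceeds their own group's q-th percentile."""
    cut = percentile(values, q)
    return [v for v in values if v > cut]

def rank_sum_p(x, y):
    """Two-sided Wilcoxon rank-sum P value (normal approximation, midranks for ties)."""
    pooled = sorted([(v, 0) for v in x] + [(v, 1) for v in y])
    ranks = [0.0] * len(pooled)
    i = 0
    while i < len(pooled):
        j = i
        while j + 1 < len(pooled) and pooled[j + 1][0] == pooled[i][0]:
            j += 1
        for t in range(i, j + 1):
            ranks[t] = (i + j) / 2.0 + 1.0  # average rank of the tied block
        i = j + 1
    n1, n2 = len(x), len(y)
    r1 = sum(r for r, (_, group) in zip(ranks, pooled) if group == 0)
    u = r1 - n1 * (n1 + 1) / 2.0
    mean = n1 * n2 / 2.0
    sd = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12.0)
    return math.erfc(abs((u - mean) / sd) / math.sqrt(2.0))

def subset_tests(cancer, normal, cutoffs):
    """P value for all samples (key "all") and for each pair of above-cutoff subsets."""
    results = {"all": rank_sum_p(cancer, normal)}
    for q in cutoffs:
        results[q] = rank_sum_p(subset_above(cancer, q), subset_above(normal, q))
    return results

# Made-up values: only the top 4 of 20 cancer samples are overexpressed.
normal = [float(i) for i in range(1, 21)]
cancer = [float(i) for i in range(1, 17)] + [40.0, 45.0, 50.0, 55.0]
p_by_cutoff = subset_tests(cancer, normal, [50, 80])
```

In this toy data set the comparison of all samples is far from significant, whereas the comparison of the subsets above the 80th percentile is, which is the situation the subset tests are meant to capture.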

#### Percentile plot to visualize differential expression.

Ratios of expression values between cancer and normal samples at the corresponding percentile cutoff are used to generate a percentile plot that shows the magnitude and trend of relative expression as expression level increases in both cancer and normal samples (Fig. 1, *C* and *D*). Analogous to quantile-quantile (Q-Q) plots, percentile plots compare the distributions of normal and cancer expression, but they are more intuitive and easily interpretable by biologists.
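As a rough illustration of the quantities behind a percentile plot, the sketch below computes the cancer-to-normal ratio of expression at each cutoff and prints a crude text rendering. We assume here that the ratio is taken between the percentile values of the two groups and that expression values are positive; the function names and the sample values are hypothetical. For the ERBB2 values quoted in METHODS, the same calculation at the 90th percentile gives 15,053/2,520 ≈ 5.97, consistent with the sixfold elevation.

```python
import math

def percentile(values, q):
    """Return the q-th percentile (0-100), interpolating linearly between order statistics."""
    xs = sorted(values)
    pos = (len(xs) - 1) * q / 100.0
    lo, hi = math.floor(pos), math.ceil(pos)
    return xs[lo] + (xs[hi] - xs[lo]) * (pos - lo)

def percentile_ratios(cancer, normal, cutoffs):
    """Cancer/normal ratio of expression at each percentile cutoff."""
    return [(q, percentile(cancer, q) / percentile(normal, q)) for q in cutoffs]

# Made-up values: the two groups agree up to the 70th percentile and diverge above it.
normal = [float(i) for i in range(1, 21)]
cancer = [float(i) for i in range(1, 17)] + [40.0, 45.0, 50.0, 55.0]

ratios = percentile_ratios(cancer, normal, [50, 60, 70, 80, 90])
for q, r in ratios:
    # One text bar per cutoff; bar length is proportional to the ratio.
    print(f"{q:>3}th percentile  ratio {r:5.2f}  " + "#" * round(r * 10))
```

A ratio that stays near 1 at low cutoffs and rises at high cutoffs is the pattern expected for a gene overexpressed in only a subset of cancer samples.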

#### Summary score to prioritize candidates.

To prioritize candidate cancer genes, we designed a summary score *S* to measure both the significance of overexpression in sample subgroups and the increase of relative expression between cancer and normal samples across percentiles. We define

$$S = \max_{n}\left[\frac{r_{n}}{r_{1}} \times \left(-\log P_{n}\right)\right]$$

where *r*_{n} is the expression ratio between cancer and normal and *P*_{n} is the *P* value for subset comparison at percentile *c*_{n} after Bonferroni correction for multiple percentiles. The summary score is the maximum of the product of two terms across percentile cutoffs. The term (*r*_{n}/*r*_{1}) reflects relative variability of overexpression in cancer samples compared with normal samples, and the term −log(*P*_{n}) reflects the significance of the overexpression magnitude. *P* values derived from statistical tests are corrected for multiple hypotheses testing across all genes using false discovery rate, denoted as *q* values (10). Besides ranking genes by summary scores, PADGE allows users to specify thresholds for *q* values, ratios, and fold increase of ratios across percentiles.
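The summary score, taken as the maximum across cutoffs of (*r*_{n}/*r*_{1}) × −log(*P*_{n}), can be sketched as follows. The base of the logarithm is not stated in the text, so base 10 is an assumption here, as are the function name, the cap of corrected *P* values at 1, and the small floor that guards against log(0).

```python
import math

def summary_score(ratios, p_values, floor=1e-300):
    """Maximum over cutoffs of (r_n / r_1) * -log10(P_n), with Bonferroni-corrected P_n.

    ratios and p_values are ordered by percentile cutoff c_1, ..., c_n.
    """
    if not ratios or len(ratios) != len(p_values):
        raise ValueError("ratios and p_values must be non-empty and of equal length")
    n = len(p_values)
    terms = []
    for r, p in zip(ratios, p_values):
        p_adj = min(1.0, p * n)    # Bonferroni correction across the n cutoffs
        p_adj = max(p_adj, floor)  # guard against log(0)
        terms.append((r / ratios[0]) * -math.log10(p_adj))
    return max(terms)

# Made-up ratios and uncorrected P values at three cutoffs.
score = summary_score([1.0, 2.0, 4.0], [0.5, 0.01, 0.001])
print(round(score, 3))
```

In this example the highest cutoff dominates the score because both the ratio increase and the significance are largest there.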

## SIMULATION

To systematically assess the statistical power of PADGE and compare PADGE with alternative methods, we conducted simulation studies and used receiver operating characteristic (ROC) curves to compare the sensitivity and specificity of different methods. Expression values of 1,000 genes from 100 normal and 100 cancer samples were simulated from a normal distribution with mean at 1. To allow different levels of variability among genes, the standard deviation was randomly drawn for each gene from the pool of 0.2, 0.4, 0.6, 0.8, and 1. We calculated PADGE scores for 1,000 simulated genes and used the scores as null distribution to estimate the type I error rate. The cutoff score is 4.2 for a *P* value of 0.01 and 2.5 for a *P* value of 0.05 (Fig. 2). Then we added 2 units to the expression values of a subset of *k* samples from cancer for one gene and used this gene as the true positive. Sensitivity and specificity of a given method were estimated using 1,000 simulations. In each simulation, the *P* value was computed as the proportion of genes with a score greater than that for the true positive. Combining all simulations, the true positive rate corresponding to a given false positive threshold was estimated as the proportion of simulations identifying the true positive gene using the false positive rate threshold (i.e., having *P* values no greater than the false positive rate). ROC curves were constructed using true and false positive rates for each method being analyzed. To assess the performance in situations with a varied subset of samples showing overexpression, a series of values, 10, 20, 50, 100, was chosen for *k* (Fig. 3 and Supplemental Table S1; supplemental data are available at the online version of this article).
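The simulation design can be outlined in a scaled-down Python sketch. The score used below (a difference of upper order statistics) is a deliberately simple stand-in and is not the PADGE score; the function names, the number of genes (200), and the number of simulations (20) are likewise assumptions chosen to keep the example fast.

```python
import random

SD_POOL = [0.2, 0.4, 0.6, 0.8, 1.0]

def simulate_gene(rng, n_normal=100, n_cancer=100, k=0, shift=2.0):
    """Draw one gene from N(1, sd); add `shift` to k cancer samples if k > 0."""
    sd = rng.choice(SD_POOL)
    normal = [rng.gauss(1.0, sd) for _ in range(n_normal)]
    cancer = [rng.gauss(1.0, sd) for _ in range(n_cancer)]
    for i in range(k):
        cancer[i] += shift
    return cancer, normal

def toy_score(cancer, normal):
    """Stand-in score (not PADGE): difference of the order statistics near the 90th percentile."""
    c, n = sorted(cancer), sorted(normal)
    return c[int(0.9 * len(c))] - n[int(0.9 * len(n))]

def empirical_p(true_score, null_scores):
    """Proportion of null genes scoring higher than the true positive."""
    return sum(s > true_score for s in null_scores) / len(null_scores)

def one_simulation(rng, k, n_genes=200):
    """Score n_genes - 1 null genes and one spiked gene; return the spiked gene's P value."""
    null_scores = [toy_score(*simulate_gene(rng)) for _ in range(n_genes - 1)]
    true_score = toy_score(*simulate_gene(rng, k=k))
    return empirical_p(true_score, null_scores)

def true_positive_rate(p_values, fpr):
    """Proportion of simulations whose P value is no greater than the false positive rate."""
    return sum(p <= fpr for p in p_values) / len(p_values)

rng = random.Random(0)
p_values = [one_simulation(rng, k=100) for _ in range(20)]
tpr = true_positive_rate(p_values, fpr=0.05)
```

Repeating the last three lines over a grid of false positive thresholds and over *k* = 10, 20, 50, and 100 would give the points of an ROC curve for the chosen score.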

We compared the performance of PADGE with *t*-statistics, Kolmogorov-Smirnov (KS) statistics, CyberT (1), SAM (13), COPA, and kurtosis. Percentile cutoffs of 50, 60, 70, 80, and 90 were used in PADGE with subset *t*-test. COPA was implemented as the 90th percentile of expression values after transformation of all data points using overall median and median absolute deviation for a given gene (12). Kurtosis was calculated using the formula for unbiased estimation (11). SAM statistic was calculated using the Bioconductor package “siggenes” (http://www.bioconductor.org/). As shown in Fig. 3, PADGE performs the best when a small subset of cancer samples show overexpression (*k* equals 10 and 20) and ties with other statistics for comparing all samples, such as *t*-statistics, KS statistics, CyberT and SAM, performing perfectly when a large proportion of cancer samples show overexpression (*k* equals 50) or all cancer samples show overexpression (*k* equals 100). COPA performs second to PADGE when *k* equals 10 and is comparable with *t*-statistics, CyberT, and SAM when *k* equals 20. However, as *k* increases to 100, COPA has no statistical power. This pattern is even more pronounced for kurtosis, which loses statistical power completely when *k* increases to 50 and 100. The main reason is that COPA and kurtosis are designed to detect “outlier” profiles in which a small subset of cancer samples show overexpression. Kurtosis and *t*-statistics are least similar because *t*-statistics assume normal distribution whereas kurtosis measures how much a distribution deviates from the normal distribution regarding the peak of the curve. CyberT and SAM perform similarly to *t*-statistics, as these procedures compare the mean between two distributions without explicitly considering overexpression in subsets of samples. KS statistics perform worst when *k* equals 10 and have a performance comparable with that of *t*-statistics in other cases. KS statistics compare overall differences between two groups without assumptions about the underlying distributions. However, KS statistics do not incorporate the magnitude of differences in expression and are therefore less powerful. In summary, PADGE is more powerful than other representative test statistics in detecting genes overexpressed in a small subset of cancer samples, possibly because of genetic alterations of low prevalence. PADGE performs just as well as existing methods for comparing sample means. This trend was also observed when a different standard deviation (1.5) was used during simulation (Supplemental Fig. S1). PADGE is able to detect both outlier profiles and profiles with a large portion of overexpressed samples, therefore integrating the merits of COPA and kurtosis with those of *t*-statistics.
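For reference, the two outlier-oriented statistics used in the comparison can be sketched in Python. The COPA score follows the description given here (the 90th percentile of values centered on the overall median and scaled by the median absolute deviation). For kurtosis we assume the standard bias-corrected sample excess kurtosis; whether this matches the exact formula of reference 11 has not been verified, and the function names are hypothetical.

```python
import math
import statistics

def percentile(values, q):
    """Return the q-th percentile (0-100), interpolating linearly between order statistics."""
    xs = sorted(values)
    pos = (len(xs) - 1) * q / 100.0
    lo, hi = math.floor(pos), math.ceil(pos)
    return xs[lo] + (xs[hi] - xs[lo]) * (pos - lo)

def copa_score(values, q=90):
    """q-th percentile of (x - median) / MAD, computed over all samples of one gene."""
    med = statistics.median(values)
    mad = statistics.median([abs(v - med) for v in values])
    if mad == 0:
        raise ValueError("MAD is zero; the transformation is undefined")
    return percentile([(v - med) / mad for v in values], q)

def unbiased_excess_kurtosis(values):
    """Bias-corrected sample excess kurtosis (assumed form of the 'unbiased' estimator)."""
    n = len(values)
    if n < 4:
        raise ValueError("at least 4 values are required")
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m4 = sum((v - mean) ** 4 for v in values) / n
    g2 = m4 / m2 ** 2 - 3.0
    return (n - 1) / ((n - 2) * (n - 3)) * ((n + 1) * g2 + 6.0)

# Made-up profile with a single outlier sample.
profile = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 100.0]
copa = copa_score(profile)
kurt = unbiased_excess_kurtosis(profile)
```

Both statistics are computed on the pooled profile of one gene without reference to class labels, which is why they cannot tell heterogeneity in cancer samples from heterogeneity in normal samples.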

## APPLICATION

To demonstrate the utility of PADGE, we applied the method to microarray data sets from lung adenocarcinomas (2). Raw expression data were preprocessed and scale normalized using Microarray Suite 5 (MAS5) as implemented in the Bioconductor package affy (http://www.bioconductor.org/). We chose a series of cutoffs between 50th and 95th percentiles and used the Wilcoxon rank sum test. Pathway analysis using the Ingenuity tool (Ingenuity Systems, Mountain View, CA), a comprehensive curated human pathway database, showed that the top 500 selected probe sets (total 12,600) are most significantly enriched with genes involved in cancer among all functional categories in the database (143 genes; Supplemental Table S2). All these cancer genes have significant *q* values in subset comparisons (<9 × 10^{−6}), whereas only 49 have a *q* value <0.05 when all samples are included in the comparison (Supplemental Table S3). This is attributable to the fact that traditional statistical approaches for comparing two sample groups ignore sample heterogeneity. Many previously identified signature genes of adenocarcinoma subclasses were also selected, such as serine protease KLK11 and ornithine decarboxylase-1 (2). Among the top ranking genes, the trefoil factor genes TFF1 and TFF3 (rank 16 and 14) are associated with bone metastasis of breast cancer (8). These genes may increase the metastatic potential of lung adenocarcinoma cells as well and merit further investigation.

To compare the results with alternative approaches, we applied COPA using the same percentile cutoffs (the best rank was used for comparison) and calculated kurtosis and *t*-statistics of the expression profiles. From the top 500 probe sets ranked by PADGE, 498 have positive kurtosis, and the median kurtosis of these probe sets is 10, showing that the distribution of expression for PADGE-selected probe sets is non-Gaussian and has lengthened tails (Supplemental Table S3). The top 500 probe sets selected by each method were queried against the Ingenuity Pathway Database for cancer genes. PADGE has the most cancer genes among different methods (143), followed by kurtosis and *t*-statistics (117 and 112, respectively). The number of cancer genes selected by COPA is about one-half of that selected by PADGE according to Ingenuity Pathway Tool (76 vs. 143, respectively; Supplemental Table S4). The difference is mainly due to the fact that PADGE compares both the magnitude and the variability between cancer and normal expression, whereas COPA only measures the variability of all samples as a whole. As a result, PADGE identifies heterogeneous expression patterns specific to the cancer sample group instead of those common to the whole sample population. For example, epidermal growth factor receptor, a drug target in lung cancer whose overexpression is associated with poor prognosis (6), was ranked much higher by PADGE than by COPA (352 vs. 1,944, respectively). On the other hand, olfactory receptor OR2H2 was ranked toward the bottom by PADGE (11,529) but has a very high COPA rank (263) because of the heterogeneity in both normal and cancer samples (Fig. 4).

## WEB TOOL

PADGE is implemented in the statistical language R, and a web application was developed using Perl common gateway interface (CGI). The PADGE application and its source code are freely accessible at http://www.cgl.ucsf.edu/Research/genentech/padge/, allowing researchers to perform interactive analysis of their own data sets. For comparative analysis, all sample statistical analyses (*t*-test or Wilcoxon rank sum test), the COPA approach (12), and kurtosis (11) are also implemented, and their results are displayed side by side with PADGE, providing a resource of web-based tools for analysis of heterogeneous expression patterns (Fig. 5).

#### Data upload.

PADGE requires users to upload two tab-delimited text files. One is the data file that contains expression values. The first column of the file lists gene names, and the first row lists sample names. The other is the sample file that contains sample classification. The first column lists sample names that match exactly with those in the expression file, and the second column has class labels, such as cancer and normal. PADGE identifies overexpression in subsets of cancer samples compared with normal samples.
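The two input files can be read with a few lines of Python, shown here as a sketch of the layout rather than of the web tool's own parser. We assume that the data file has a header row and that the sample file has none; the function names, sample names, gene name, and values are hypothetical.

```python
import csv
import io

def read_expression(handle):
    """Parse the data file: first row lists sample names, first column lists gene names."""
    reader = csv.reader(handle, delimiter="\t")
    samples = next(reader)[1:]
    expression = {}
    for row in reader:
        if row:
            expression[row[0]] = dict(zip(samples, map(float, row[1:])))
    return samples, expression

def read_classes(handle):
    """Parse the sample file: column 1 is the sample name, column 2 the class label."""
    return {row[0]: row[1] for row in csv.reader(handle, delimiter="\t") if len(row) >= 2}

def check_samples(samples, classes):
    """Sample names in the two files must match exactly."""
    missing = [s for s in samples if s not in classes]
    if missing:
        raise ValueError(f"samples without a class label: {missing}")

def split_by_class(values_by_sample, classes, label):
    """Expression values of one gene for the samples carrying the given class label."""
    return [v for s, v in values_by_sample.items() if classes[s] == label]

# Hypothetical file contents, held in memory for the example.
data_file = io.StringIO("gene\tS1\tS2\tS3\tS4\nGENE_A\t10.0\t12.0\t11.0\t90.0\n")
sample_file = io.StringIO("S1\tnormal\nS2\tnormal\nS3\tcancer\nS4\tcancer\n")

samples, expression = read_expression(data_file)
classes = read_classes(sample_file)
check_samples(samples, classes)
cancer_values = split_by_class(expression["GENE_A"], classes, "cancer")
```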

#### Analysis methods and parameters.

A set of parameters can be chosen by users to execute PADGE (Fig. 5). Users can define a series of percentile cutoffs to stratify sample groups for subset statistical tests. A minimum sample size of 5 is required for subset statistical tests. These percentile cutoffs are used as the *x*-axis intervals in the PADGE plots. One can choose to perform *t*-test or Wilcoxon rank sum test for sample subset comparisons. To filter gene candidates, users can select a threshold on the minimum *q* value obtained from subset statistical tests and/or a threshold on the maximum ratio between sample groups across percentile cutoffs. For visualization, users can specify the number of top ranking genes (ranked by the summary score, defined in METHODS) for which a PADGE plot will be generated. Besides PADGE, users can choose to perform other methods such as *t*-test, Wilcoxon test, COPA, and kurtosis to analyze their data.
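One possible reading of these filters is sketched below: a gene is kept if its smallest subset *q* value is at or below a chosen threshold and its largest cancer-to-normal ratio is at or above another. This reading, the function names, and the default thresholds (0.05 and 2.0) are assumptions; only the minimum subset size of 5 comes from the description above.

```python
MIN_SUBSET_SIZE = 5  # minimum sample size required for a subset statistical test

def subset_is_testable(cancer_subset, normal_subset, min_size=MIN_SUBSET_SIZE):
    """A cutoff is usable only if both subsets retain at least min_size samples."""
    return len(cancer_subset) >= min_size and len(normal_subset) >= min_size

def passes_filters(q_values, ratios, max_q=0.05, min_ratio=2.0):
    """Keep a gene if its best subset q value and its largest ratio clear the thresholds."""
    return min(q_values) <= max_q and max(ratios) >= min_ratio

# Made-up per-cutoff results for two genes.
keep_a = passes_filters(q_values=[0.40, 0.03, 0.001], ratios=[1.1, 1.8, 4.2])
keep_b = passes_filters(q_values=[0.40, 0.30, 0.20], ratios=[1.0, 1.1, 1.2])
```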

#### Analysis results.

Depending on the size of the data set, the analysis may take a while to finish. We provide automatic e-mail service to inform users about the analysis status. After the analysis is finished, users will be notified by e-mail with a unique URL linking to the result page. A link for “View result” will also appear in the interactive web session. Analysis results are presented in a tabular form with each entry linked to its PADGE plot when applicable. Users may sort the result table by a given attribute by clicking on the corresponding column name. Each gene is linked to the National Center for Biotechnology Information (NCBI) Entrez Gene page for detailed annotation. The result table is segmented into consecutive pages for interactive navigation and is also available for bulk download. Results of an example data set are available at the website (Fig. 5).

#### Conclusion.

To address heterogeneous expression patterns of biological samples, we devised the PADGE tool. PADGE examines overexpression in subsets of one sample group compared with a reference sample group. From simulation studies, PADGE outperforms multiple representative statistics designed to compare sample means when a small subset of samples show differential expression. PADGE also outperforms alternative methods developed to identify heterogeneous differential expression, such as COPA and kurtosis. We examined the utility of PADGE in analyzing cancer gene expression. In contrast to other approaches that seek to address tumor heterogeneity, PADGE uses a pairwise comparison approach that takes into consideration the variability present in both cancer and normal samples. The results can be visualized using percentile plots to examine how gene expression in cancer samples changes compared with that of normal samples at different percentiles of expression intensity. A web application is provided to researchers to access PADGE and alternative approaches for analyzing their own data sets. PADGE serves as a useful addition to the tools available for mining gene expression data sets and for identifying candidates for cancer therapeutics and diagnostics.

## Acknowledgments

We thank Tom Wu for insightful discussion, Xiaolong Yu for suggestions and help with statistical analysis, Kenneth Jung for implementation of CyberT, and William Wood and Zemin Zhang for critical review of the manuscript.

## Footnotes

Address for reprint requests and other correspondence: Z. Tang, Dept. of Bioinformatics, Genentech Inc., 1 DNA Way, South San Francisco, CA 94080 (e-mail: jtang{at}gene.com).

Article published online before print. See web site for date of publication (http://physiolgenomics.physiology.org).

- Copyright © 2007 the American Physiological Society