## Abstract

Since the introduction of serial analysis of gene expression (SAGE) as a method to quantitatively analyze the differential expression of genes, several statistical tests have been published for the pairwise comparison of SAGE libraries. Testing the difference between the number of specific tags found in two SAGE libraries is hampered by the fact that each SAGE library is only one measurement: the necessary information on biological variation or experimental precision is not available. In the currently available tests, a measure of this variance is obtained from simulation or based on the properties of the tag distribution. To help the user of SAGE to decide between these tests, five different pairwise tests have been compared by determining the critical values, that is, the lowest number of tags that, given an observed number of tags in one library, needs to be found in the other library to result in a significant *P* value. The five tests included in this comparison are SAGE300, the tests described by Madden et al. (*Oncogene* 15: 1079–1085, 1997) and by Audic and Claverie (*Genome Res* 7: 986–995, 1997), Fisher’s Exact test, and the *Z* test, which is equivalent to the chi-squared test. The comparison showed that, for SAGE libraries of equal as well as different size, SAGE300, Fisher’s Exact test, *Z* test, and the Audic and Claverie test have critical values within 1.5% of each other. This indicates that these four tests will give essentially the same results when applied to SAGE libraries. The Madden test, which can only be used for libraries of similar size, is, with 25% higher critical values, more conservative, probably because the variance measure in its test statistic is not appropriate for hypothesis testing. The consequences for the choice of SAGE library sizes are discussed.

- critical values
- hypothesis test
- two-sided test
- library size
- power
- serial analysis of gene expression

Serial analysis of gene expression (SAGE; 17) was introduced as a method to quantitatively analyze the differential expression of genes. The method has since been applied successfully to cells and tissues obtained from different developmental stages or resulting from a variety of pathological processes. The SAGE procedure results in a library of short tags, each representing an expressed gene. The main assumption in the interpretation of the data in this library is that every mRNA copy in the tissue has the same chance of ending up as a tag in the library. This selection of a specific tag sequence from the total pool of transcripts can be well approximated as sampling with replacement (15).

The aim of most SAGE studies is to identify genes of interest by comparing the number of specific tags found in two different SAGE libraries. In statistical terms, the aim is to reject the null hypothesis that the observed tag counts in both libraries are equal. Testing of this hypothesis is hampered by the fact that each SAGE library is only one measurement: the necessary information on biological variation and experimental precision is not available. Therefore, each of the published statistical tests for comparing SAGE libraries is based on its own assumptions about the statistical distribution of SAGE tags from which a measure of variance is obtained.

In comparing two SAGE libraries, a large number of pairwise tests, one for each specific tag, is performed. It is possible that most pairwise differences between two libraries are just the result of random sampling from two populations that do not differ. Therefore, before starting a pairwise comparison of specific tags in two libraries, the null hypothesis that the differences between libraries result from such a random sampling has to be rejected. A similar line of reasoning is applied in the comparison of the means of more than two groups: before a multiple comparison of groups can be carried out, an overall analysis of variance has to reject the null hypothesis that all groups originate from the same population (2). In the context of SAGE research, only one reference to such an overall test has been published (14). This overall test is based on a simulation of a large number of possible distributions of two libraries within the pooled marginal totals of the observed SAGE libraries. By calculating a chi-squared statistic for each simulated pair of libraries, a distribution of this statistic under the null hypothesis can be constructed. From this simulated distribution and the chi-squared statistic of the observed libraries, one can determine the probability of obtaining the observed tag distributions by chance. Rejection of the null hypothesis that all differences between SAGE libraries are just the result of random sampling then opens the way for pairwise comparisons.
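The overall approach can be sketched as follows (a minimal illustration of the simulation strategy described above, not a reproduction of the procedure in Ref. 14; the function name, the tiny example libraries, and the fixed random seed are choices of this sketch): pooled tags are reshuffled over the two libraries while both marginal totals stay fixed, and the observed chi-squared statistic is compared with its simulated null distribution.

```python
import random

def overall_chi2_test(lib1, lib2, n_sim=1000, seed=1):
    """Monte Carlo overall test: keep the marginal totals of both
    libraries fixed, reassign the pooled tags to the two libraries at
    random, and estimate the chance of a chi-squared statistic at
    least as large as the observed one."""
    tags = sorted(set(lib1) | set(lib2))
    c1 = [lib1.get(t, 0) for t in tags]
    c2 = [lib2.get(t, 0) for t in tags]
    N1, N2 = sum(c1), sum(c2)

    def chi2(a, b):
        s = 0.0
        for x, y in zip(a, b):
            e1 = (x + y) * N1 / (N1 + N2)  # expected count, library 1
            e2 = (x + y) - e1              # expected count, library 2
            s += (x - e1) ** 2 / e1 + (y - e2) ** 2 / e2
        return s

    observed = chi2(c1, c2)
    rng = random.Random(seed)
    # one pool entry per sequenced tag, labeled with its tag index
    pool = [i for i, n in enumerate(c1) for _ in range(n)]
    pool += [i for i, n in enumerate(c2) for _ in range(n)]
    hits = 0
    for _ in range(n_sim):
        rng.shuffle(pool)
        sim1 = [0] * len(tags)
        for i in pool[:N1]:                # first N1 tags form library 1
            sim1[i] += 1
        sim2 = [c1[i] + c2[i] - sim1[i] for i in range(len(tags))]
        if chi2(sim1, sim2) >= observed:
            hits += 1
    return hits / n_sim                    # simulated P value
```

A simulated *P* value below the significance level rejects the overall null hypothesis of random sampling and opens the way for pairwise comparisons.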

In the seminal paper of Velculescu et al. (17), tag numbers in different libraries are compared pairwise with a test based on a Monte Carlo simulation of tag counts. This test has been included in the SAGE software package SAGE300 (19). For each pairwise comparison of tags, SAGE300 estimates the chance of obtaining a difference in tag counts equal to or greater than the observed difference from the number of trials it takes to simulate this difference 100 times. The resulting chance serves as the *P* value in a one-sided test.

In other papers dealing with SAGE, several pairwise test procedures have been proposed. Most of these tests have been incorporated into public database systems and analysis programs (5, 8, 10, 11, 13, 15). The test suggested by Madden et al. (11) is based on only the number of observed specific tags in each SAGE library, and the calculated statistic (Table 1) is compared with the normal distribution. Audic and Claverie (3) derived a new equation (Table 1) for the probability, *P*(*n*_{2}|*n*_{1}), of finding *n*_{2} tags in one library given the fact that *n*_{1} tags have already been observed in the other library. The sum ∑*P*(*n*_{2}|*n*_{1}) of this probability for *n*_{2} or more tags then serves as a one-sided test. The test proposed by Kal and coworkers (7) focuses on the proportions of specific tags in each library. Since these proportions can be approximated to result from sampling with replacement, the probability of the resulting tag counts follows a binomial distribution (15). The proposed test is therefore based on the normal approximation of the binomial distribution (*Z* test; 7). The test statistic *Z* is calculated as the observed difference between proportions of specific tags in both libraries divided by the standard error of this difference when the null hypothesis is true (Table 1). This *Z* statistic is approximately normally distributed and can be compared with the critical *Z* value for the two-sided significance level α (2).
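A minimal sketch of the *Z* statistic described above (the function name is illustrative; the pooled-proportion standard error follows the two-proportion test as in Table 1):

```python
import math

def sage_z(n1, N1, n2, N2):
    """Z statistic comparing the proportions n1/N1 and n2/N2 of a
    specific tag in two SAGE libraries. Under H0 both libraries share
    the pooled proportion p0, which sets the standard error of the
    difference between the proportions."""
    p1, p2 = n1 / N1, n2 / N2
    p0 = (n1 + n2) / (N1 + N2)                         # pooled under H0
    se = math.sqrt(p0 * (1 - p0) * (1 / N1 + 1 / N2))  # SE of p1 - p2
    return (p1 - p2) / se

# two-sided test at alpha = 0.001: reject H0 when |Z| > z_{alpha/2} = 3.29
z = sage_z(30, 10000, 80, 10000)
reject = abs(z) > 3.2905
```

The sign of *Z* indicates the direction of the difference; for a two-sided decision only its absolute value is compared with the critical value.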

The Fisher’s Exact test has been proposed by the Cancer Genome Anatomy Project for comparison of specific tags between SAGE libraries (3, 12). Also, the chi-squared test has been used (14). Both tests are based on reorganizing the data per tag in a 2 × 2 contingency table (rows: specific and other tags; columns: *library 1* and *2*). Fisher’s Exact test calculates the pooled probability of obtaining the observed table and all tables with a more extreme difference within the row and column totals (2). The use of Fisher’s Exact test is controversial because the requirement that the row and column totals must be fixed and known prior to the experiment (5, 6) leads to conservative *P* values (1). However, the test can be used in situations such as SAGE, where the marginal totals are not naturally fixed, because the use of marginal totals is ancillary and does not lead to loss of information on the null hypothesis (1). Therefore, the Fisher’s Exact test is included in the comparison of tests. For the large numbers of tags involved in SAGE, the chi-squared test is the preferred test for 2 × 2 contingency tables (2, 6). The chi-squared test is, however, not included in the current comparison because a chi-squared test on 2 × 2 tables is statistically equivalent to the *Z* test on two proportions (2, 12) and, therefore, gives exactly the same results as the *Z* test. Two approaches based on Bayesian statistics (4, 10) have been described to calculate the probability that the level of expression of a given mRNA is increased by at least *x*-fold between libraries. Although these procedures can be used to statistically judge differences in tag numbers, their approach is clearly different from the classic approach of hypothesis testing, and results of both test procedures cannot be directly compared.
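The 2 × 2 construction can be made concrete with a self-contained sketch of the two-sided Fisher's Exact test (written out with exact hypergeometric probabilities rather than a statistics library; the function name and tolerance factor are choices of this sketch):

```python
from math import comb

def fisher_exact(n1, N1, n2, N2):
    """Two-sided Fisher's Exact test on the 2 x 2 table
    [[n1, N1 - n1], [n2, N2 - n2]] (specific vs. other tags per
    library). Conditional on the margins, the specific-tag count of
    library 1 is hypergeometric; the two-sided P value sums the
    probabilities of all tables at most as probable as the observed one."""
    K, N = n1 + n2, N1 + N2        # specific-tag column total, grand total
    denom = comb(N, N1)
    lo, hi = max(0, K - N2), min(K, N1)
    probs = {k: comb(K, k) * comb(N - K, N1 - k) / denom
             for k in range(lo, hi + 1)}
    p_obs = probs[n1]
    # small tolerance guards against float ties when comparing probabilities
    return sum(p for p in probs.values() if p <= p_obs * (1 + 1e-9))

p_value = fisher_exact(2, 1000, 40, 1000)  # reject H0 when p_value < alpha
```

Summing only one tail instead gives the one-sided probability that, as described below, is compared with α/2.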

Recently the chi-squared test, Fisher’s Exact test, and the Audic and Claverie test were compared with respect to their power and robustness (12). The Madden test and SAGE300 were not included in this comparison, nor was there a comparison of the differences that are needed to lead to a statistically significant result. The latter hampers the comparison of test results in different papers. Therefore, and to further help the user of SAGE to decide between the available tests, the present review compares the critical values of five tests (excluding the chi-squared test). Critical values, sometimes called “first significant values” (3), are defined as the highest or lowest number of tags that, given an observed number of tags in one library, needs to be found in the other library to result in a *P* value below the significance level when the pairwise test is carried out.

Table 1 lists the five tests for pairwise testing of SAGE libraries that have been compared. It also gives the test statistic and the decision rule of each test. For details on the statistical basis of each of these tests, the reader is referred to the original papers. For all tests the null hypothesis (H_{0}) is that there is no difference in tag numbers between the two libraries. The five tests were compared by determining their critical values for a significance level of 0.001. Such a low significance level was chosen to safeguard against accumulation of type I error. The use of a significance level of 0.001 is equivalent to an overall significance level of 0.05 and a Bonferroni correction to allow for 50 hypothesis tests (2).

In this review only the upper critical values are considered. Critical values were determined by taking a fixed tag count in the first library and subsequently performing the statistical test for an increasing number of tags in the second library until the resulting *P* value leads to rejection of the null hypothesis at the required level of significance. Since the Monte Carlo-based test of SAGE300 does not give the same *P* value every time the same input is tested, for each input the test was run six times and the mean *P* value was used. Such an average *P* value based on three trials is also given by SAGE300 in its “analyze”-“entire project” option.
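This search procedure can be sketched for the *Z* test (the *Z* statistic is restated so the sketch is self-contained; the starting point at the expected count under H_{0} and the function names are choices of this sketch):

```python
import math

Z_CRIT = 3.2905   # z_{alpha/2} for a two-sided test at alpha = 0.001

def z_stat(n1, N1, n2, N2):
    """Two-proportion Z statistic with pooled standard error."""
    p1, p2 = n1 / N1, n2 / N2
    p0 = (n1 + n2) / (N1 + N2)
    return (p1 - p2) / math.sqrt(p0 * (1 - p0) * (1 / N1 + 1 / N2))

def upper_critical_value(n1, N1, N2, z_crit=Z_CRIT):
    """Lowest n2 that, given n1 specific tags in library 1, makes the
    two-sided Z test significant: start at the count expected under H0
    and step n2 upward until |Z| exceeds the critical value."""
    n2 = n1 * N2 // N1
    while abs(z_stat(n1, N1, n2, N2)) <= z_crit:
        n2 += 1
    return n2
```

For the deterministic tests the stepping stops at a unique value; for the Monte Carlo *P* values of SAGE300, the same loop is run on the averaged *P* value.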

All critical values were determined for *1*) a total number of 10,000 tags in both SAGE libraries (*N*_{1} = *N*_{2} = 10,000) and *2*) a total of 10,000 tags in the first and 50,000 tags in the second library (*N*_{1} = 10,000; *N*_{2} = 50,000). The values for the number of specific tags observed in the first library (*n*_{1}) ranged from 1 to 100, effectively testing an abundance range of 0.0001 to 0.01. The critical values are the number of specific tags that have to be found in the second library (*n*_{2}) and are determined by systematic simulation of an increasing difference between the two libraries. It should be kept in mind that in most comparisons between specific tags in SAGE libraries, there is no a priori knowledge about the direction of the effect. Therefore, all pairwise tests have to be carried out as a two-sided test. To do this, the test statistic *Z* (7, 11) was compared with *Z*_{α/2}, whereas the one-sided *P* values of SAGE300 as well as the integrated probabilities of the Audic and Claverie test and of the Fisher’s Exact test were compared with α/2 (Table 1).

The upper critical values for a 0.001 level of significance for the *Z* test of Kal et al. (7) are given in Fig. 1 for two SAGE libraries of equal size (Fig. 1*A*; both 10,000 tags) and for two SAGE libraries of different size (Fig. 1*B*; 10,000 and 50,000 tags, respectively) as continuous lines. Note that for a larger SAGE library the confidence level of an observed tag count is higher (7). Therefore, with a large second SAGE library, smaller differences in proportions can be detected as statistically significant. For two libraries of the same size (*N*) and relatively low tag counts (*n*_{1} + *n*_{2} less than 1% of 2*N*) the test statistic *Z* of the *Z* test (Table 1) reduces to *Z* = (*n*_{1} − *n*_{2})/√(*n*_{1} + *n*_{2}). Thus, for low tag counts and two large libraries of the same size, the critical values of the *Z* test are independent of library size.

The critical values for the test of Madden et al. (11) for two libraries of the same size are plotted in Fig. 1*A*. Compared with the critical values of the *Z* test, the Madden test requires about 25% bigger differences to reach statistical significance and is, therefore, more conservative. Although the simple mathematics of this test (Table 1) make it very easy to use, its usefulness is limited by the fact that it does not include the total number of tags in the calculations and it can, therefore, only be used for SAGE libraries of the same, or very similar, size. The origin of the test statistic of this test is not given in the original paper (11), but when one considers tag counts to fit a Poisson distribution, the variance of a tag count can be estimated to be equal to this tag count (2). The denominator of the test statistic of the Madden test (Table 1) then contains the sum of the standard deviations of the tag counts *n*_{1} and *n*_{2}. Statistics as applied by Madden effectively test the hypothesis that the difference in tag counts is zero. Therefore, one can argue that a denominator containing the standard deviation of this difference, that is, the square root of the sum of the tag counts, might be more appropriate. Note that this results in the same equation as is derived in the previous section from the test statistic of the *Z* test. For large libraries of very similar size it gives the same critical values as the *Z* test.
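The two denominators can be put side by side in a small sketch (assuming, as argued above, Poisson variances equal to the tag counts; the function names are illustrative):

```python
import math

def madden_z(n1, n2):
    """Madden-style statistic: difference divided by the SUM of the
    two Poisson standard deviations, sqrt(n1) + sqrt(n2)."""
    return (n1 - n2) / (math.sqrt(n1) + math.sqrt(n2))

def difference_z(n1, n2):
    """Statistic using the standard deviation of the DIFFERENCE,
    sqrt(n1 + n2); for large libraries of equal size this matches the
    low-count limit of the Z test."""
    return (n1 - n2) / math.sqrt(n1 + n2)

# sqrt(n1) + sqrt(n2) >= sqrt(n1 + n2), so |madden_z| <= |difference_z|:
# the Madden statistic needs a larger difference to reach the same cutoff.
```

For example, a 30 vs. 70 tag count clears the 3.29 cutoff with the difference-based denominator but not with the Madden denominator, which illustrates the conservativeness noted above.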

The test of Audic and Claverie (3), the Fisher’s Exact test, and SAGE300 (19) all have critical values that are on average within 1.5% of those of the *Z* test for libraries of equal size (Fig. 1*A*). This equivalence of these four tests holds for tag counts as low as 1 tag per 10,000 in the first library. Only for libraries of different size and low specific tag counts does the *Z* test need slightly higher critical values (Fig. 1*B*). Also, for other levels of significance, the critical values of the *Z* test are almost the same as those published for the Audic and Claverie test (3). This comparison of tests shows that, apart from the test of Madden et al. (11), all tests perform with similar resolution in detecting differences between SAGE libraries. Also, except for the Madden test, all tests can handle SAGE libraries of equal as well as different size. Therefore, the tests published by Kal et al. (7), Audic and Claverie (3), and Zhang et al. (19), as well as the Fisher’s Exact test, will all give essentially the same test results when applied for pairwise comparison of SAGE libraries.

In addition, a recent paper by Man and coworkers (12) compared the chi-squared test, the test of Audic and Claverie (3), and the Fisher’s Exact test. This comparison was based on Monte Carlo simulations of SAGE libraries. The specificity, power, and robustness of the tests were determined for simulated SAGE libraries of various sizes and severalfold differences. This comparison showed that the chi-squared test consistently has a higher power and is more robust than the other tests, especially at low expression levels (<15 tags/50,000). Therefore, the chi-squared test, which is equivalent to the *Z* test, was concluded to be the preferred choice for evaluating SAGE experiments (12).

The normal approximation of the binomial distribution that forms the basis of the *Z* test can also be used to easily construct confidence intervals for the observed proportion of specific tags as well as for the difference in proportions between two SAGE libraries (7). This approximation also enables the determination of the statistical power of the comparison of two SAGE libraries and the calculation of the sample size needed to detect an expected difference, both of which are essential in the planning of future SAGE analyses. A similar decision about sample size can be reached with a Monte Carlo-based program that calculates the power of a test for a given difference and sample size (POWER_SAGE; 12). Figure 2 shows a rearrangement of the equation of the *Z* test in such a way that it can be used for the evaluation and planning of SAGE experiments. In this form this equation can be used in several ways.

*1*) Given *N*_{1} and *N*_{2} (the SAGE libraries are compiled), the critical values (Fig. 3*A*) or the detectable differences (Fig. 3*B*) can be calculated for a chosen significance level (α) and power (1 − β).

*2*) Given an observed difference, the total number of tags sequenced in both libraries, and the chosen significance level, the power of the test can be determined (Fig. 4).

*3*) Given an expected difference, a significance level, a power, and the number of tags already sequenced in an existing SAGE library (*N*_{1}), the number of tags that is needed in a new library (*N*_{2}) can be calculated (Fig. 5).

*4*) Given an expected difference, a chosen significance level, and a required power, the number of tags that is needed in each library (*N*_{1} = *N*_{2}) can be calculated (Fig. 6).
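Because the rearranged equation of Fig. 2 is not reproduced here, use *4* can be sketched with the standard two-proportion sample-size formula, which serves the same purpose (equal library sizes assumed; the function name is illustrative):

```python
import math
from statistics import NormalDist

def tags_needed(p1, p2, alpha=0.001, power=0.9):
    """Standard sample-size formula for comparing two proportions with
    equal library sizes (N1 = N2 = N): the N at which a two-sided Z
    test at level alpha detects a true change from p1 to p2 with the
    requested power."""
    nd = NormalDist()
    z_a = nd.inv_cdf(1 - alpha / 2)   # two-sided critical value
    z_b = nd.inv_cdf(power)           # quantile for the requested power
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return math.ceil((z_a + z_b) ** 2 * variance / (p1 - p2) ** 2)

# e.g., tags per library needed to see a 10-fold change for a
# transcript observed at 10 copies per 10,000 tags:
n = tags_needed(0.001, 0.01)
```

As expected, the required library size grows as the difference to be detected shrinks or the requested power rises.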

The nomogram in Fig. 3*A* can be used to quickly evaluate the differences between two SAGE libraries from the laboratory or the literature. To be significantly different (at α = 0.001) from the first library, the number of specific tags found in the second library should be above or below the appropriate pair of lines. For example, when 30 specific tags are found in the first library of 50,000 tags, a second library of 100,000 tags should yield below 20 or above 117 specific tags to be significantly repressed or upregulated, respectively. The graph of the detectable differences (Fig. 3*B*) can help the reader to plan a SAGE experiment. Consider a SAGE user who already has (from the laboratory or the literature) the information on the tag counts in a library of 50,000 tags (*N*_{1}) and plans the assembly of a second library. It can be seen from Fig. 3*B* that with an increasing number of tags in this new library (*N*_{2}), smaller differences can be detected as significant. However, whereas upgrading *N*_{2} from 20,000 to 50,000 tags still gives a substantial increase in resolution, the sequencing of another 50,000 tags (*N*_{2} = 100,000) does not seem to pay off statistically. The chances of picking up very low-abundance transcripts, however, keep increasing with library size.

The power of a performed test tells the user how big the chance is that a real difference has been overlooked, or, in statistical terms, that a false null hypothesis is not rejected. The effect of the differences between libraries on the power of the statistical comparison of these libraries is illustrated in Fig. 4. Figure 4 shows this power as a function of the difference between a first library with 50 specific tags per 50,000 tags and second libraries of various sizes and with different numbers of specific tags. Note that the power is at its lowest when the differences in abundance are low. From this graph it can be read that when the abundance increases 1.5 times, the maximum power of the significance test will only be about 0.25: even when a second library of 100,000 tags is generated, a real 1.5-fold increase would be missed 75% of the time. To reach an acceptable power of 0.9, at least 190 tags per 100,000 should be observed. A smaller library requires relatively larger differences: at least 40 specific tags have to be observed in a library of 10,000 tags to reach the same power.
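A sketch of such a power calculation from the normal approximation (a standard two-proportion power formula, not the exact equation of Fig. 2; the function name is illustrative):

```python
import math
from statistics import NormalDist

def power_z_test(n1, N1, n2, N2, alpha=0.001):
    """Approximate power of the two-sided Z test when the true
    proportions are p1 = n1/N1 and p2 = n2/N2: the chance that |Z|
    exceeds its critical value given this true difference."""
    nd = NormalDist()
    p1, p2 = n1 / N1, n2 / N2
    p0 = (n1 + n2) / (N1 + N2)
    se0 = math.sqrt(p0 * (1 - p0) * (1 / N1 + 1 / N2))        # SE under H0
    se1 = math.sqrt(p1 * (1 - p1) / N1 + p2 * (1 - p2) / N2)  # true SE
    z_a = nd.inv_cdf(1 - alpha / 2)
    d = abs(p1 - p2)
    # chance of crossing either critical boundary under the alternative
    return nd.cdf((d - z_a * se0) / se1) + nd.cdf((-d - z_a * se0) / se1)

# 1.5-fold increase from 50/50,000 with a second library of 100,000 tags
p = power_z_test(50, 50000, 150, 100000)
```

For the 1.5-fold example above, this approximation gives a low power of the same order as the value read from Fig. 4.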

Instead of looking at detectable differences and power, one can also calculate the number of tags (*N*_{2}) needed to detect a 2- to 20-fold difference between the new library and a library known from previous work or the literature (6). The number of tags needed to observe an *x*-fold difference as significant increases exponentially with decreasing abundance of the transcripts in the first library (Fig. 5, *A* and *B*, *x*-axis) and with decreasing difference between conditions (Fig. 5, separate lines), making the detection of small differences for low-abundance transcripts a practical impossibility. When the number of tags in the first library is low, differences for the low-abundance transcripts may never be detectable. Because the standard error of a proportion is a function of the proportion and the library size [SE = √(*p*(1 − *p*)/*N*); Ref. 2], a small difference may never exceed the critical value. In such a case one also has to increase the size of the first library. A comparison of Fig. 5 with Fig. 6 shows that, when no prior knowledge on transcript abundance is available, the most efficient way to set up a SAGE study is to compile two SAGE libraries of equal size. For example, detecting a 10-fold difference for a gene that occurs 10 times in a library of 10,000 tags would take a second library of at least 50,000 tags (Fig. 5*A*), whereas two new libraries of both 14,000 tags would be sufficient (Fig. 6).

Other tests for pairwise comparison of SAGE libraries may be proposed in the future. The usefulness of such tests will be limited by the fact that each SAGE library, no matter how large, represents only one experimental measurement. Consequently, one has no information about the biological variation and the precision of the observed tag counts. Such a measure of experimental variance is crucial for hypothesis testing. In the currently available tests, this measure of variance is obtained from simulation (19) or based on the putative properties of the tag distribution (3, 7, 11). The test results will be dependent on the validity of these assumptions. However, the above comparison shows that the test results of SAGE300, Fisher’s Exact test, the *Z* test, and the Audic and Claverie test differ only marginally. Additional tests will, therefore, only be a meaningful addition to SAGE statistics when these issues of experimental variance and accuracy are addressed. The modeling of sampling error, sequencing error, and other aspects of SAGE experiments (15) may well play a role in the development of such hypothesis tests and the calculation of more accurate *P* values.

When only *P* values are published, it should be noted that SAGE300 and the Audic and Claverie test, as well as the conversion from the *Z* statistic to a *P* value for the Kal test and the Madden test, will result in a one-sided *P* value. The authors should be aware of this and should mention whether a one-sided or a two-sided *P* value is tabulated (see, for instance, Ref. 8). However, since in SAGE experiments no a priori knowledge about the direction of the effects is available, the publication of two-sided *P* values would be the most appropriate and should be encouraged. This would enable the direct comparison of published *P* values with the required level of significance and simplify the comparison of different papers on the same tissues. However, the significance of the *P* value of the observed difference between tag counts should not be overemphasized: the rank order of the *P* values may well be all the information the reader needs to pinpoint important genes and to plan future research.

## Acknowledgments

We thank Drs. Arnoud Kal, Henk Tabak, and Patrick Bossuyt for help in locating the different statistical tests and Drs. Wout Lamers and Antoon Moorman for critical comments on the manuscript.

SAGE300 is available from **http://www.sagenet.org**. The test of Audic and Claverie (3) is available from **http://igs-server.cnrs-mrs.fr/~audic/significance.html**. SAGEstat, for the application of the *Z* test (7) as well as the calculation of critical values and the number of tags needed to detect an assumed difference, is available on request (E-mail: bioinfo{at}amc.uva.nl; subject, SAGEstat). An R (S-plus) implementation of SAGEstat, with the possibility to compare public domain SAGE libraries and to plot graphs of the required number of SAGE tags is incorporated in USAGE (16), which can be reached at **http://www.cmbi.kun.nl/usage/**. Another program that will calculate the number of required tags and perform a chi-squared test between SAGE libraries is POWER_SAGE (E-mail: michael.man{at}pfizer.com; Ref. 12), which is based on Monte Carlo simulations.

## Footnotes

Article published online before print. See web site for date of publication (http://physiolgenomics.physiology.org).

Address for reprint requests and other correspondence: J. M. Ruijter, Dept. of Anatomy and Embryology, Academic Medical Center, Meibergdreef 15, K2-283, 1105 AZ Amsterdam, the Netherlands (E-mail: j.m.ruijter{at}amc.uva.nl).

10.1152/physiolgenomics.00042.2002.

- Copyright © 2002 the American Physiological Society