## Abstract

Gene expression analysis using high-throughput microarray technology has become a powerful approach to study systems biology. The exponential growth in microarray experiments has spawned a number of investigations into the reliability and reproducibility of this type of data. However, the sample size requirements necessary to obtain statistically significant results has not had as much attention. We report here statistical methods for the determination of the sufficient number of subjects necessary to minimize the false discovery rate while maintaining high power to detect differentially expressed genes. Two experimental designs were considered: *1*) a comparison between two groups at a single time point, and *2*) a comparison of two experimental groups with sequential time points. Computer programs are available for the methods discussed in this paper and are adaptable to more complicated situations.

- gene expression
- statistical analysis
- functional genomics

high-throughput microarray technologies have created unparalleled opportunities to study the mechanism of disease, monitor disease progression, and evaluate effective therapies. Because tens of thousands of genes are investigated simultaneously with a technology that has not been perfected, there is a large uncertainty in the decision process. Therefore, one must rely on statistical analyses of the data to draw conclusions from microarray experiments. A number of methods have been developed to extract useful and robust information from microarray data sets once the hybridizations are complete. However, not as much attention has been paid to the experimental design before the investigator actually completes the experiment.

One of the principal questions while designing a microarray experiment is: What is the sample size required for a microarray experiment to achieve a user defined statistical threshold? To achieve this goal, one must decide how the data will be analyzed and what conclusions can be made. Since microarrays can be used to answer a variety of questions, different experiments will require different designs and analyses. In this study, we will concentrate on finding expression differences between two treatment groups. The goals of the experiments can be classified into three categories: *1*) comparing two groups at a single time point, *2*) examining the treatment effect over time when the without treatment effect is known, and *3*) comparing two groups at sequential time points. We will investigate the sample size requirements necessary to satisfy a specified false discovery rate (FDR) and power (1 − β) given various biological and experimental variation.

Although there are other reports on how to compute the sample size, none of them is as simple and straightforward as the methods presented here. In our experience, we often observe a large number of genes that show expression differences in a typical microarray analysis. The key is that we want to make sure that only a small number in these selected genes are false positives (discoveries). A brief survey of some of the other methods is in the discussion.

## METHODS

cDNA microarrays are printed on glass microscope slides using established protocols (15). The analysis of microarray data can be separated into several stages. The first stage is to eliminate spots with very weak signal, which we define as the signal that is indistinguishable from the background noise. Yang et al. (17) demonstrated that any signal less than 3 standard deviations of the background noise should be discarded. Due to the variation in printing material on the cDNA arrays, the expression is more accurately measured by the ratio of sample intensity vs. intensity from a common reference (5, 12). To derive the ratio, the intensities from both channels have to be normalized. Although there are a number of methods to normalize the two channels, Yang et al. (17) determined that there was actually a very small difference among the different normalization methods (10) when spots with sufficient signal are chosen. In this study, we assume that the analysis will be done after flagging for weak spots, log-transformation, and normalization. Given these assumptions, the useful data is now *y*_{ijtk}, where *i* is the treatment index, *j* is the subject index, *t* is the time index, and *k* is the replicate index. A few assumptions have to be made in order to obtain the analytic results for sample size determination. The most general model in this study is (1) where μ_{it} is the mean expression level under the *i*th treatment at *time t*, δ_{ijt} represents the subject variation, and ε_{ijtk} is the experimental variation. Throughout this paper, μ_{it} is a fixed constant (or fixed effect), and δ_{ijt} is a random sample from subjects under treatment *i*, or it is a random effect. Some of the indices can also be eliminated when they are not relevant. For example, when the data is not measured over time, the index *t* is eliminated, and when there is only one treatment, the index *i* is omitted.

### Adjustment of the threshold significance level.

As in the usual hypothesis testing, a significance level α is assigned. When there is only one hypothesis, this guarantees that the chance of incorrectly rejecting the null hypothesis is α. When many hypotheses are tested simultaneously, α cannot be considered as the significance level for rejecting all the true null hypotheses. The level for each individual test has to be adjusted. One option for adjusting the test is to use the Bonferroni inequality, i.e., to set every test at the new significant level α/*n*_{g}, where *n*_{g} is the number of hypotheses tested simultaneously. We use the notation *n*_{g} because this number is the number of genes in the array. However, when *n*_{g} is very large, the Bonferroni adjustment is too conservative because it is meant to control family-wise error rate (1), i.e., to control the probability of one type I error among thousands of hypotheses. In microarray experiments, it is important to find the best compromise between maximizing the power of detection and minimizing false-positive identification. Very often, although not always, additional follow-up research such as database search and real-time PCR will be done to further eliminate false-positive results (6). Thus the FDR defined as the expected proportion of wrong discovery (1) fits this purpose better. At the sample size determination stage, we need to know approximately the minimum number of genes that will be accepted under a given selection rule. An experimenter may know, from past experience, that at least *n*_{0} genes will be accepted, then the adjusted significance level for each gene is (2) to guarantee an α FDR. The derivation is given in the appendix i. The value *n*_{0} has to be determined by experience or a pilot experiment. In our experience, there are always a considerable number of genes that will show different expressions. If there is no way to determine *n*_{0}, then the experimenter may let *n*_{0} = 0 and determine the most conservative (largest) sample size. Sometimes it may be necessary to do a pilot study to determine *n*_{0} and other variance components in the model. For our laboratory, it is common practice to do a pilot study to get an idea of these parameters and then determine the sample size for the required power.

At the data analysis stage, other methods such as permutation tests (4, 16) may be used, but their power are not as easy to compute at the design stage.

### Single time point comparison between two treatment groups.

For the case of a comparing between treatments at one time point, one can drop the *t* index so *Eq. 1* becomes (3) where *i* = 1, 2; *j* = 1, …, *n*_{i}; *k* = 1, …, *m*; δ_{ij} is the between subjects variation, and ε_{ijk} is experimental variation. When *m* > 1, replicates are used to reduce the intermicroarray variation under the same biological condition. We assume that both of δ_{ij} and ε_{ijk} are normally distributed with zero mean and standard deviation (SD) ς_{s} and ς_{e}, respectively.

The hypotheses are (4) for each gene. After the data has been collected, the mean and variance for each gene is defined by (5) (6) and the decision rule is to accept H_{1} at significance level α′ when (7) where α′ is defined by *Eq. 2*, ν = *n*_{1} + *n*_{2} − 2, and *t*_{ν,α′} is the upper α′ quantile of the *t* distribution with ν degrees of freedom. Other types of tests such as unequal variance *t*-test or a Mann-Whitney test can also be used, but the results should be very similar (18). To determine the sample size, we need to specify the power (1 − β) at a predetermined level of significant gene expression difference |μ_{1} − μ_{2}|. In traditional statistics, what we need is the value (8) and the variance ratio (9) It is often difficult to determine what Δ constitutes a significant expression difference. We believe that the overlap concept can make the interpretation of the alternative easier. When the alternative hypothesis is true, we have |μ_{1} − μ_{2}| > 0. Naturally, any expression closer to μ_{1} will be classified to *treatment 1* and vice versa. Let (10) The term *w* is the proportion of subjects with expression that appears to belong to the other treatment group. Under the model in *Eq. 3*, the overlapping proportion is symmetric, i.e., the proportions that violate the treatment assignment are the same in both treatments. The relation between the traditional Δ in *Eq. 8* and *w* in *Eq. 10* is (11) where Φ(*x*) is the standard normal distribution function. Although Δ and *w* are equivalent and can be easily converted from one to another, it is our opinion that the idea of overlapping between groups has an easier interpretation. A historical survey and recent developments in using overlapping for sample size determination have been published (3, 8, 14). Hwang et al. (9) has also used the same idea in microarray analysis. To determine the sample size based on *Eqs. 7* and *11*, we use the noncentral F distribution (see appendix ii).

### Serial time points with one treatment group.

This experimental design will measure gene changes under one treatment over time, while the gene expression under no treatment condition is known. Let the time points be 1, 2, …, *T*. Since there is only one treatment, the index *i* can be dropped, and *Eq. 1* becomes (12) where δ_{jt} represents the subject variation. Under the null hypothesis, when there is no treatment, we should have i.e., gene expression is not affected by the treatment. The test for this hypothesis would be a two-way analysis of variance with time and subject as two discrete factors with the subjects being a random effect. The alternative is that H_{0} is not true, i.e., gene expression has been changed. To determine the sample size, we need to specify the pattern of {μ_{t}} under H_{1}. Moreover, we need to know how this pattern varies among subjects. We use a simple linear expression by letting (13) where γ = 0 under H_{0} (γ ≠ 0 under H_{1}), and the δ_{j} terms represent the variation of the subjects response, which are assumed to be independent and identically distributed normal random variables with 0 mean and variance ς_{γ}^{2}(*N*(0,ς_{γ}^{2})) and ε_{jtk} ∼ *N*(0,ς_{e}^{2}). The reason for this alternative model is that with a linear time effect, the slope usually changes with the subjects. However, the linear relationship is not crucial, because when time is considered discrete, *Eq. 13* can be a good approximation to a quadratic change. For example, the pattern in Fig. 1, *left*, can be expressed linearly when time is reordered. We have tried to use specific alternatives such as a quadratic model (which covers the linear model) to increase the power of the test. However, the increase in power is very limited. One could also make the model more general by adding correlations between time points. But during the design stage, this correlation is difficult to ascertain. The random effect, δ_{j}, in the slope should be enough to rep resent the intra-subject correlation.

Again, we will use the overlap concept to specify the alternative. In the model in *Eq. 13*, suppose the alternative is γ > 0. (The case γ < 0 works almost identically.) We let *w* = the proportion of subjects (in the population) that show the opposite (γ + δ_{j} < 0) mRNA expression. The power can be computed by simulation using SAS PROC MIXED, i.e., we simulate the data a large number of times under the alternative and use the proportion of the times that H_{1} is accepted as the power estimate. Note that the power is not determined by *w* alone. It is also a function of *T*. Details are given in the results and appendix ii.

### Serial time points with two treatment groups.

This experiment will measure the gene expression changes over time from two treatment groups or from one treatment and one control. The model is now the full model in *Eq. 1* with the null hypothesis and the general alternative that H_{0} is not true. This model is based on two factors, time and treatment, and subjects are nested in the treatments. The experiment is to test the treatment effect or the treatment-time interaction. We will not consider the treatment-time interaction in the design stage. To specify an alternative for the sample size determination, more parameters are involved. We follow the linear approximation in *Eq. 13* with (14) where the δ_{ij} terms share the same assumptions as those in *Eq. 13*. Again we can define the alternative by using the overlap idea with *w* being equal to the proportion of subjects that have the opposite slope (γ + δ_{ij} < 0) in mRNA expression when it belongs to the treatment group with γ > 0. This model can be expanded to include other alternatives such as the change of intercept. All of these modifications can be easily embedded into the SAS program mentioned in appendix iii.

The details of power computation are again given in appendix ii.

## RESULTS

### For a single time point.

Figure 2 presents the sample size requirements for a microarray experiment with 5,000 genes (*n*_{g}) assayed with an FDR of 0.05 (α) and power = 0.85 when *n*_{0} = 0, 5, 20, and 50. The sample sizes for the two treatments are assumed to be equal (i.e., *n*_{1} = *n*_{2} = *n*) and the number of replicates equals 1 (*m*). The variance ratio (*r*) defined in *Eq. 9* varies from 0 to 1. For example, if *n*_{g} = 5,000, *r* = 0.25, and a 15% or less overlap between genes from the two treatments is considered significant, then 22 subjects per treatment group is required to reach a power of 0.85 (*n*_{0} = 0), but only 15 subjects if we expect at least 50 genes will show significance. However, Fig. 2 is used only as a demonstration of the change of sample size under various conditions. The software given in appendix iii will produce the required sample size under any significance level, power, and other parameter values.

### For serial time points.

Although Fig. 2 covers some useful experimental cases, it is more difficult to provide examples of general interest in serial time point measurements due to the large number of parameters. The power depends not only on the biological/experimental variations as well as the number of time points (*T*), but also the time unit of *T*. For example, for a fixed slope γ, the relative size of γ*T* to the experimental standard deviation plays a crucial role. In other words, seven points in 1 day will be different from seven points in a week with the same growth rate γ in *Eqs. 13* and *14*. Thus the variance ratio (*r*) alone is not sufficient to determine the sample size.

Figure 3 shows one example on the sample size required for detecting mRNA expression changes over time. We used the data from Yang et al. (17). In that study, we found a ς_{e} ≈ 0.40 for log_{2}-transformed ratios between the sample and reference. If one assumes *n*_{g} = 5,000, ς_{γ} = 0.5, and 5% or less overlap between the two conditions (effect size = 3.29), and a twofold change in expression at the end of the time series (γ*T* = 1), then the sample size curves for the one-sample and two-sample designs can be obtained. The Bonferroni correction (*n*_{0} = 0) and expected 1% of genes to be selected (*n*_{0} = 50) for replicate sizes *m* = 1, 2, and 4 are shown. The FDR is set at 0.05, and the power is set at 0.80.

Note that in this design, the end time point is fixed. Thus an increase in the number of time points can merely add points between the predefined start and end points. Since the highest difference happens at the end point, there is only a small increase in power by observing more than four time points. It is also quite intuitive that *T* can compensate for the number of replicates (*m*), because both of them increase the information for the treatment effect when the treatment is effective.

## DISCUSSION

This paper provides a reasonable way to estimate the sample size for microarray experiments. The analysis methods are chosen to be simple in order to make the unknown parameters as few as possible. More sophisticated methods may be used after the data has been collected, but based on our experience, different analysis methods usually produce similar results for the prominent gene effects.

The computer software is very easy to comprehend. Thus a user can easily modify the programs when other tests are more appropriate. For example, if treatment-time interactions are a main concern, then this term can be easily added to the SAS procedure. The alternative under other models, if known, can also be easily generated using the simulation program with simple modifications.

The sample size determination discussed in this paper depends on the expected number of genes *n*_{0} to be accepted after the experiment. As we mentioned before, if there is no way to determine this number, then the experimenter may choose to use the most conservative number *n*_{0} = 0. Optimal sample size depends on experience and the validity of certain assumptions. Of course, there is a certain possibility of failure that an experimenter must face. Even to determine the sample size for the simplest two-sample *t*-test, we must assume that the two treatments have the same variance. There is no guarantee on this before the experiment is done. Similarly, in most microarray experiments, a large *n*_{0} can be expected. If the result shows the assumed *n*_{0} is too large, then we must admit our misjudgment and may choose to do more experiments to increase the power. This strategy is usually more economical than choosing a large sample size by always assuming the worst scenario.

Zien at al. (18) have presented a method similar to ours. In addition to the *t*-test, they also mentioned the Mann-Whitney nonparametric test and a model with additive noise. However, they did not address serial time points nor the use of the FDR in the analysis. Their results are based on a simulation which is not as accurate as the SAS output based on a noncentral F distribution. Black and Doerge (2) also used the *t*-test for the power calculation, but they considered only the number of replications with a Bonferroni correction. Subject level variation and FDR were not considered. Pan et al. (13) discussed sample size based on the mixture model. However, under the null hypothesis when all the means are zero, the mixture model is difficult to identify. Thus it may not be worth the effort to go through such a complex model building procedure at the design stage. The paper by Hwang et al. (9) tried to determine the sample size for finding the best discriminant function for the data similar to those given in Ref. 7. But the goal of finding the discriminant function is different from identifying genes with different expressions. The use of two highly correlated genes adds little discriminant power, but both of them are important to be identified.

## APPENDIX I

### Adjustment of the Significant Level Based on the False Discovery Rate

In single hypothesis testing, the significance level α can be also interpreted as the false discovery rate, FDR, i.e., In multiple hypotheses testing, if all the null hypotheses are true, then the Bonferroni correction is also the same as the FDR. However, if at least *n*_{0} > 0 genes are expected to be accepted, then Moreover where α′ is the adjusted significance level used in each test and *n*_{g}′ is the number of genes with no expression differences. When *n*_{0} = 0, it means that there may be no gene accepted. If this is true, then there is no false discovery. However, *n*_{0} is only the expected number; there may be genes accepted after the experiment. Since the smallest number is 1, α′ is the Bonferroni correction. This gives the max(1, *n*_{0}) in *Eq. 2*.

## APPENDIX II

### Data Analysis and Power Computation

#### Two sample one time point per subject.

The data analysis is based on *Eq. 7*. For a given *w*, Δ = 2Φ^{−1}(1 − *w*), and the power is where F_{ν}^{1}(ξ,λ^{2}) represents the F distribution of 1, ν = *n*_{1} + *n*_{2} − 2 degrees of freedom with noncentrality parameter λ^{2} evaluated at the threshold ξ = F^{1}_{ν, α′}, which is the upper α′ quantile of the F distribution with 1, ν = *n*_{1} + *n*_{2} − 2 degrees of freedom, and with *r* defined by *Eq. 9*. Since *r* and *m* are confounded, we may let *m* = 1 in the computer program by changing *r*.

#### One sample serial time points experiment.

The analysis is based on the mixed model with time as a fixed effect and the subject as a random effect. The SAS procedure is given in the software.

For the power simulation, data are generated by (15) where *k* in *Eq. 12* is embedded into *y*_{jt} and ε_{jt} so they are averages over *m* values, and δ_{j} is a normal random deviate with 0 mean and variance ς_{γ}, ε_{jt} are standard normal deviates, and The generated data is then tested by the mixed model procedure for the significance of the time effect. The proportion of the rejected samples are the estimated power of this test. The mixed model takes care of the random effect from subjects. Users may refer to a SAS example in Ref. 11 (p. 90) to see how SAS program is written.

#### Two sample serial time points experiment.

This analysis is based on a mixed model with time and treatment as fixed effects and the subject as a random effect.

Without loss of generality, we may let *m* = 1 and μ_{1t} = μ_{2t} = 0. For a given *T* and *w*, data are generated by (16) where δ_{ij} and ε_{ijt} are generated by the same procedure in *Eq. 15* with γ_{1} = 0 and The power is again estimated the same way as the one-treatment case.

## APPENDIX III

### Software

The software is located at **http://web.stat.ufl.edu/∼yang/microarray** for the microarray sample size determination.

#### Program one_time.sas.

This subroutine contains the SAS program for one time point study based on *Eq. 3*.

#### Program sequent_1.sas.

This subroutine contains the SAS program for one sample serial time points measurements based on *Eqs. 12* and *13*.

#### Program sequent_2.sas.

This subroutine contains the SAS program for one sample serial time points measurements based on *Eqs. 1* and *14*.

All the input instructions are given by the comments in the program. Examples corresponding to the sample sizes in Fig. 2 and Fig. 3 are in the input (∗.sas) and the output (∗.lst) programs.

## Footnotes

Article published online before print. See web site for date of publication (http://physiolgenomics.physiology.org).

Address for reprint requests and other correspondence: J.-X. She, Center for Biotechnology and Genomic Medicine, Medical College of Georgia, 1120 15th St., PV6B108, Augusta, GA 30912 (E-mail: jshe{at}mail.mcg.edu); or M. C. K. Yang, Dept. of Statistics, Univ. of Florida, Gainesville, FL 32611 (E-mail: yang{at}stat.ufl.edu).

10.1152/physiolgenomics.00037.2003.

- Copyright © 2003 the American Physiological Society