Microarray experimental design: power and sample size considerations

M. C. K. Yang, J. J. Yang, R. A. McIndoe, J. X. She


Gene expression analysis using high-throughput microarray technology has become a powerful approach to study systems biology. The exponential growth in microarray experiments has spawned a number of investigations into the reliability and reproducibility of this type of data. However, the sample size requirements necessary to obtain statistically significant results has not had as much attention. We report here statistical methods for the determination of the sufficient number of subjects necessary to minimize the false discovery rate while maintaining high power to detect differentially expressed genes. Two experimental designs were considered: 1) a comparison between two groups at a single time point, and 2) a comparison of two experimental groups with sequential time points. Computer programs are available for the methods discussed in this paper and are adaptable to more complicated situations.

  • gene expression
  • statistical analysis
  • functional genomics

high-throughput microarray technologies have created unparalleled opportunities to study the mechanism of disease, monitor disease progression, and evaluate effective therapies. Because tens of thousands of genes are investigated simultaneously with a technology that has not been perfected, there is a large uncertainty in the decision process. Therefore, one must rely on statistical analyses of the data to draw conclusions from microarray experiments. A number of methods have been developed to extract useful and robust information from microarray data sets once the hybridizations are complete. However, not as much attention has been paid to the experimental design before the investigator actually completes the experiment.

One of the principal questions while designing a microarray experiment is: What is the sample size required for a microarray experiment to achieve a user defined statistical threshold? To achieve this goal, one must decide how the data will be analyzed and what conclusions can be made. Since microarrays can be used to answer a variety of questions, different experiments will require different designs and analyses. In this study, we will concentrate on finding expression differences between two treatment groups. The goals of the experiments can be classified into three categories: 1) comparing two groups at a single time point, 2) examining the treatment effect over time when the without treatment effect is known, and 3) comparing two groups at sequential time points. We will investigate the sample size requirements necessary to satisfy a specified false discovery rate (FDR) and power (1 − β) given various biological and experimental variation.

Although there are other reports on how to compute the sample size, none of them is as simple and straightforward as the methods presented here. In our experience, we often observe a large number of genes that show expression differences in a typical microarray analysis. The key is that we want to make sure that only a small number in these selected genes are false positives (discoveries). A brief survey of some of the other methods is in the discussion.


cDNA microarrays are printed on glass microscope slides using established protocols (15). The analysis of microarray data can be separated into several stages. The first stage is to eliminate spots with very weak signal, which we define as the signal that is indistinguishable from the background noise. Yang et al. (17) demonstrated that any signal less than 3 standard deviations of the background noise should be discarded. Due to the variation in printing material on the cDNA arrays, the expression is more accurately measured by the ratio of sample intensity vs. intensity from a common reference (5, 12). To derive the ratio, the intensities from both channels have to be normalized. Although there are a number of methods to normalize the two channels, Yang et al. (17) determined that there was actually a very small difference among the different normalization methods (10) when spots with sufficient signal are chosen. In this study, we assume that the analysis will be done after flagging for weak spots, log-transformation, and normalization. Given these assumptions, the useful data is now yijtk, where i is the treatment index, j is the subject index, t is the time index, and k is the replicate index. A few assumptions have to be made in order to obtain the analytic results for sample size determination. The most general model in this study is Math(1) where μit is the mean expression level under the ith treatment at time t, δijt represents the subject variation, and εijtk is the experimental variation. Throughout this paper, μit is a fixed constant (or fixed effect), and δijt is a random sample from subjects under treatment i, or it is a random effect. Some of the indices can also be eliminated when they are not relevant. For example, when the data is not measured over time, the index t is eliminated, and when there is only one treatment, the index i is omitted.

Adjustment of the threshold significance level.

As in the usual hypothesis testing, a significance level α is assigned. When there is only one hypothesis, this guarantees that the chance of incorrectly rejecting the null hypothesis is α. When many hypotheses are tested simultaneously, α cannot be considered as the significance level for rejecting all the true null hypotheses. The level for each individual test has to be adjusted. One option for adjusting the test is to use the Bonferroni inequality, i.e., to set every test at the new significant level α/ng, where ng is the number of hypotheses tested simultaneously. We use the notation ng because this number is the number of genes in the array. However, when ng is very large, the Bonferroni adjustment is too conservative because it is meant to control family-wise error rate (1), i.e., to control the probability of one type I error among thousands of hypotheses. In microarray experiments, it is important to find the best compromise between maximizing the power of detection and minimizing false-positive identification. Very often, although not always, additional follow-up research such as database search and real-time PCR will be done to further eliminate false-positive results (6). Thus the FDR defined as the expected proportion of wrong discovery (1) fits this purpose better. At the sample size determination stage, we need to know approximately the minimum number of genes that will be accepted under a given selection rule. An experimenter may know, from past experience, that at least n0 genes will be accepted, then the adjusted significance level for each gene is Math(2) to guarantee an α FDR. The derivation is given in the appendix i. The value n0 has to be determined by experience or a pilot experiment. In our experience, there are always a considerable number of genes that will show different expressions. If there is no way to determine n0, then the experimenter may let n0 = 0 and determine the most conservative (largest) sample size. Sometimes it may be necessary to do a pilot study to determine n0 and other variance components in the model. For our laboratory, it is common practice to do a pilot study to get an idea of these parameters and then determine the sample size for the required power.

At the data analysis stage, other methods such as permutation tests (4, 16) may be used, but their power are not as easy to compute at the design stage.

Single time point comparison between two treatment groups.

For the case of a comparing between treatments at one time point, one can drop the t index so Eq. 1 becomes Math(3) where i = 1, 2; j = 1, …, ni; k = 1, …, m; δij is the between subjects variation, and εijk is experimental variation. When m > 1, replicates are used to reduce the intermicroarray variation under the same biological condition. We assume that both of δij and εijk are normally distributed with zero mean and standard deviation (SD) ςs and ςe, respectively.

The hypotheses are Math(4) for each gene. After the data has been collected, the mean and variance for each gene is defined by Math(5) Math(6) and the decision rule is to accept H1 at significance level α′ when Math(7) where α′ is defined by Eq. 2, ν = n1 + n2 − 2, and tν,α′ is the upper α′ quantile of the t distribution with ν degrees of freedom. Other types of tests such as unequal variance t-test or a Mann-Whitney test can also be used, but the results should be very similar (18). To determine the sample size, we need to specify the power (1 − β) at a predetermined level of significant gene expression difference |μ1 − μ2|. In traditional statistics, what we need is the value Math(8) and the variance ratio Math(9) It is often difficult to determine what Δ constitutes a significant expression difference. We believe that the overlap concept can make the interpretation of the alternative easier. When the alternative hypothesis is true, we have |μ1 − μ2| > 0. Naturally, any expression closer to μ1 will be classified to treatment 1 and vice versa. Let Math(10) The term w is the proportion of subjects with expression that appears to belong to the other treatment group. Under the model in Eq. 3, the overlapping proportion is symmetric, i.e., the proportions that violate the treatment assignment are the same in both treatments. The relation between the traditional Δ in Eq. 8 and w in Eq. 10 is Math(11) where Φ(x) is the standard normal distribution function. Although Δ and w are equivalent and can be easily converted from one to another, it is our opinion that the idea of overlapping between groups has an easier interpretation. A historical survey and recent developments in using overlapping for sample size determination have been published (3, 8, 14). Hwang et al. (9) has also used the same idea in microarray analysis. To determine the sample size based on Eqs. 7 and 11, we use the noncentral F distribution (see appendix ii).

Serial time points with one treatment group.

This experimental design will measure gene changes under one treatment over time, while the gene expression under no treatment condition is known. Let the time points be 1, 2, …, T. Since there is only one treatment, the index i can be dropped, and Eq. 1 becomes Math(12) where δjt represents the subject variation. Under the null hypothesis, when there is no treatment, we should have Math i.e., gene expression is not affected by the treatment. The test for this hypothesis would be a two-way analysis of variance with time and subject as two discrete factors with the subjects being a random effect. The alternative is that H0 is not true, i.e., gene expression has been changed. To determine the sample size, we need to specify the pattern of {μt} under H1. Moreover, we need to know how this pattern varies among subjects. We use a simple linear expression by letting Math(13) where γ = 0 under H0 (γ ≠ 0 under H1), and the δj terms represent the variation of the subjects response, which are assumed to be independent and identically distributed normal random variables with 0 mean and variance ςγ2(N(0,ςγ2)) and εjtkN(0,ςe2). The reason for this alternative model is that with a linear time effect, the slope usually changes with the subjects. However, the linear relationship is not crucial, because when time is considered discrete, Eq. 13 can be a good approximation to a quadratic change. For example, the pattern in Fig. 1, left, can be expressed linearly when time is reordered. We have tried to use specific alternatives such as a quadratic model (which covers the linear model) to increase the power of the test. However, the increase in power is very limited. One could also make the model more general by adding correlations between time points. But during the design stage, this correlation is difficult to ascertain. The random effect, δj, in the slope should be enough to rep resent the intra-subject correlation.

Fig. 1.

Any reaction where time is considered as a discrete variable can be approximated by a linear function. Note the time points on the right are reordered from the times on the left.

Again, we will use the overlap concept to specify the alternative. In the model in Eq. 13, suppose the alternative is γ > 0. (The case γ < 0 works almost identically.) We let w = the proportion of subjects (in the population) that show the opposite (γ + δj < 0) mRNA expression. The power can be computed by simulation using SAS PROC MIXED, i.e., we simulate the data a large number of times under the alternative and use the proportion of the times that H1 is accepted as the power estimate. Note that the power is not determined by w alone. It is also a function of T. Details are given in the results and appendix ii.

Serial time points with two treatment groups.

This experiment will measure the gene expression changes over time from two treatment groups or from one treatment and one control. The model is now the full model in Eq. 1 with the null hypothesis Math and the general alternative that H0 is not true. This model is based on two factors, time and treatment, and subjects are nested in the treatments. The experiment is to test the treatment effect or the treatment-time interaction. We will not consider the treatment-time interaction in the design stage. To specify an alternative for the sample size determination, more parameters are involved. We follow the linear approximation in Eq. 13 with Math(14) where the δij terms share the same assumptions as those in Eq. 13. Again we can define the alternative by using the overlap idea with w being equal to the proportion of subjects that have the opposite slope (γ + δij < 0) in mRNA expression when it belongs to the treatment group with γ > 0. This model can be expanded to include other alternatives such as the change of intercept. All of these modifications can be easily embedded into the SAS program mentioned in appendix iii.

The details of power computation are again given in appendix ii.


For a single time point.

Figure 2 presents the sample size requirements for a microarray experiment with 5,000 genes (ng) assayed with an FDR of 0.05 (α) and power = 0.85 when n0 = 0, 5, 20, and 50. The sample sizes for the two treatments are assumed to be equal (i.e., n1 = n2 = n) and the number of replicates equals 1 (m). The variance ratio (r) defined in Eq. 9 varies from 0 to 1. For example, if ng = 5,000, r = 0.25, and a 15% or less overlap between genes from the two treatments is considered significant, then 22 subjects per treatment group is required to reach a power of 0.85 (n0 = 0), but only 15 subjects if we expect at least 50 genes will show significance. However, Fig. 2 is used only as a demonstration of the change of sample size under various conditions. The software given in appendix iii will produce the required sample size under any significance level, power, and other parameter values.

Fig. 2.

The required sample sizes under various alternatives based on the overlap ratio w defined by Eq. 11 (or equivalently effect size Δ) and variance ratio r by Eq. 9 under various assumptions on the number of possible selected genes (n0). The curves from the top to bottom are for r = 1.0, 0.75, 0.50, 0.25, and 0.00, respectively. The overall significance is set at alpha = 0.05. The sample size is n = n1 = n2. An example is given in the text.

For serial time points.

Although Fig. 2 covers some useful experimental cases, it is more difficult to provide examples of general interest in serial time point measurements due to the large number of parameters. The power depends not only on the biological/experimental variations as well as the number of time points (T), but also the time unit of T. For example, for a fixed slope γ, the relative size of γT to the experimental standard deviation plays a crucial role. In other words, seven points in 1 day will be different from seven points in a week with the same growth rate γ in Eqs. 13 and 14. Thus the variance ratio (r) alone is not sufficient to determine the sample size.

Figure 3 shows one example on the sample size required for detecting mRNA expression changes over time. We used the data from Yang et al. (17). In that study, we found a ςe ≈ 0.40 for log2-transformed ratios between the sample and reference. If one assumes ng = 5,000, ςγ = 0.5, and 5% or less overlap between the two conditions (effect size = 3.29), and a twofold change in expression at the end of the time series (γT = 1), then the sample size curves for the one-sample and two-sample designs can be obtained. The Bonferroni correction (n0 = 0) and expected 1% of genes to be selected (n0 = 50) for replicate sizes m = 1, 2, and 4 are shown. The FDR is set at 0.05, and the power is set at 0.80.

Fig. 3.

Sample size required by one-sample and two-sample (n1 = n2) sequential experiments based on a specific data set described in the text. The array size is fixed at ng = 5,000. The alternative is a twofold increase in gene expression at the end time. T is the number of observations between time 0 and T. From the top to the bottom, the curves are for m = 1, 2, and 4.

Note that in this design, the end time point is fixed. Thus an increase in the number of time points can merely add points between the predefined start and end points. Since the highest difference happens at the end point, there is only a small increase in power by observing more than four time points. It is also quite intuitive that T can compensate for the number of replicates (m), because both of them increase the information for the treatment effect when the treatment is effective.


This paper provides a reasonable way to estimate the sample size for microarray experiments. The analysis methods are chosen to be simple in order to make the unknown parameters as few as possible. More sophisticated methods may be used after the data has been collected, but based on our experience, different analysis methods usually produce similar results for the prominent gene effects.

The computer software is very easy to comprehend. Thus a user can easily modify the programs when other tests are more appropriate. For example, if treatment-time interactions are a main concern, then this term can be easily added to the SAS procedure. The alternative under other models, if known, can also be easily generated using the simulation program with simple modifications.

The sample size determination discussed in this paper depends on the expected number of genes n0 to be accepted after the experiment. As we mentioned before, if there is no way to determine this number, then the experimenter may choose to use the most conservative number n0 = 0. Optimal sample size depends on experience and the validity of certain assumptions. Of course, there is a certain possibility of failure that an experimenter must face. Even to determine the sample size for the simplest two-sample t-test, we must assume that the two treatments have the same variance. There is no guarantee on this before the experiment is done. Similarly, in most microarray experiments, a large n0 can be expected. If the result shows the assumed n0 is too large, then we must admit our misjudgment and may choose to do more experiments to increase the power. This strategy is usually more economical than choosing a large sample size by always assuming the worst scenario.

Zien at al. (18) have presented a method similar to ours. In addition to the t-test, they also mentioned the Mann-Whitney nonparametric test and a model with additive noise. However, they did not address serial time points nor the use of the FDR in the analysis. Their results are based on a simulation which is not as accurate as the SAS output based on a noncentral F distribution. Black and Doerge (2) also used the t-test for the power calculation, but they considered only the number of replications with a Bonferroni correction. Subject level variation and FDR were not considered. Pan et al. (13) discussed sample size based on the mixture model. However, under the null hypothesis when all the means are zero, the mixture model is difficult to identify. Thus it may not be worth the effort to go through such a complex model building procedure at the design stage. The paper by Hwang et al. (9) tried to determine the sample size for finding the best discriminant function for the data similar to those given in Ref. 7. But the goal of finding the discriminant function is different from identifying genes with different expressions. The use of two highly correlated genes adds little discriminant power, but both of them are important to be identified.


Adjustment of the Significant Level Based on the False Discovery Rate

In single hypothesis testing, the significance level α can be also interpreted as the false discovery rate, FDR, i.e., Math In multiple hypotheses testing, if all the null hypotheses are true, then the Bonferroni correction is also the same as the FDR. However, if at least n0 > 0 genes are expected to be accepted, then Math Moreover Math where α′ is the adjusted significance level used in each test and ng′ is the number of genes with no expression differences. When n0 = 0, it means that there may be no gene accepted. If this is true, then there is no false discovery. However, n0 is only the expected number; there may be genes accepted after the experiment. Since the smallest number is 1, α′ is the Bonferroni correction. This gives the max(1, n0) in Eq. 2.


Data Analysis and Power Computation

Two sample one time point per subject.

The data analysis is based on Eq. 7. For a given w, Δ = 2Φ−1(1 − w), and the power is Math where Fν1(ξ,λ2) represents the F distribution of 1, ν = n1 + n2 − 2 degrees of freedom with noncentrality parameter λ2 evaluated at the threshold ξ = F1ν, α′, which is the upper α′ quantile of the F distribution with 1, ν = n1 + n2 − 2 degrees of freedom, and Math with r defined by Eq. 9. Since r and m are confounded, we may let m = 1 in the computer program by changing r.

One sample serial time points experiment.

The analysis is based on the mixed model with time as a fixed effect and the subject as a random effect. The SAS procedure is given in the software.

For the power simulation, data are generated by Math(15) where k in Eq. 12 is embedded into yjt and εjt so they are averages over m values, and δj is a normal random deviate with 0 mean and variance ςγ, εjt are standard normal deviates, and Math The generated data is then tested by the mixed model procedure for the significance of the time effect. The proportion of the rejected samples are the estimated power of this test. The mixed model takes care of the random effect from subjects. Users may refer to a SAS example in Ref. 11 (p. 90) to see how SAS program is written.

Two sample serial time points experiment.

This analysis is based on a mixed model with time and treatment as fixed effects and the subject as a random effect.

Without loss of generality, we may let m = 1 and μ1t = μ2t = 0. For a given T and w, data are generated by Math(16) where δij and εijt are generated by the same procedure in Eq. 15 with γ1 = 0 and Math The power is again estimated the same way as the one-treatment case.



The software is located at http://web.stat.ufl.edu/∼yang/microarray for the microarray sample size determination.

Program one_time.sas.

This subroutine contains the SAS program for one time point study based on Eq. 3.

Program sequent_1.sas.

This subroutine contains the SAS program for one sample serial time points measurements based on Eqs. 12 and 13.

Program sequent_2.sas.

This subroutine contains the SAS program for one sample serial time points measurements based on Eqs. 1 and 14.

All the input instructions are given by the comments in the program. Examples corresponding to the sample sizes in Fig. 2 and Fig. 3 are in the input (∗.sas) and the output (∗.lst) programs.


  • Article published online before print. See web site for date of publication (http://physiolgenomics.physiology.org).

    Address for reprint requests and other correspondence: J.-X. She, Center for Biotechnology and Genomic Medicine, Medical College of Georgia, 1120 15th St., PV6B108, Augusta, GA 30912 (E-mail: jshe{at}mail.mcg.edu); or M. C. K. Yang, Dept. of Statistics, Univ. of Florida, Gainesville, FL 32611 (E-mail: yang{at}stat.ufl.edu).



View Abstract