## Abstract

Gene expression microarrays have been the vanguard of new analytic approaches in high-dimensional biology. Draft sequences of several genomes coupled with new technologies allow study of the influences and responses of entire genomes rather than isolated genes. This has opened a new realm of highly dimensional biology where questions involve multiplicity at unprecedented scales: thousands of genetic polymorphisms, gene expression levels, protein measurements, genetic sequences, or any combination of these and their interactions. Such situations demand creative approaches to the processes of inference, estimation, prediction, classification, and study design. Although bench scientists intuitively grasp the need for flexibility in the inferential process, the elaboration of formal supporting statistical frameworks is just at the very start. Here, we will discuss some of the unique statistical challenges facing investigators studying high-dimensional biology, describe some approaches being developed by statistical scientists, and offer an epistemological framework for the validation of proffered statistical procedures. A key theme will be the challenge in providing methods that a statistician judges to be sound and a biologist finds informative. The shift from family-wise error rate control to false discovery rate estimation and to assessment of ranking and other forms of stability will be portrayed as illustrative of approaches to this challenge.

- statistical genonics
- proteomics
- microarray experiments

here, we will discuss some of the unique statistical challenges facing investigators studying high-dimensional biology, describe some approaches being developed by statistical scientists, and offer an epistemological framework for the validation of proffered statistical procedures.

## 1: CHALLENGES OF HIGH-DIMENSIONAL BIOLOGY

Advances in modern technologies have led to the generation of tremendous amounts of genomic, proteomic, and other “omic” data. The term “high-dimensional biology” (HDB) has been proposed for investigations involving such high-throughput data. Types of HDB information are whole genome sequences and polymorphisms, expression levels of genes, protein abundance measurements, and combinations thereof. New tools address important biological questions such as the following. What is a gene (54)? How many genes do we have, and how can we determine their functions? What are the fundamental principles of metabolic processes? The identification of biomarkers, effects of mutations, and effects of drug treatments and the investigation of diseases as multifactorial phenomena can now be accomplished on an unprecedented scale.^{1}

Our paper is targeted to statisticians, biologists, and those “hybrids” that increasingly bridge the interface. This is important because genomics is a rapidly evolving field, and we believe that progress can best be made when these mutually dependent disciplines work together. Therefore, we try to appeal to each of these groups, recognizing that, in some places, the level may be seen as somewhat esoteric for one and somewhat basic for another. We discuss various issues related to drawing inferences from the HDB data. These include the increasing interest in questions involving the testing of multiple propositions simultaneously; appropriate inferential indicators for the types of questions biologists are interested in; and the need for replication of results across independent results, even (perhaps especially) in the face of extremely small *P* values.

The novel and challenging biological questions asked from HDB data have resulted in many specialized analytic techniques being developed. Often statistical methods applied to HDB data lack demonstration of their validity. It is important to evaluate the validity of the statistical methods applied (50, 86). In particular, we discuss resampling-based approaches for HDB data as a case study and illustrate some of the epistemological issues related to such inferential methods.

In summary, questions that biologists want to ask from HDB data require novel analytic approaches. Thus it is important that the statistical methods applied to HDB data are aimed at drawing inferences biologists are interested in and also that these analytic methods have sound epistemological foundations. While design issues are as important as inferential issues, we do not dwell on them here. Issues such as sample size and power estimation (24, 45, 62, 83) technical and biological replication (60), for definition of these terms see (11), and optimal pairing of samples (as well as dye labeling assignment) in two-channel microarray assays (76), and other topics are addressed elsewhere in the literature (6, 11, 17, 18, 39, 49, 52, 59, 65, 84).

## 2: INFERENCE IN HDB

We can address many questions via HDB data. Examples include the following. Which genes cause or are associated with a disease? Which genes are involved in a particular pathway? Which genes are differentially expressed under which condition? We will address some of the inferential procedures used to answer these questions.

### 2.1: A Framework for One Gene

For ease of illustration, suppose there are two treatments, *treatment A* and *treatment B*, and gene expression is recorded for only one gene on each of *N* arrays. (We use the terms array and tissue/organism/etc. interchangeably, assuming that each biological specimen will have its own array.) Furthermore, suppose that *Y* is the variable denoting the outcome under *treatment A*, and *X* is the variable denoting the outcome under *treatment B*. We are assuming that each array presents a single measurement for each gene, such as occurs with single-channel arrays or two-color arrays where a ratio or log ratio represents “expression” for a gene. The *N* arrays are randomly divided into two treatment groups, *group A* and *group B*, of size *n* and *m*, respectively, where *n* + *m* = *N*. The standard random sampling framework is that *Y*_{1},…,*Y*_{n}∼^{iid} *F*_{Y}(*y*;μ_{Y},σ_{Y}) and *X*_{1},…,*X*_{m}∼^{iid} *F*_{X}(*x*;μ_{X},σ_{X}), where *iid* stands for identically and independently distributed, and the parameters μ_{Y},σ_{Y} and μ_{X},σ_{X} designate the mean and standard deviation of the two population models *F*_{Y} and *F*_{X}, respectively. When we are interested in determining whether the gene is differentially expressed because of treatment, the quantity representing this differential expression is frequently μ_{Y} − μ_{X}. A null hypothesis of no differential expression, *Ho*:μ_{Y} = μ_{X}, is evaluated using a test statistic computed on observed data and a *P* value computed under some assumed reference distribution (i.e., a distribution of a test statistic when *Ho* is true). The choice of an appropriate reference distribution is important for computing valid *P* values leading to meaningful tests of differential expression. Valid *P* values will be obtained if the required assumptions for the test are met, and they may not be valid of one or more assumptions are not met. In microarray experiments, sample sizes are often not large, and the expression levels may not be normally distributed. Alternative methods for obtaining a reference distribution fall under a category of methods sometimes called nonparametric or distribution free. These terms are slightly misleading in that parameters may still be estimated using these methods, and some assumptions regarding the distribution of data are required. The hypothesis-testing framework above refers to a classical frequentist approach. There are alternative approaches such as Bayesian. For more information about such approaches, please refer to *section 2.5*. We discuss two types of techniques that use the observed data and resampling strategies to produce a reference distribution for a test statistic.

### 2.2: Resampling-Based Procedures for HDB Data

Resampling-based inference (RBI) relies on resampling data instead of theoretical distributions to make inferences. RBI has the advantage of being robust and flexible enough to accommodate almost any novel statistic [e.g., the statistic after shrinkage of variance, from Cui et al. (15)] without the need for methodologists to perform analytic derivations but has the disadvantage of being computationally intensive. There are marked differences in how such approaches are implemented, and some confusion and uncertainty remain. Microarray investigators who use RBI often do not discuss these issues nor state why one RBI approach (e.g., bootstrap) is chosen over another (e.g., permutation testing)^{2} or vice versa.

Frequently, unrecognized complexities arise when RBI is used for complex experiments, for example, when testing the difference between two groups (e.g., old and young mice) after controlling for some other factors (e.g., body fat). There are multiple ways to permute data in such circumstances, and only some will produce valid inferences (38). Another issue is the sampling unit, where a common error occurs in treating the gene rather than the case as the unit of analysis, such as in gene class testing (GCT) (40, 86). GCT attempts to conduct more powerful or informative analyses by testing for differential expression in entire predefined classes of genes rather than each gene singly. However, some GCT methods erroneously resample the genes rather than the cases. This type of resampling essentially ignores dependence among genes, ignores sample size, and can result in nonsensical results (e.g., tests whose power does not increase with sample size). Fortunately, several newer GCT methods are available that do not make this error (3, 27, 47). Some software use resampling procedures [e.g., the significance analysis of microarrays (SAM) algorithm] (70) but combine all resampled test statistics across all genes to obtain very small *P* values. This practice of combining all resampled statistics is valid under the assumptions that *1*) the null distribution of the test statistic is the same for all transcripts, and *2*) all transcripts are independent. Unfortunately, there is no reason to assume either (40, 41). Therefore, some microarray software leaves the decision to the user and offers a choice of pooling or not pooling the resampled test statistics across genes (15, 78). Consequently, how we can obtain the benefit of pooling RBI statistics across transcripts without incurring the detriment of falsely assuming *1*) or *2*) is an important question meriting research. Two RBI techniques, randomization and the bootstrap, are discussed in more detail below.

#### 2.2.1: Randomization.

Continuing with the one-gene framework in *section 2.1*, where *n* arrays are randomly selected to receive *treatment A* (i.e., *Y* is observed) and the other *m* receive *treatment B* (*X* is observed), the null hypothesis (*Ho*) is described as a statement: the treatment assignment has no effect on the gene expression. A test statistic, *t*, is computed to represent a difference in gene expression due to treatment.

Suppose that *r*_{j} is the gene expression level of the *j*th array, which represents a value of *Y*_{j} or *X*_{j} depending on the treatment assignment for that array. The vector (*r*_{1},…,*r*_{N}) represents the gene expression outcomes from the experiment for all *N* arrays. Under *Ho*, a randomization distribution of the test statistic is created by randomly choosing *n* values from (*r*_{1},…,*r*_{N}) to represent outcomes to *treatment A* with the other *m* values representing outcomes to *treatment B*. Each time this is done, a test statistic, *t**, is computed, yielding *t*_{1},…,*t*_{C}, where *C* = *N*!/(*n*!*m*!) is the number of unique treatment assignments, and each value of *t** is equally likely under *Ho*. The *P* value is then computed as the proportion of *t** that is as extreme or more extreme than the test statistic *t* observed from the data produced from the actual treatment assignment. The exactness of this *P* value depends on an assumption that is sometimes called unit-treatment additivity, meaning *Y* = *X* + τ, where τ is a constant treatment effect; τ = 0 under the null hypothesis of no differential expression. Note that unit-treatment additivity necessarily implies that the variances of the two outcome variables, *Y* and *X*, are the same and that the two distributions above, *F*_{Y} and *F*_{X}, differ only in their location (i.e., their mean). This assumption is sometimes referred to as exchangeability, meaning observations are exchangeable across treatment conditions under *Ho* or that the joint distribution of (*X*, *Y*) is the same as that of (*Y*, *X*).

As sample sizes become larger, the number *C* above becomes too large to allow for computation of exact *P* values. In this case a Monte Carlo approximation (i.e., choosing, for example, *C* = 10,000 random permutations) can give accurate results. If sample sizes are too small, these randomization tests produce *P* values that are too coarse, and it will often be algebraically impossible to obtain *P* values below some specified level (25). For instance, if *N* = 6 and *n* = *m* = 3, only 10 two-tailed *P* values are possible, the smallest being 0.1. In a microarray experiment, this makes using these *P* values for estimating false discovery rates and determining sets of genes that are most significant difficult at best. Note that the common rank-based tests (e.g., Mann-Whitney and Wilcoxon rank tests) are randomization tests after the appropriate transformation of data to ranks. Another limitation of randomization tests is that, in more complicated designs (e.g., multiple treatment groups and blocking variables, possibly with continuous covariates), implementing the permutations can be more complex and computationally time consuming. In some situations, the bootstrap can be useful.

#### 2.2.2: Bootstrap.

The bootstrap is another resampling approach, and it can produce results similar to those of randomization tests in some applications although it is fundamentally different (20). Consider the one-gene framework in *section 2.1* where *Y*_{1},…,*Y*_{n}∼^{iid} *F*_{Y}(*y*;μ_{Y},σ_{Y}) and *X*_{1},…,*X*_{m}∼^{iid} *F*_{X}(*x*;μ_{X},σ_{X}). The bootstrap works because the empirical distribution function of the data is a nonparametric maximum likelihood estimate of the population distributions. The empirical distribution function for *Y*, for example, is a discrete distribution with a mass of 1/*n* on each observed *Y* value. Repeated sampling from a population is approximated by resampling with replacement from the original sample. This is the usual nonparametric implementation of the bootstrap, but there are others. One is the parametric bootstrap, where the form of the distribution (say normal) is assumed for *F*_{Y} and *F*_{X} but the parameters are estimated from the data. Bootstrap samples are then drawn from the estimated normal distribution.

In hypothesis testing, care must be taken when implementing the resampling procedure to ensure that the bootstrap samples are drawn under the case where the null hypothesis is true. Hall and Wilson (29) provide some guidelines for appropriate resampling and for obtaining more accurate *P* values using the bootstrap. There are advantages and disadvantages of bootstrap tests relative to other resampling-based tests. See Ref. 28 for more details.

### 2.3: Differential Expression for a Study of K Genes

We now consider resampling strategies for simultaneously testing hypotheses for differential expression on *K* genes. A clear understanding of the sampling unit is critical for the proper implementation of the resampling techniques described earlier. The distribution models become multivariate, and the first of the two samples is now denoted as follows: *Y*_{1}, *Y*_{2},…, *Y*_{n}∼^{iid} *F*_{Y}(*y*,μ_{Y},∑_{Y}), where *Y*_{j} = (*Y*_{1j},…,*Y*_{Kj}), is a *K*-dimensional vector of gene expression values for the *j*th array, μ_{Y} is a *K*-dimensional vector of the mean expression for each gene represented on the arrays, and ∑_{Y} is a *K* × *K* variance-covariance matrix. Similarly, the second sample, *X*_{1},…,*X*_{m}∼^{iid} *F*_{X}(*x*,μ_{X},∑_{X}), is now a multivariate random sample from a multivariate population model. The earlier discussions for testing a single gene for differential expression across the two treatment conditions would, with this multivariate structure, now be a test on a marginal distribution for a single gene; there are *K* marginal distributions, one for each gene. Gene-specific tests for differential expression seem to be the preferred method for analyzing microarray data, although methods that test for classes of genes have been proposed [e.g., Barry et al.(3)]. The result from gene-specific tests is a distribution of *K* test statistics or *P* values. The most differentially expressed genes are determined by a ranking procedure or assignment of a posterior probability (refer to *section 2.5* for discussion on the Bayesian approaches) of being differentially expressed. As mentioned earlier, bootstrap and randomization tests can produce coarse distributions of test statistics (and, hence, *P* values), making it impossible to identify a list of the most promising candidate genes. It is tempting, therefore, to take advantage of the large number of genes and permute or resample genes themselves. However, genes are not exchangeable, and variance of gene expression values is not homogeneous across genes.

The variance-covariance matrix for each distribution can be written as, for example, ∑_{Y} = *diag*(σ_{Y1},…, σ_{YK})*P*_{Y}*diag*(σ_{Y1},…, σ_{YK}), where *diag*(σ_{Y1},…,σ_{YK}) is a diagonal matrix with entries σ_{Yi}, which are the standard deviations of gene expressions for the *i*th gene under *treatment A*, and *P*_{Y} is a *K* × *K* correlation matrix with 1 on the diagonal and off-diagonal entries ρ_{Y,i,i′}, which is the correlation between gene expression levels for gene *i* and gene *i*′ under *treatment A*. Similar notation holds for random variables under *treatment B*, except that the variable *Y* is replaced by *X*. Resampling or permuting at the level of the gene will not preserve the structure of the correlation matrix. The appropriate procedure to mimic sampling from the multivariate distributions, *F*_{Y} and *F*_{X}, will require sampling at the level of the array. To overcome coarseness in the distribution of test statistics, some have fit a model to the gene expression data that includes gene effects and resampling or permuting the residuals from the model (e.g., Ref. 14). Whether the ANOVA model produced independent residuals with equal variance (across genes) is not clear. Methods involving mixed-effects models and/or empirical Bayesian methods involving variance shrinkage have been proposed to address inferential issues associated with unequal variances across genes (2, 23, 33, 77).

If one develops a method that requires exchangeability across genes (or sets of genes), then the sensitivity to violations of this assumption could be evaluated using simulation procedures, as in Allison et al. (1) or Gadbury et al. (25). Others have considered the issues of obtaining a correct reference distribution using appropriate resampling procedures (34, 80). Directly related to the issues of appropriate resampling procedures for valid inference are the issues of how to simulate data appropriately, that is, to mimic a sample produced from *F*_{Y} and/or *F*_{X}. More discussion of this appears in *section 3.2*.

### 2.4: Intersection-Union Testing

Biologists often wish to address questions involving multiple propositions simultaneously. For example, they may wish to ask whether a particular gene is differentially expressed in both of two (or more) tissues in response to some common stimulus or whether a particular gene is linked to a particular phenotype in both of two (or more) species. These questions about multiple propositions involve “and” rather than “or” questions. Investigators are asking whether each of several propositions is true, not whether any of several propositions is true. This is an important goal for biologists, and the distinction is important.

Traditionally, statisticians have devoted much attention and effort to testing multiple hypotheses or propositions simultaneously. Much of the multiple-testing literature has involved union-intersection tests (UIT), where the goal is to test the intersection of all null hypotheses against the union of all alternative hypotheses. All of the classic corrections for multiple testing are UITs. Examples can be found in the traditional literature on this topic (69) and even in the more modern literature that uses resampling-based methods [e.g., Westfall and Young (75)]. UITs test whether any of the null hypotheses is false, not whether all are false. In many cases, this is wholly appropriate, but this does not fully address the questions that biologists have when they ask about multiple propositions. To address some questions posed by biologists, such as in the examples offered above, intersection-union testing (IUT) is required (5, 64).

Most investigators using applications are unfamiliar with the formal aspects of IUTs. In trying to accomplish their goals, they often use homegrown approaches. For example, Kyng et al. (42) investigated differences in gene expression among cell lines of normal young individuals, elderly individuals, and individuals with Werner syndrome (WS; a disease of premature aging). They compared the cell lines of the old with those of the young and identified a number of transcripts that appeared to have significant differences. They then compared the cell lines from the WS individuals with those of the young normal individuals and also obtained a list of transcripts that appeared to have significant differences. The authors then noted that a great deal of overlap existed between the lists of genes obtained in these two analyses. They interpreted this to be evidence supporting the conjecture that WS is, at the transcript level, a process highly similar to accelerated normal aging. Although the conjecture may be correct, this particular line of evidence in support of it is highly questionable. First, by using a common control group, the authors introduced a dependency into the sample mean differences for old vs. young and WS vs. young. A model proper for examining whether the degree to which such overlap in the lists of differentially expressed genes is above that expected by chance would need to take this into account. Second, the authors do not consider that multiple transcripts may be correlated, that is, that not all gene transcription levels are dependent. This further complicates matters and renders highly questionable simply looking at the proportion as, for example, a two-by-two table of differentially expressed and not differentially expressed in each of the two conditions. Finally, in treating genes as simply having or not having evidence for differential expression, one loses the continuity of the data and the evidence they have to offer. Although statistical methodologists have spent a great deal of effort on multiple-testing corrections, investigators have been given little help with the IUT framework.

Classical IUT entails the use of the MIN-Test (44). In the MIN-Test, one rejects the union of null hypotheses in favor of the intersection of alternative hypotheses if and only if the largest *P* value for the statistical significance test on each of the individual component hypotheses is less than or equal to some preset alpha (α) level. This approach has several problems, the first of which is that the IUT will not have a predefined size. That is, using this approach, one can only state that one's type I error rate for the IUTs will be less than or equal to α, not equal to α. Because of this, the tests will not be very powerful under many circumstances. The second problem is that this method again does not fully utilize the continuous information available in the data about the strength of evidence. For example, if *investigator A* conducted two-component hypothesis tests (e.g., gene *X* is differentially expressed under *condition 1*, and gene *X* is differentially expressed under *condition 2*) and obtained *P* values of 0.049 and 0.048 for the two component hypotheses, *investigator A* would be able to state that the IUT was significant at the 0.049 level. Similarly, if *investigator B* conducted the same two experiments and obtained *P* values of 0.049 and 10^{−20}, *investigator B* would still only be able to say that the IUT null hypothesis was rejected at the α level of 0.049. This seems incongruous given the clearly differential evidence that the two investigators obtained. Unfortunately, getting out of this difficulty may not be possible in a strict frequentist paradigm.

As we begin to integrate more information across more experiments, conditions, methods, tissues, and species and move into the age of “integromics” (37, 74), omic investigators will need IUTs more frequently. It is incumbent on us to give the people what they want and begin developing better IUTs. The Bayesian framework in which one fits models to large collections of data by modeling the *P* values (1), test statistics (35), or effect-size indicators (51) may offer an excellent approach in this regard. Additional research is warranted to extend the methods developed from simply looking at a single vector of results from one microarray experiment to examining multiple vectors of results from multiple microarray or other omic-level experiments.

### 2.5: Which Inferential Indicator Do Biologists Want and Understand?

The staple of inference in modern statistics is the frequentist *P* value. As mentioned in *section 2.1*, this value indicates the probability of obtaining data that depart as much or more from the expectation under the null hypothesis as the data that were actually observed if, in fact, the null hypothesis were true. Perhaps not surprisingly, this somewhat cumbersome description is not easily grasped by many nonstatisticians. Indeed, survey research has shown that many nonstatisticians believe that a *P* value indicates the probability that the null hypothesis is false (79). A smaller but still nontrivial proportion of nonstatisticians believe that a *P* value indicates the probability that a result will not replicate if the experiment is repeated. The fact that most nonstatisticians misunderstand *P* value suggests two things. First, if we are to continue using *P* values, we need to work more diligently to help our colleagues understand what they are. Second, classic frequentist thinking may not effectively capture how most nonstatisticians think, and they may be more amenable to other indicators of evidential strength. Most physicians and physiologists seem intuitively to be Bayesians and comfortable talking about the probability that a null hypothesis is true or false. That is, they are interested in a probability that the null hypothesis is true given the observed data rather than probabilities associated with observed data given a true null hypothesis, which is the basis for the definition of a *P* value. Bayesian techniques that facilitate these interpretations are available and being used more frequently in HDB investigations (81).

It is ironic then that, as the current age of discovery-based HDB bloomed, the first reaction of many statisticians was to offer family-wise error rate (FWER) methods (19, 72, 73). FWER is the probability of making one or more type I errors in a family or set of comparisons or inferences (21). For example, in a microarray study for detecting differentially expressed genes between two conditions, FWER could be defined as the probability of incorrectly identifying at least one differentially expressed gene among those genes that are not truly differentially expressed (48). Although this has some appeal and some correction for multiple testing does seem necessary, when we communicate with our colleagues who use these applications and ask “Do you wish us to be certain, within some very small probability such as 5%, that we never tell you that a particular gene is differentially expressed (or linked, or associated, etc.) in this study of tens of thousands of genes when in fact it is not differentially expressed (or linked, or associated, etc.)?”, the response is invariably “No.” Our colleagues in the biology labs tell us that they do not mind if we give them a few false positives as long as a substantial proportion of the results offered are not false positives. This essentially describes the false discovery rate (FDR), and it appears that biologists are far more interested in maintaining a reasonably low FDR than a FWER. Loosely defined, FDR is the expected proportion of “findings” that are in fact erroneous (for further discussion and some useful techniques, see Refs. 67, 68). How low an FDR must be may vary from investigator to investigator and context to context, but it suggests that, as statisticians, we may have overemphasized FWER control. Statistical geneticists have embraced this, and an enormous amount of research over the past 5 years has gone into building newer and better FDR procedures.

A detailed review of specific FDR methods and related approaches is beyond the scope of this paper. To place FDR and related techniques in historical and conceptual context, we note that the concept of posterior probabilities having a relation to *P* values goes back for decades (58). The idea of controlling FDR by examining distributions of *P* values goes back at least to 1982 (61), and, to our knowledge, the term FDR was coined and the concept first thoroughly elucidated in 1995 (4). Calculations of posterior probabilities (a very close relation to FDR estimation techniques) in microarray work appeared in the published literature at least as early as 2000 (45). Modeling the *P* values in microarray studies to produce FDR and posterior probability estimates appeared in the published literature by 2002 (1). Several papers reviewing and exploring the connections among FDR and related techniques have now appeared (57), and papers comparing the empirical performance of various FDR methods are starting to appear. Interestingly, such empirical comparisons do not show any one method to be consistently better than all others (7, 53, 56, 82).

This work on FDR procedures has largely been spurred by microarray research but is now being embraced in many other areas. However, one must wonder whether even FDR is what we are really interested in. Consider the case of a very complex trait, obesity. Given the number of genes that have been knocked out and shown to cause an obesity phenotype, ethylnitrosourea (ENU) mutagenesis deletions that have been created, and resulting obesity phenotypes, Jurgen Naggert (personal communication, 2004) estimated that there could easily be on the order of several thousand genes contributing to obesity. If we allow ourselves a rough assumption that these genes are approximately evenly distributed throughout the genome, no region of the genome cannot be linked to obesity. If every region of the genome is linked to obesity, then in what sense can we actually even talk about false discoveries in a genome scan for obesity? There can be no false discoveries because every area of the genome is linked to obesity. The question of interest may instead be about which other regions have the strongest linkage with obesity.

We are no longer interested in strictly testing the truth or falsity of null hypotheses but, rather, in ranking the alternative hypotheses with respect to the strength of evidence in favor of them and/or the strength of the parameter quantifying the alternative hypothesis. How shall we best give investigators information about this ranking, and how shall we quantify our confidence in any given ranking? This is an area that people are beginning to investigate (55). More research is clearly needed in this area, and we look forward to having this research come to fruition.

### 2.6: Need for Replication vs. a Single Small P Value

In this section, we refer to replication in its traditional usage as attempting to reproduce entire studies through obtaining new data and not to the design issue of obtaining multiple data points within a study [with respect to the latter, see Churchill (11)]. Many investigators have long held the need for replication before a finding is accepted. Such calls for replication have been especially strong within the community of investigators studying complex genetics. Recently, a method was proposed in which a very large number of hypothesis tests were conducted in a single sample, and then a few of the most significant results were confirmed or “replicated” in the same data by using a different test that was independent of the first test (43). This creative approach is likely to be much used but may not give applied investigators what they want and need.

Years ago, Lykken (46) described different types of replication. He offered that perhaps we should be most confident in a finding when we can replicate it using methods most different from those that resulted in the original finding. This call for “constructive replication” seems to be nowhere more appropriate than in the field of complex trait genetics. For example, Zhang et al. (87) reported a finding that the gene encoding the enzyme responsible for the synthesis of brain serotonin, tryptophan hydroxylase-2 (hTPH2), exhibited a functional variant associated with major depression. This variant, GA1463A, attenuated the capacity of hTPH2 in serotonin production by 80%. This study also observed a strikingly high frequency of this variant in a cohort of unipolar depressed subjects. However, independent large studies later by several other groups were not able to reproduce these results or even find that the polymorphisms existed (26, 71, 90). These puzzling findings suggest that no amount of replication within a single sample, however statistically sophisticated, can substitute for true constructive replication. This problem is not unique to the field of genomics. In 2000, Siefe (63) investigated what might be called the “5 sigma problem,” whereby certain findings from the field of physics, although seemingly incontrovertible from the basis of their vanishingly small *P* values, nevertheless turned out to be incorrect. Apparently, this occurrence is not all that uncommon. According to Siefe, this may result from an overly optimized experiment to detect an event or systematic technical errors (e.g., incorrect settings in computers or a faulty lens used for recording observations) getting introduced in an experiment, leading to 5 sigma results.

Examples of 5 sigma results that fail to replicate abound in the field of complex trait genomics. For example, an exciting paper on obesity was published recently (30). This paper found that a polymorphism in the gene INSIG2 was highly significantly associated with obesity in multiple samples. The initial finding came by use of the self-replication method mentioned earlier and held up for multiple separate samples. In this same paper, however, a large sample of 2,700 subjects failed to confirm the finding. The failure to find such a result seems most unlikely given the sample size of 2,700 and given its relative strength in the other samples. Moreover, in commentary in the journal, P. Froguel (13) said that he also failed to observe the result in a separate sample of 10,000 subjects. This clearly seems like a 5 sigma anomaly. It underscores the need for true constructive replications and perhaps the need for our scientific community to begin more formal investigations of how to conduct and consider replication findings.

## 3: EPISTEMOLOGICAL FOUNDATIONS OF RESAMPLING-BASED INFERENTIAL METHODS

Epistemology is a branch of philosophy that deals with the nature, origin, and scope of knowledge. Mehta et al. (50) and Zakharkin et al. (86) offer guidelines about evaluating the validity of statistical methods in HDB and illustrate some of the seemingly obvious but not universally appreciated statistical issues. In an experiment, we observe measurements of variables from samples drawn from populations, and these observations form our base data. Using these data, we wish to make inferences from imperfect measurements about the real variables they represent and then from the sample cases to the whole population. Some authors resample across genes when trying to assess the stability of certain microarray results, whereas the samples-to-population perspective holds that the sampling units should be cases (e.g., mice) and not genes (22). A very recent paper from Klebanov and Yakovlev (40) explains the rationale of not drawing inference by using genes as sampling units and the risk involved in doing so. Similarly, methods were proposed where inferences about gene expression differences between populations are made by comparing observed sample differences with an estimated null distribution of differences based on technical rather than biological replicates. This conflates the standard error of measurement with the standard error of the sample statistic. We offer a framework for evaluating the validity of inferential methods for HDB.

### 3.1: Mathematical Proofs

Methods should be evaluated with respect to well-defined objective criteria such as sensitivity, specificity, type I error rate control, power, unbiasedness, reproducibility, etc. The classical way to characterize the operating characteristics of methods is via mathematical proofs. Mathematical proofs are well accepted but can be difficult, if not impossible, in many situations, particularly when typical assumptions (e.g., independence and distribution assumptions) are not met. In such situations, the use of simulations becomes key.

### 3.2: Simulations

The importance of simulations is well known to many statisticians, and such readers may choose to skip this section. Simulations are valuable, because some truth about *F*_{Y} and *F*_{X} is known to the investigator, who can then determine how well a statistical method reveals this truth. Initial procedures for simulating microarray data assumed *F*_{Y} and *F*_{X} to have a parametric form (often a normal distribution). Methods for determining the mean vectors for these distributions were usually reasonable, determining the variances were more difficult, and determining the correlation matrices, *P*, were nearly impossible. Many simulations used the identity matrix for *P* and simulated genes as independent variables. Allison et al. (1) and Gadbury et al. (25) used a block diagonal structure for *P*. For example, if 10,000 genes are being studied, then *P* would be block diagonal with 20 blocks of dimension 500. Each block would be an equicorrelation matrix with 1 on the diagonal and ρ everywhere else, and ρ would vary over different simulation runs to simulate weak, medium, or strong dependence among groups of genes. Although some argument could be made that this is reasonable, it is likely far from ideal, in that the correlation structure of an actual data set is unknown and negative values for ρ produce negative definite matrices. As larger studies have become available, it has made sense to borrow the concept of the bootstrap to simulate data with the same variance-covariance structure as an actual data set. This has been termed a plasmode simulation and is described later.

### 3.3: Plasmodes

#### 3.3.1: What are plasmodes?

Concerns about how well simulated data correspond to reality can partially be addressed by using plasmodes. To our knowledge, the term “plasmode” was introduced as early as 1967 by Cattell and Jaspars (8). A plasmode is a real (i.e., not computer simulated but from actual biological specimens) data set for which some aspect of the truth is known. Plasmodes can be used to learn about the validity and lack of validity of certain statistical methods for microarray analysis. The great advantage of plasmodes is that, unlike with computer simulations, one need not question whether the particular distributions or correlations are realistic because they are taken directly from real data. The availability of large plasmode databases of microarray data allows a method to be evaluated with hundreds if not thousands of data sets.

#### 3.3.2: Different ways of creating plasmodes.

##### 3.3.2.1: IN WET LAB.

Plasmode data sets can be created in wet labs by directly manipulating biological samples. A simple example is a real microarray data set with specific mRNAs spiked in (12, 32). Evaluating whether a particular method can correctly detect the spiked mRNAs gives information about the method's ability to detect gene expression. An outstanding resource for the field is Affycomp (32), a set of tools and plasmode (spike-in) data sets on an integrated web site that allows investigators to analyze the same benchmark data sets using a new method and then compare results with those of many other methods that have been applied to the same data (31, 89). The nonspecific binding data set in Zhijin et al. (88) and the “controlled” data set in Choe et al. (9) can also be considered to be plasmode data sets created in wet labs. With respect to the latter data set (9), there has been recent debate as to whether it was constructed in such a way that allows it to be legitimately useful for method evaluation (10, 16). Nevertheless, plasmodes are the start of a valuable resource for the scientific community. Certain data sets are commonly used (e.g., yeast cell cycle data) and could become de facto standards for methods evaluation (66).

##### 3.3.2.2: APPLYING RESAMPLING-BASED METHODS ON REAL DATA SETS WITH UNKNOWN TRUTH.

Plasmodes could also be created from a real data set where the truth is unknown by applying resampling-based methods. The randomization technique can be used to create plasmode data sets where the null hypothesis is true. Null plasmode data sets can be used in at least two ways. First, the type I error rate of the method can be estimated by applying a method to each data set. Any significant results obtained are, by definition, type I errors (false positives). Second, anyone wishing to evaluate the reproducibility (stability) of results produced by one or more methods can apply each method separately to two groups within each data set and examine the extent to which similar results are obtained. Methodologists can evaluate how successful any particular method of analyzing data is in detecting the true effects and not detecting the null effects and then be able to have common, identifiable, benchmark data sets for comparing analytic approaches.

Plasmode data sets with modest or large effects can also be created. A real template data set can be selected that involves two groups in which there appear to be real but small differences in gene expression (i.e., only a modest portion of genes differentially expressed with only moderate difference in expression across the two groups). The mixture-modeling approach used for power projections (1, 24) is used to fit a model to the data; the proportion of differentially expressed genes and the distribution of standardized mean differences in expression for those differentially expressed genes can then be estimated. The estimated effects from the template data set (scaled to the study-specific variances) can be added to a particular group (of the two groups) of the null data sets. This will create a data set with a realistic distribution of null and nonnull effects and, again, preserve the marginal distributions and covariances among gene expression levels. This process can be repeated to create plasmode data sets with different template data sets in which there appear to be moderate-to-large real differences in gene expression. This method of plasmode generation differs from a standard simulation study, in that data are generated in such a manner that the covariance structure among all genes and the marginal residual distributions for all genes are preserved and generated from real data and therefore, by definition, known to be realistic. In contrast, it is recognized (36) that no one knows how to simulate data that have such characteristics.

Methodologists can evaluate how successful any particular method of analyzing data is in detecting the true effects and in not detecting the null effects. The result will be common, identifiable, benchmark data sets against which they can judge the comparative performance of analytic approaches.

## 4: DISCUSSION

In conclusion, the fields of complex trait genetics and HDB have led to new challenges to our ability to offer veridical findings and provide statistical methods that provide information that biologists find to be germane and are epistemologically sound. Statisticians have begun to offer new approaches to this and to exciting developments that have broken us out of our traditional approaches in favor of new epistemological criteria.

## GRANTS

This work was supported by the University of Alabama at Birmingham Statistical Genetics Postdoctoral Training Program Grant No. T32-HL-072757-04 and the Center for Nutrient-Gene Interaction in Cancer Prevention Grant No. 5U54-CA-100949-04 from the National Institutes of Health and grant no. 0217651 from the National Science Foundation Plant Genome Research Program (Design and Analysis of Microarray Gene Expression Studies in Plants: Toward Sound Statistical Procedures).

## Acknowledgments

Present address of S. O. Zakharkin: Solae, LLC, St. Louis, MO 63188.

## Footnotes

↵1 The 2nd International Symposium on Animal Functional Genomics was held May 16–19, 2006, at Michigan State University in East Lansing, MI, and was organized by Jeanne Burton of Michigan State University and Guilherme J. M. Rosa of University of Wisconsin, Madison, WI (see meeting report by Drs. Burton and Rosa,

*Physiol Genomics*28: 1-4, 2006).↵2 It is common that the terms permutation test and randomization test are used interchangeably, and we do so here.

Address for reprint requests and other correspondence: D. B. Allison, Section on Statistical Genetics, Dept. of Biostatistics, and Clinical Nutrition Research Center, Dept. of Nutrition Sciences, Ryals Public Health Bldg., Suite 327, Univ. of Alabama at Birmingham, 1665 University Blvd., Birmingham, AL 35294 (e-mail: Dallison{at}UAB.edu).

Article published online before print. See web site for date of publication (http://physiolgenomics.physiology.org).

- Copyright © 2007 the American Physiological Society