Curve-based multivariate distance matrix regression analysis: application to genetic association analyses involving repeated measures

Rany M. Salem, Daniel T. O'Connor, Nicholas J. Schork

Abstract

Most, if not all, human phenotypes exhibit a temporal, dosage-dependent, or age effect. Despite this fact, it is rare that data are collected over time or in sequence in relevant studies of the determinants of these phenotypes. The costs and organizational sophistication necessary to collect repeated measurements or longitudinal data for a given phenotype are clearly impediments to this, but greater efforts in this area are needed if insights into human phenotypic expression are to be obtained. Appropriate data analysis methods for genetic association studies involving repeated or longitudinal measures are also needed. We consider the use of longitudinal profiles obtained from fitted functions on repeated data collections from a set of individuals whose similarities are contrasted between sets of individuals with different genotypes to test hypotheses about genetic influences on time-dependent phenotype expression. The proposed approach can accommodate uncertainty of the fitted functions, as well as weighting factors across the time points, and is easily extended to a wide variety of complex analysis settings. We showcase the proposed approach with data from a clinical study investigating human blood vessel response to tyramine. We also compare the proposed approach with standard analytic procedures and investigate its robustness and power via simulation studies. The proposed approach is found to be quite flexible and performs either as well or better than traditional statistical methods.

  • similarity analysis
  • longitudinal profiles

the completion of the Human Genome Project (37, 83) and the International HapMap Project (36), as well as recent improvements in high-throughput genotyping technologies (5, 21, 50, 59, 61), has led to a paradigm shift in research studies investigating the genetic determinants of diseases. This shift has been away from focused “candidate” gene and laborious family-based linkage studies to population-based genome-wide association (GWA) studies (31, 85). GWA studies have identified a number of common polymorphisms contributing to many complex traits and diseases (30, 56, 78). However, the common polymorphisms identified to date that are associated with most diseases or traits via GWA studies collectively explain only a small fraction of the phenotypic expression of these diseases and traits (26, 85). This is in part due to the complexity of the diseases themselves (3, 6, 8, 25, 42), but also due in part to the study designs and assumptions made during the conduct of GWA studies (8, 58, 65, 73, 94). In addition, of the polymorphisms identified from GWA studies, only a small fraction admit easy biological interpretation as to the reason why they are associated with a particular disease or trait. The remainder will require sophisticated functional analyses before their contribution to normal physiological and pathophysiological processes will be understood.

One way to improve the yield of true positive genetic associations, as well as biological insights obtained from those associations, is for researchers to consider more elaborate phenotyping protocols in relevant association studies (9, 10, 58). For example, the traditional use of a phenotypic measurement (or group of measurements) at a single point in time in contemporary candidate gene and GWA studies does not do justice to the likely age or time-dependence of the disease process or trait in question. In addition, if interest is in studying the underlying physiological effects of DNA sequence variations in a particular gene on the expression of a phenotype, then studying that phenotype in a number of contexts as a function of genotype categories would be appropriate (92).

Although the collection of longitudinal or repeated measures (i.e., different dosages, different environmental contexts, etc.) on any disease process or trait is likely to be expensive and difficult to organize logistically (58), it is arguable, for the reasons just provided, that ignoring the “dynamic” and/or context-dependent nature of human phenotypic expression will compromise the understanding of the biological effects of DNA sequence variations. To take advantage of longitudinal and repeated measures in genetic association studies, not only will better phenotyping protocols and technologies have to be developed, but appropriate data analysis methods will have to be devised. In this paper, we consider the genetic analysis of dynamic complex traits (DCTs) (92), or quantitative phenotypes measured over time and/or in different contexts for which a “profile” of the phenotype across the time points or contexts can be obtained for each individual in the study. Although many statistical methods have been proposed for longitudinal studies (e.g., prospective cohort designs) and repeated-measures analyses that could be adapted to work in genetic association studies (9, 13, 24, 57, 68, 84), including approaches to model multivariate time course data, e.g., microarray time course data (64, 75). There is limited precedent for working with traditional data analysis methods in genetic association analyses involving DCTs, except in family-based linkage studies of complex traits (24, 33, 51, 52, 76, 91, 92, 94, 95, 99). In addition, many of these methods do not use all the available data and often employ data reduction strategies that further limit the data used in an analysis. For example, some methods in family-based linkage analyses reduce the data collected on a DCT to a single parameter governing a curve fit to the data, such as slopes obtained from a linear regression model fit to each individual's data (24, 33, 52, 91, 92, 95, 99). Recently, a functional mapping approach has been proposed, in which a parametric function is fit to family-based data to provide an assessment of genotype-specific trajectories over time; this overcomes some of these limitations and better utilizes the data (51). This approach has been extended to population-based designs (44, 86) and the use of nonparametric functions to model longitudinal changes (93). This approach assumes a single curve captures genotype effects and therefore does not allow individual subjects to have a unique function model their individual trajectories and, as a result, does not allow direct comparison of individual functions across the individuals in a study.

We introduce a novel nonparametric statistical framework to model and analyze DCTs, which we term curve-based multivariate distance matrix regression (CMDMR), which was motivated by aspects of functional mapping (51, 52, 95), similarity analysis (1, 4, 20, 60, 69), and functional modeling (12, 14, 15, 18, 23) analysis approaches. This framework involves the assessment of the similarity of the longitudinal or repeated-measures profiles of individuals in a sample. To assess similarity in longitudinal or dosage studies, we first fit a function, either parametric or nonparametric, to the DCT data obtained on each person. Note that the data do not have to be collected at uniform time points across the subjects. The fitted functions are used as a surrogate for the “true” function underlying each subject's DCT. A dissimilarity (or “distance”) between subject DCT profiles is then calculated from the curves, and then related to genetic and environmental factors via the multivariate distance matrix regression (MDMR) statistic. Modeling subject-specific-derived curves in this manner avoids the potential loss of information when group average curves are utilized (46, 87). In situations in which data might not be amenable to curve-fitting, such as repeated measures made in qualitatively distinct (e.g., environmental) contexts, the data points for each subject can be used to construct profile similarity, but this would require that all subjects have had measures obtained in each context. MDMR is a versatile and robust statistical analysis method that has been shown to have applicability in a wide variety of research settings involving high-dimensional biological data types, such as multilocus genetic association studies (45, 63), admixture/population genetic analyses (70), GWA (88, 97), gene expression microarray data analysis (1, 60), imaging data, and ecological studies (12, 14, 18, 23).

The proposed procedure is quite flexible and can accommodate weighting schemes in the calculation of the similarity of two individuals' profiles as well as multiple covariates. We showcase the proposed analysis method with repeated-measures data from a study investigating the in vivo physiological effects of genetic variations on blood pressure changes induced by the infusion of increasing dosages of tyramine. We also compare the proposed approach with standard quantitative statistical methods for repeated-measures analysis. In addition, we explore the utility and power of the framework in a variety of settings via simulation studies. The proposed CMDMR analysis method and our application to human in vivo physiological assessment suggest new directions for genetic epidemiologic investigations in areas that combine population science, human physiology and genetics, and quantitative science.

METHODS

CMDMR

We describe the CMDMR approach to genetic association analysis of DCT data by discussing the steps necessary for the analysis: 1) Fitting curves to individual DCT profiles (if appropriate), 2) Computing a distance matrix, and 3) Conducting a test of association via MDMR analysis.

Fitting curves to individual DCTs.

We assume that the ith subject's L measurements have been generated by an unobserved function: yj=f(tj)+εj,j=1,,L, (1) where t denotes the time or dose associated with the jth measurement, f(t) is an unobserved function, and ϵj are random errors assumed to follow a normal distribution with 0 mean and unknown variance, N(0,σ2). All available data from individual i are used to reconstruct the unobserved function (Supplemental Figure S1).1 Fitting curves to the data, as opposed to using the data points themselves, in an analysis offers several advantages by allowing for potential interpolations and extrapolations over missing or sparse data collections, as well as allowing each individual to have different numbers of measures taken at different times/dosages, as discussed below. Once functions are fit to the data (with confidence) these functions are then compared for their similarity, as discussed below. A wide variety of curves can be used for this purpose, including both parametric and nonparametric curves, simple polynomials, smoothing splines, regression splines, or kernel smoothers (18, 82). For simplicity and computational efficiency, we utilized polynomial splines (87, 97) (Fig. 1 and Supplemental Fig. S1).

Fig. 1.

Response profiles for tyramine hand vein study. Individual dorsal hand vein percent distension (y-axis) by dose (pmol·kg−1·min−1) of infused tyramine (x-axis) plots for n = 49 hand vein study subjects.

As noted it is possible to conduct the analysis on the observed phenotypic measures themselves and not use fitted curves. Consider a simple example, wherein each subject has complete phenotypic data Yij collected for a set of predefined measurement points, Tij; Let subject i = 1,…, N have phenotypic measures Yi = (Yi1,…, YiL), where j = 1,… L, and time points or dosages Ti = (Ti1,…, TiL), which are the same for all subjects. In this scenario, all subjects have identical measurement number, frequency, and contexts of measurement. An N × N similarity/dissimilarity matrix can be calculated directly using the complete set of DCT measurements for each pair of subjects via, e.g., the Euclidean distance metric. Note that this scenario is likely to be exceptional, as many subjects are likely to have missing data or have been measured at different times for different total numbers of observations. Construction of the distance matrix using only the complete measures, Yi = (Yi1,…, YiL), would be problematic in these realistic situations (39, 49, 96) as it would require nontrivial assumptions about patterns in missing data (19, 32, 35, 39, 79, 81).

Several strategies have been proposed to address situations in which individuals have had data collected at different numbers of data points or at different time intervals (19, 32, 53, 54, 62, 79), though all suffer from limitations (32, 53, 62), and none are widely accepted (39, 53, 54, 62, 71, 72, 77, 81, 96). In brief, one approach would either restrict construction of the distance matrix to the set of individuals with complete data (though this may result in a substantially reduced sample set) or construct the distance matrix using available data for each pair of subjects. The latter approach may be most suitable when the sample L is large and the amount of missing data is small, although the “critical level” of missing data (<1, 5, 10%, etc.) that can be tolerated requires additional study. In addition, the use of incomplete data may be problematic if study subjects are missing key measurements (89).

Choosing a distance measure and constructing a distance matrix.

Once one decides if it makes sense to use fitted curves or the observations themselves is made, a measure of dissimilarity of the fitted curve values or observed values for each pair of subjects can be computed. There are many distance measures that can be used for this purpose (16). For example, the Euclidean, Manhattan, or Mahalanobis distance measures could be used. The choice of a distance measure is important and could impact the results of the CMDMR approach. Once a measure is chosen, it can be used to construct an N × N distance matrix. It should be noted that although many statistical analysis procedures that leverage distance measures require that the distance measure have metric properties, the CMDMR approach does not have this requirement (28). Let this distance matrix and its elements be denoted by D = di,j (i,j = 1,…, N), for the N subjects. The possibility that NL does not pose problems for the CMDMR approach. In the studies described below, we considered the use of 36 different distance measures, listed in Supplemental Table S1, to explore the relative merits of each.

MDMR.

In addition to the set of phenotype values Yi = (Yi1,…, YiL) collected on each subject, it is assumed that an additional set of M predictor or ancillary variables has been measured on the N subjects as well. The relationship between these M predictor variables and the observed patterns over the individual DCT profiles Yi = (Yi1,…,YiL) is the focus of CMDMR. The predictor variables may include information on genetic polymorphisms [single nucleotide polymorphisms (SNPs)] and phenotypic and environmental variables (e.g., coded using dummy variables, such as 0 assigned to individuals not possessing a certain factor and 1 assigned to individuals possessing a certain factor) in addition to additional covariates such as, sex, smoking status, etc.

Let X be an N × M matrix of predictor variables and covariate that will be modeled as predictor or regressor variables whose relationships to the values in the DCT distance matrix are of interest. Compute the standard projection matrix, H = X(X°X)−1 X°, typically used to estimate coefficients relating the predictor variables to outcome variables in multiple-regression contexts. Next, compute the matrix A=(aij)=(12dij2) and center this matrix with use of the transformation discussed by Gower (60) and denote this matrix G derived as: G=(I1n11)A(I1n11)

An F statistic can be constructed to test the hypothesis that the M regressor variables are associated with variation in the distances between the DCT profiles for pairs of individuals reflected in the distance/dissimilarity matrix (60): F=tr(HGH)tr[(IH)G(IH)G(IH)] (2)

If the Euclidean distance is used to construct the distance matrix on a single quantitative variable (i.e., as in an univariate analysis of that variable) and approximate numerator and denominator degrees of freedom are incorporated into the construction of the test statistic, then the F statistic in Eq. 2 is equivalent to the standard ANOVA F statistic (1, 2, 16, 27, 38, 55, 60). The distributional properties of the F statistic are complicated for alternative distance measures computed on the basis of more than one variable, especially if those variables are discrete, as in genotype data (1, 60). However, permutation tests can then be used to assess statistical significance of the pseudo-F statistic (1, 2, 16, 27, 38, 55, 60). A detailed discussion of the MDMR pseudo-F statistic properties is in preparation. The M regressor variables can be tested individually, together or in a stepwise manner. Finally, the percentage of variation in similarity within the distance matrix explained by the regressor variables, r2, can be calculated as follows: r2=tr(HGH)tr(G)

We have used permutation tests to determine significance of the relationship between profiles and SNPs via CMDMR. A minimum of 10,000 permutations were performed for this purpose. The MDMR procedure discussed here has been implemented in R and is available from the authors upon request.

Traditional Statistical Methods

Many analysis schemes can be used to relate variation in DCT profiles to a single factor or set of factors. For example, simple linear regression models, repeated-measures ANOVA, and general linear mixed models, MANOVA, random effects models, and marginal models [i.e., generalized estimating equation (GEE)-based models] could be used. We compare these models with the proposed CMDMR approach with the tyramine infusion data described below. It should be noted that some of these methods are less than optimal approaches for analyzing repeated measures but are included for comparison purposes. The basic structure, key assumptions, generalizability, extensions, and limitations of each approach are briefly considered in the discussion. Additional information, comprehensive overviews, and discussions of these analyses methods are available (22). We also compare the proposed CMDMR approach with a linear mixed model approach via simulation studies.

Study Populations and Simulation Studies

Data from the tyramine hand vein study (22), a multiethnic clinical investigation involving 49 research subjects, are used to showcase the proposed genetic analysis framework. Each subject received increasing doses of tyramine (8 doses: 4.49, 8.98, 22.4, 44.9, 89.8, 180, 449, and 898 pmol·kg−1·min−1) injected into a dorsal hand vein. Hand vein distension was measured at 9 times: baseline + 8 doses. The goal of the study was to assess baseline and genetic influences on response to the tyramine infusion. This is an ideal study design to assess analysis methods for simple DCTs, since the data are highly structured, and complete data exists for all subjects with uniform measurements and spacing of measurements. We could thus simulate different settings with these data by creating missing values. The response profiles for all 49 subjects are presented in Fig. 1, illustrating the complexity and challenge of such data. Genotype information for 23 SNPs across 14 candidate loci was obtained on all subjects.

Simulated data based on the tyramine hand vein study were used to assess the power, limitations, and comparison of the proposed CMDMR framework under a variety of scenarios. We contrasted the performance and power of the CMDMR to a linear mixed model analysis approach, including analysis of the both main effects of each SNP as well as SNP × time interaction effects jointly. Simulated data were generated for three hypothetical dose-response profiles: linear, nonlinear, and complex (Fig. 6, A–C, See supplemental appendix formulas used to define the functional form for each simulation scenario). Simulated curves for each scenario were developed and assigned to populations of 25, 50, 75, 100, 125, and 150 subjects based on SNP allele frequencies of 0.30 for the minor allele under Hardy Weinberg equilibrium. The simulated curves were generated with an error term to assigned points to model measurement error. A total of 1,000 simulated data sets were generated for each scenario and sample size for power calculations. Power was calculated by counting the number of times the test statistic achieved a P value <0.05 divided by the total number of simulations. Permutation tests were performed to determine appropriate P values for MDMR analyses (n = 1,000). Tyramine hand vein SNP analysis results differed slightly from the previously reported (34, 80); since analyses were performed combining all ethnic groups with a covariate for ethnicity included in the analysis, the genotype model assumptions (i.e., additive, previously dominant) were more comprehensive and covariates for adjustment (age, sex, and ethnicity). Power analysis was performed on the true curve values, sampled points, and simulated curve values using a Euclidean distance to construct the distance matrix.

RESULTS

Comparisons Standard Statistical Models: Tyramine Data

Ten traditional statistical analyses approaches were considered: Simple linear regression, repeated measures (RM) ANOVA/general linear models, MANOVA, profile analysis, two-stage analysis, random effects model, linear mixed models [three specifications: subject specific (SS), population average (PA), and both interpretations (SS + PA), and marginal models (GEE)] were applied to the tyramine infusion data. The results are reported in Table 1 and Fig. 2. Table 1 presents P values derived from tests of the hypothesis that the SNPs influence tyramine response for a representative subset of SNP. Figure 2 presents a graphical comparison of the 10 traditional statistical models for all n = 22 SNPs in the form of a tree diagram or dendrogram representation (41). This dendrogram was constructed by taking the P values associated with each SNP effect from each method and using them to construct a Euclidean distance-based distance matrix. The tree is constructed from the distance matrix such that methods (based on SNP P values) with greater similarity are closer to each other (i.e., adjacent branches of the tree). Less similar methods, in terms of SNP P values, are represented as branches further away from each other. Note that we used the module “neighbor” in the program PHYLIP v. 3.64 to construct the neighbor-joining tree displayed in Fig. 2. This graphical representation provides a measure of determining how similar the results of each analysis type were over all the SNP loci associations considered in the study.

View this table:
Table 1.

Comparison of standard statistical models

Fig. 2.

Tree diagram of standard statistical models. Dendrogram of 10 standard statistical models is constructed such that methods (based on n = 25 SNP P values) with greater similarity are closer to each other (i.e., adjacent branches of the tree) and less similar methods, are further away from each other (i.e., distant branches). Branches for interaction terms are highlighted in green for models with interaction terms. See methods for details on statistical methods reported. LMM, linear mixed model; R, repeated; RC, random coefficient; RM, repeated measures; PA, population average; SS, subject specific.

The simple regression model tended to overinflate the significance of the SNP effects, which is expected since the correlation between measurement points in this analysis method is ignored with this procedure. The simple regression model was most similar to the RM-ANOVA model. Results from RM-ANOVA and MANOVA differed from each other as well as the other models. This difference may be attributable to the different covariance structure assumption in each of these analysis procedures: the restrictive compound symmetry structure for RM-ANOVA was assumed, whereas an unstructured covariance matrix was assumed for MANOVA analysis. The MANOVA model showed the greatest dissimilarity with the other models, essentially occupying its own unique branch in the tree. The random effects model, the three different mixed models (SS, PA & SS + PA), and the GEE models showed agreement for both the SNP effects and the SNP × dose interaction terms. Surprisingly, the two-stage model (performing linear regression on slope estimates from individual linear estimates) yielded results that showed some consistency with interaction terms for the random effects, mixed, and GEE models (Fig. 2).

Comparisons CMDMR Curves vs. Points: Tyramine Data

Analysis of the tyramine data was performed via the CMDMR framework for the 36 different distance measures. We consider both the analysis of the DCT profiles based on the 9 data points collected at baseline and each of the 8 dosages alone as well as data generated from the polynomial curves fit to these data points. Results from both approaches are presented in Table 2 and Fig. 3.

View this table:
Table 2.

Comparison of CMDMR analysis

Fig. 3.

Tree diagram of multivariate distance matrix regression (MDMR) results. Dendrogram of MDMR analysis results for curves (blue-colored branches, n = 36) and points (red-colored branches, n = 13) is constructed such that distance measures (based on n = 25 SNP P values) with greater similarity are closer to each other (i.e., adjacent branches of the tree), and less similar methods, branches are further away from each other. See methods for details on distance measures reported.

The P value results showed a pattern of clustering for similar distance measures and data source (i.e., the use of fitted curves vs. the data points alone). The exceptions were the analyses based on the correlation distance measures (i.e., correlation, Pearson, Kendal and Spearman), which are found to cluster near each other (Fig. 3) but away from the other distance measures. However, the analysis of these distance measures to the fitted curve values and the observed data points alone types did not yield equivalent results, suggesting that the curve-fitting process had an effect on the analyses. In general, the results show greater similarity between comparable distance measures within, rather than across, analyses involving curves or the observed phenotype values. The analyses with the phenotype values revealed few if any significant associations (Table 2).

In contrast, the curve-based analyses produced results that were more similar to the traditional analyses, in addition to revealing evidence for association of the tyramine BP profiles and an SNP in the trace amine receptor-3 (TAR3), SNP rs2842899, as it showed evidence of association for 18 of the distance measures and near significance (P value <0.10) in an additional six measures (Table 2). The fitted values from the linear mixed model resembled the average spline curve for the TAR3 SNP (Fig. 4). In contrast, the previously reported chromogranin B (CHGB) SNP rs236153 was not significant with CMDMR curve analysis, suggesting potential limitations in the CMDMR approach with noisy curves (Supplemental Fig. S2). Finally, the linear mixed model and spline curve fits (individual and population average) are plotted for the nonsignificant tyrosine hydroxylase (TH) SNP rs10770141 (Supplemental Fig. S3). Both the linear mixed model and average spline fit suggest no association by genotype is present. However, it remains an open question as to the most reliable way to analyze the data, as the results from the traditional statistical models may not be valid due to incorrect assumptions (i.e., no residual correlations among the phenotype values collected on each subject and linearity model assumption).

Fig. 4.

Linear mixed model vs. spline fits by TAR3 rs2842899 SNP. Dorsal hand vein percent distension (y-axis) vs. dose (pmol·kg−1·min−1) of infused tyramine (x-axis) by TAR3 rs2842899 (Lys61stop) genotype are plotted for subject-specific linear mixed model fits (A), individual spline fits (B), and average spline fit (C).

The 49 distance matrices (n = 36 curve-based matrices and n = 13 observed data points only-based matrices) used in the CMDMR analyses of the tyramine data were compared using Procrustes analyses (40) and presented as a dendrogram in Fig. 5A. In addition, the pattern of correlation between the Euclidean distance matrix form observed phenotype values (Fig. 5B), and the fitted curves (Fig. 5C) was assessed via heat maps (43). The heat maps clearly illustrates that the distance matrices have different patterns of correlation, with stronger correlation (darker colors) for the analysis involving the fitted curves. A formal comparison of the distance matrices using the Mantel test (47, 48, 90) indicated that the distance matrices had a low but statistically significant correlation (r = 0.1288; P value = 0.00069).

Fig. 5.

Comparison of distance matrices. A: dendrogram of curves (blue-colored branches, n = 36)- and points (red-colored branches, n = 13)-derived distance matrices compared using Procrustes analyses. Heat-map illustration of pair-wise similarity between Euclidean points (B)- and curves (C)-based distance matrix. Distance matrices normalized (comparable scales); darker colors indicate greater similarity, and lighter colors less similarity between subjects.

Comparison CMDMR vs. Linear Mixed Model (SS): Simulation

Simulated dose-response profiles were generated to explore and contrast the power of CMDMR and linear mixed model approaches to the analysis DCT profiles. The performance of the CMDMR was compared against a linear mixed model with subject specific specification under three scenarios (Fig. 6, A–C). Euclidean distance was used to calculate the distance matrix CMDMR analyses for these simulations. The first simulation modeled a linear dose-response profile that differed by genotype (Fig. 6A). CMDMR analysis of the observed (simulated) phenotypes gave a near identical power profile as the main effect parameter from the subject-specific linear mixed model (Fig. 6D). CMDMR of analysis of the true (unobserved) and fitted curves were also nearly identical, suggesting that, for a linear response profile, CMDMR on a set of points gives similar power as does a linear mixed model main effect. The CMDMR analyses of the curves (both true and fitted curves) may add noise to analyses in this simple situation.

Fig. 6.

Simulated curves and power plots. Individual specific curves for the 3 simulated dose-response profiles scenarios are presented in A–C for a sample of 50 individuals with minor allele frequency of 30%. Individual curves are shaded by genotype: homozygous wild type (black), heterozygous (gray), and homozygous variant (blue). See supplemental appendix for curve parameters and details. Results of the power analyses are presented in D–F for LMM results: main effect (gray solid line), LMM SNP × time effect (gray dashed line), and curves-MDMR: unobserved true curve (black dashed line), fitted curve (black line) and points (black dot-line).

In scenario two, we modeled a nonlinear dose-response profile (Fig. 6B). CMDMR analyses involving the underlying true curves used to generate each hypothetical individual's data gave the best power profile. The power profiles for the fitted curve CMDMR, the CMDMR analysis of the observed phenotype values, and the linear mixed model main effect term produced very similar results (Fig. 6E). This may suggest that all three methods were able to capture the overall differences by genotype group.

In the third scenario, we consider a complex scenario, in which the dose-response profile was both nonlinear and had an SNP × time interaction effect (Fig. 6C). Analysis of such challenging data is quite difficult, and the linear mixed model performs poorly: both main effect and interaction terms had low power to detect a difference. In contrast, the CMDMR analysis of the observed phenotype values and the fitted curves performs significantly better than the linear mixed model. CMDMR analyses of the true curve and fitted curve gave the highest power and were essentially identical.

Finally, we explored the power of 13 different distance measures in the three scenarios discussed above for the CMDMR analysis. The Euclidean distance measure performed very well in all three scenarios. Comparable distance measures showed similar power profiles across the scenarios. The correlation-based distance measures performed poorly in the first two scenarios. Power was improved for the correlation-based measures in scenario three, although was still lower than the other distance measures (results not shown). The influence of minor allele frequency, varying from 1% to 30%, was also explored, in all three scenarios, increasing MAF improved power for all statistical tests (results not shown).

DISCUSSION

Genetic researchers interested in studying DCTs face many challenges. These challenges center around the conduct of analyses that consider the full range of phenomena that may be worth considering and extracting from the phenotypic data, including capturing trends over time, accounting for correlation between measurements, accounting for the structure and frequency of measurements, missing data, adjusting for covariates and time-changing covariates, and interactions (such as gene × gene and gene × environment interactions).

Our analyses of the tyramine hand vein data revealed the limits of existing traditional analysis methods for the assessment of associations between genetic variations and DCTs. Results differed for tyramine data by type of statistical model, particularly between RM-ANOVA, MANOVA, and the more advanced models (random effects, linear mixed models, and GEE). These models, however, gave similar, though not identical, results. Fitting the appropriate linear mixed models is itself problematic since specifying the wrong covariance structure could lead to unreliable results (4, 20, 69). The simulations show that the linear mixed model (subject specific) performed well in certain situations, though performance declined when complexity in the longitudinal profile increased (e.g., nonlinear, nonmonotonic profiles). In contrast, the CMDMR-based analyses performed well in all three simulations.

The idea of calculating similarity between curves has been previously suggested (67). The challenge of ascertaining a valid and accurate measure of similarity or distance between a pair of curves is an important issue, as our simulation studies demonstrate. In general the comparison of curves or functions presents several challenges, including accounting for time trends and effects, correlations between measurements, high dimensionality of data, (L>>N), and curve alignment (20). Our finding that the use of individual observed phenotypic points can sometimes outperform analysis based on fitted curves in the CMDMR approach in certain scenarios is consistent with previously reported observations that tests based on large numbers of time points obtained from fitted curves may not provide advantages over the use of the data points themselves (13). Moreover, this suggests that poor curve fitting may introduce an additional estimation error into the model. Error due to function estimation is not unique to the CMDMR framework, however, and is an issue for all methods that fit functions to individual DCTs: if the form of the curve is inappropriate for the data, the inferences drawn for those fitted curves are likely to be erroneous. The CMDMR approach can be extended to accommodate the uncertainty of the curve fit using weighting schemes, in addition to improvements in curve fitting methods.

The proposed CMDMR framework has many features that make it a good choice in analysis of DCTs: it can accommodate a wide range of longitudinal data structures, it is intuitively appealing, it handles situations in which there is a small overall sample size but many phenotypic values collected over time or across different doses, it models subject-specific profiles, it provides a multiple regression-like analysis interpretation setting, it has broad applicability and flexibility, and the underlying models can be improved through the use of different fitted curves and distance measures. In addition, data can be visualized using graphical tools to help in interpretation of findings, via heat maps or dendrograms. Finally, the CMDMR framework handles correlated data and missing observations without imposing a model on to the data and uses all values to model the data. This may have particular utility for analysis of complicated multivariate time-course data, such as time-course microarray data modeling gene expression profile across time or different conditions.

Traditional methods have many useful features but lack some of the advantages that CMDMR has. Both RM-ANOVA and MANOVA are extensions of ANOVA. Given their assumptions and limitations (e.g., no correlations between data points on the same subject), other techniques might be preferable (i.e., linear mixed model) (7, 11, 13, 17, 66, 84). General linear mixed models, which we refer to as “mixed models” (also known by various names such as multilevel models, linear mixed-effects models, random-effects models, random-coefficient models, hierarchical linear models, etc.), represent a robust set of statistical methods that are particularly well-suited for analyzing continuous correlated outcomes. Mixed models can be used for simultaneously modeling both fixed effects and random effects (subject-specific) of continuous outcome measures. For fixed effects, either the investigator has to choose levels specifically, or the levels must exhaust all possibilities. For the random effects, the sets or levels of the factors that have been included in study are viewed as a random sample from a very large or infinite population of possible levels. The flexibility of mixed models allows them to serve as a powerful statistical tool. Mixed models can accommodate common time points or time points that vary between subjects (irregularly spaced observations) and can accommodate both time-dependent covariates and static covariates. Additionally, mixed models offer the flexibility to fit a variety of correlation structures to be fitted (7, 13, 17, 29, 74, 84).

Marginal models can be used when the focus of the study is to investigate the effects of covariates on the population mean profile. GEEs are an approach that specifies only the marginal distribution of correlated outcome variables. GEE is an estimation technique that estimates a common scale parameter and a working correlation matrix of the outcome variables, treating them as nuisance parameters. In specifying only the marginal distribution of the outcome variable, GEE models will produce estimates of population parameters only. Therefore, GEE cannot be easily used in settings where subject-specific estimation and hypothesis testing are required. An advantage of GEEs compared with mixed models is that a covariance structure does not need to be specified but is rather estimated from the data. The primary advantage of this method is it does not make strict distributional assumptions like normality. If the number of subjects is sufficiently large, then results are relatively robust to underlying assumption; i.e., results remain approximately valid even when assumptions are violated.

The MDMR analysis procedure was implemented in R and is available upon request from the study authors (R. M. Salem, N. J. Schork). These results shed light on not only the most appropriate approach to the analysis of DCTs but also help promote their use in the study of hypertension and other common, complex diseases.

GRANTS

R. M. Salem was supported by National Institutes of Health (NIH) Training Grant T32-DA-007315 and UCSD Genetics Training Program Training Grant T32-GM-08666. N. J. Schork and his laboratory are supported in part by the following research grants: The National Institute on Aging Longevity Consortium (Grant U19 AG-023122-01); the National Institute of Mental Health-funded Genetic Association Information Network Study of Bipolar Disorder (Grant 1 R01 MH-078151-01A1); NIH Grants N01 MH-22005, U01 DA-024417-01, P50 MH-081755-01. R. M. Salem and N. J. Schork are supported in part by Scripps Genomic Medicine and the Scripps Translational Sciences Institute Clinical Translational Science Award (Grant UL1 RR-025774).

DISCLOSURES

No conflicts of interest, financial or otherwise, are declared by the author(s).

Footnotes

  • 1 The online version of this article contains supplemental material.

REFERENCES

  1. 1.
  2. 2.
  3. 3.
  4. 4.
  5. 5.
  6. 6.
  7. 7.
  8. 8.
  9. 9.
  10. 10.
  11. 11.
  12. 12.
  13. 13.
  14. 14.
  15. 15.
  16. 16.
  17. 17.
  18. 18.
  19. 19.
  20. 20.
  21. 21.
  22. 22.
  23. 23.
  24. 24.
  25. 25.
  26. 26.
  27. 27.
  28. 28.
  29. 29.
  30. 30.
  31. 31.
  32. 32.
  33. 33.
  34. 34.
  35. 35.
  36. 36.
  37. 37.
  38. 38.
  39. 39.
  40. 40.
  41. 41.
  42. 42.
  43. 43.
  44. 44.
  45. 45.
  46. 46.
  47. 47.
  48. 48.
  49. 49.
  50. 50.
  51. 51.
  52. 52.
  53. 53.
  54. 54.
  55. 55.
  56. 56.
  57. 57.
  58. 58.
  59. 59.
  60. 60.
  61. 61.
  62. 62.
  63. 63.
  64. 64.
  65. 65.
  66. 66.
  67. 67.
  68. 68.
  69. 69.
  70. 70.
  71. 71.
  72. 72.
  73. 73.
  74. 74.
  75. 75.
  76. 76.
  77. 77.
  78. 78.
  79. 79.
  80. 80.
  81. 81.
  82. 82.
  83. 83.
  84. 84.
  85. 85.
  86. 86.
  87. 87.
  88. 88.
  89. 89.
  90. 90.
  91. 91.
  92. 92.
  93. 93.
  94. 94.
  95. 95.
  96. 96.
  97. 97.
  98. 99.
View Abstract