We present an integrated approach for the enriched detection of genes subject to cis-acting variation in the mouse genome. Gene expression profiling was performed with lung tissue from a panel of recombinant congenic strains (RCS) derived from A/J and C57BL/6J inbred mouse strains. A multiple-regression model measuring the association between gene expression level, donor strain of origin (DSO), and predominant strain background identified over 1,500 genes (P < 0.05) whose expression profiles differed according to the DSO. This model also identified over 1,200 genes whose expression showed dependence on background (P < 0.05), indicating the influence of background genetic context on transcription levels. Sequences obtained from 1-kb segments of 3′-untranslated regions identified single nucleotide polymorphisms in 64% of genes whose expression levels correlated with DSO status, compared with 29% of genes that displayed no association (P < 0.01, Fisher exact test). Allelic imbalance was identified in 50% of genes positive for expression-DSO association, compared with 22% of negative genes (P < 0.05, Fisher exact test). Together, these results demonstrate the utility of RCS mice for identifying the roles of proximal genetic determinants and background genetic context in determining gene expression levels. We propose the use of this integrated experimental approach in multiple tissues from this and other RCS panels as a means for genome-wide cataloging of genetic regulatory mechanisms in laboratory strains of mice.
- allelic imbalance
- gene networks
- cis-acting regulatory variants
the application of gene expression profiling to model systems with controlled genetic variation provides new avenues for exploring the genetic basis of complex phenotypes and gene regulatory mechanisms (5, 11). Because regulatory variants are known to affect phenotypic end points and are believed to account for a large proportion of changes contributing to the evolution of complex traits (6, 26), this integrated approach brings the resolution of genetic mapping studies closer to the level of biochemical mechanisms and provides a system for linking gene expression changes to specific genetic variants. In higher mammals, gene expression profiles have been shown to be highly heritable (9, 34), mirroring the evolutionary relationships between species (16) as well as displaying variation between subpopulations of species (40).
Among the major goals shared by genetic mapping strategies and gene expression profiling is the determination of gene regulation mechanisms believed to underlie complex traits. Recent studies have demonstrated the utility of mapping gene expression traits to identify loci with relevance to physiological phenotypes (21, 49). Studies demonstrating the heritability of gene expression, mapping expression traits (4) as well as demonstrating the impact of genetic regulatory variants in human disease (34, 35, 49), further underscore the importance of characterizing gene regulation on a genome-wide scale.
The integration of gene expression profiling and genetic approaches offers the possibility of identifying regulatory variants affecting transcription on a genome-wide scale. Recent studies mapping expression traits aim to classify genes subject to cis- or trans-acting regulation (7, 8). Trans-acting regulatory variation involving variants located at a distance from the affected locus is believed to account for the majority of significant differences in gene expression and may include the spectrum of protein-DNA and protein-protein interactions known to influence transcription. In contrast, cis-acting regulatory variation, located on the same chromosome as the regulated gene, is estimated to affect at least 30% of genes differentially expressed between individuals (6) and can be detected by measuring allele-specific transcription ratios in heterozygous individuals with intragenic single nucleotide polymorphisms (SNPs) (59). When levels of the two transcripts are compared under identical cellular contexts in vivo, deviation from the expected 50-to-50 ratio of alleles is known as allelic imbalance (AI) and can indicate the presence of cis-acting regulatory variants (42).
Here we present an approach that applies expression profiling and AI assays to a panel of recombinant congenic strains (RCS) to detect genes subject to cis-acting variation (Fig. 1). RCS are derived by backcrossing and inbreeding two mouse strains, generating a panel of strains that are homozygous at every locus and that contain variable congenic segments from the genome of the donor strain (averaging 12%) on that of the background strain (14). A RCS panel of sufficient size (>20 strains) ensures that the majority (>90%) of genes are contained on donor congenic segments. RCS panels were initially designed as an experimental system for separating loci involved in multigenic traits, permitting each locus to be analyzed separately (52). The increase in mapping efficiency offered by RCS has led to identification of candidate genes (17) as well as multilocus interaction effects (55). We sought to determine whether RCS could be used to characterize the influence of genetic variation on gene regulation by studying a RCS panel derived from the reciprocal backcrossing of A/J and C57BL/6J parental inbred strains (18). These two strains have been widely studied as models for complex phenotypes including airway hyperresponsiveness (13), acute lung injury (43, 44), and lung cancer (2). RCS provide a particular advantage in detecting regulatory relationships as they enable one to assess the simultaneous influence of background genetic context on the association between expression phenotypes and cis-acting genetic variants (36). We obtained gene expression profiles for lung tissue obtained from the RCS panel to measure the association between transcript levels and donor strain of origin (DSO), taking into account the contribution of the predominant background strain. To validate these observations, we sequenced genes identified by expression analysis to characterize their SNP content and measured allelic imbalance in F1 offspring of an A/J × C57BL/6J cross.
MATERIALS AND METHODS
Tissues used in these studies were obtained from a RCS panel derived from A/J and C57BL/6J parents (18). The mice used in this study were part of a 600-animal experiment that took place over a 4-mo period. The mice used in the F1 validation study were reared in the same animal facility (with the same diet and housing conditions). All animals were killed at 16 wk of age, and tissues were harvested according to a standard dissection protocol. All mice were handled according to guidelines and regulations of the Canadian Council on Animal Care under protocols approved by the Animal Care Committee of the McGill University Health Centre Research Institute. Genotyping was previously performed with 621 simple sequence-length polymorphisms (SSLPs), providing an average distance between markers of 2.6 cM. Of the 26,746 genotypes used to characterize the RCS panel, 25,571 results agreed between replicates, 520 results disagreed between replicates (where one result was definite and the other undefined), 19 results were contradictory between replicates (showing results from opposing parental origins for the same locus), and 33 results were undetermined (18). When the replicate genotypes disagreed or were contradictory, the corresponding loci were classified as “undetermined.” Because undetermined loci were dispersed throughout the data set (across all markers and strains), their overall effect on the analysis was minimal. The RCS strain assignment of all mice used in this study was verified with genotypes obtained for a minimal set of the original markers.
The mice were housed with free access to food and water under conditions of 12 h of darkness and 12 h of light. After an overnight fast, the animals were killed and the lung tissues were rapidly dissected and frozen in liquid nitrogen. Tissue samples were obtained from two adult male mice for each strain in the RCS panel. The samples were homogenized in TRIzol reagent, and RNA was prepared according to the manufacturer's instructions. cRNA probes prepared from 10 μg of total RNA were hybridized to Affymetrix MGU74Av2 oligonucleotide arrays as described previously (39). Expression data were summarized with robust multiple average and quantile normalized with Bioconductor (http://www.bioconductor.org).
Probe target sequences from the Affymetrix MGU74Av2 oligonucleotide array were aligned against the February 2003 build of the NCBI mouse genome, using BLAT with default parameters (25). Of the 12,422 probe sets on the chip, 11,293 were localized uniquely. Physical positions of SSLPs from the RCS genotype data were retrieved by aligning marker sequences to the same assembly, using BLAST to identify all primer matches with intervening sequences of <500 bp (wordsize = 12, p = 0.90, e = 0.1) and iteratively relaxing matching parameters in the event of no match. SSLPs that were assigned to more than one genomic location were not used in subsequent analyses. This procedure matched 566 of 621 (90%) markers used to genotype the strains. We inferred DSO for each gene based on the DSO of surrounding SSLPs; genes flanked by markers of identical DSO were assigned the same DSO. If flanking markers differed, which would indicate a recombination event, the intervening genes were assigned an unknown DSO status because of the uncertainty as to the position of the recombination site. Genes at the ends of chromosomes were assigned the DSO of the closest SSLP. As expected, the rate of unknown DSO assignment correlates with the number of recombined segments, indicating that higher frequencies of recombination decreased the amount of data included in our analysis. This still permitted assignment of DSO to >90% of the probe sets on the array over all strains.
ANOVA was conducted per gene assuming independence of loci, using the linear model expression ∼DSO + BG + DSO*BG, where background (BG) was calculated for each strain as the ratio of total donor segment lengths over the entire genome with DSO from one parent over the other. Positive association between expression and DSO was assigned if P < 0.05 for DSO and P > 0.05 for BG. All analyses were conducted with R (http://www.r-project.org).
For SNP discovery, we randomly selected 50 genes that were positive for DSO association and 80 genes that showed no DSO association. The random selection was performed with a subset of transcripts whose mean expression over all strains exceeded 500 MAS 5.0 units. This restriction was imposed to ensure that the selected loci could be properly studied in our validation experiments. In addition, we excluded genes with documented alternative splicing in the Alternative Splicing Database (Ref. 53; http://www.ebi.ac.uk/asd/), as well as complex loci that had overlapping transcripts or reversed overlapping transcripts listed in the University of California-Santa Cruz Genome Browser database (23). For each gene, we sequenced 1 kb of 3′ untranslated region (UTR) in A/J and C57BL/6J genomic DNA. Primers for resequencing were designed with Primer3.0 (48) set at default parameters.
AI was measured with tissue obtained from five adult male mice obtained from a F1 A/J × C57BL/6J cross. Lung tissue was harvested at 13 wk, and RNA and genomic DNA (gDNA) were extracted with TRIzol reagent. Sequence reactions were performed with 20 ng of cDNA prepared with reverse transcriptase or 20 ng of gDNA for all SNP-containing genes with methods previously described (42). SNP peak heights in the cDNA and gDNA were compared with PeakPicker (http://www.genome.mcgill.ca/BingGe/PeakPicker/), which normalizes SNP peak heights against those of the surrounding sequence. This method can detect differences in allelic expression >1.2-fold (20). In addition, we independently tested the sensitivity of the sequence-based method to detect AI equally in genes from positive and negative sets, both with and without AI, by analyzing serial dilutions of F1 gDNA with gDNA from either parental strain. Dilutions were prepared corresponding to ratios of 55:45, 65:35 and 85:15 for each allele. Genes selected at random for this analysis included those from all possible combinations of DSO-expression association and AI results. Although this analysis assumes similarity between the behavior of the sequencing assay in cDNA and gDNA, serial dilutions confirmed sensitivity of the technique to allelic ratios of at least 1.2-fold for the majority of genes tested. This threshold represents a lower limit because dilutions of <55:45 (representing an allele ratio of 1.2) were not tested. This analysis therefore remains unable to assess the extent of false-negative results. AI was determined with a paired Student's t-test comparing peak height ratios of the two alleles in gDNA vs. cDNA for five replicate F1 mice. Genes were said to display AI if a two-tailed Student's t-test comparing peak ratios exceeded significance of P < 0.05 for at least one SNP found within 1 kb of 3′ UTR. Association of DSO expression analysis results with the SNP frequency or the AI frequency was done in R with a one-tailed Fisher exact test and logistic regression models: AI ∼ PVAL, and SNP ∼ PVAL.
We studied 30 strains from a RCS panel derived from the reciprocal backcrossing of A/J and C57BL/6J inbred strains (18). We used genotype data previously obtained for 621 SSLPs to infer DSO, defined as the parental strain from which a congenic segment derives, for the probe sets on the MGU74Av2 oligonucleotide array. We assigned “undetermined” status to 6.8% of the probe sets, with the total number of probe sets with undetermined DSO status ranging from 220 (BcA82) to 1,503 (BcA83) and with a maximum of 366 probe sets on a single chromosome with undetermined DSO (BcA74, on chromosome 8). The largest stretch of genes affected was on chromosome 10 for BcA70, where 46% of genes were assigned an undetermined DSO status. Over the entire RCS panel, between 2% (BcA82) and 13% (BcA83) of the probe sets had an undetermined DSO. Although the RCS panel strains contained between 26 and 56 congenic segments (average 43.7), there was no significant difference between strains with a predominant A/J (4–38 segments, average 22) or C57BL/6J (7–35 segments, average 22) background. The maximum number of segments per chromosome was 5 (for BcA83 on chromosome 6 and AcB56 on chromosome 11).
Expression microarray studies were performed with lung tissue obtained from pairs of 4-mo-old male mice from each of 12 AcB and 18 BcA RCS as well as the parental strains. We previously observed (Lee PD, Sladek R, Greenwood CM, Ge B, Skamene E, and Hudson TJ, unpublished observation) baseline gene expression differences between adult males of the A/J and C57BL/6J parental strains in multiple tissues and demonstrated widespread tissue-specific expression variability between these strains in four tissues. In light of these results, our present study used an ANOVA model that measured association between gene expression, DSO, and the predominant background strain. Variability between individual replicate mice was factored into the error term for the ANOVA model. By explicitly considering the effect of background genetic context on gene expression levels, this analysis strategy provides between 38 and 62 effective replicates per gene, corresponding to the number of expression measurements obtained for genes with either A/J or C57BL/6J DSO status regardless of predominant background strain. In the course of analysis, we noted genomic regions where most RCS share chromosomal segments derived from the same parental strain. This correlates with our observation of uneven partitioning of the data set by the two terms in the ANOVA model (DSO and BG). As a result, the effects of DSO and BG could not be separated for some of the genes (30%). However, the panel provided a sufficient degree of heterogeneity to test association with DSO, while adjusting for background, for 8,860 genes.
Of the genes tested, we identified 1,591 probe sets (Supplemental Table 1, available at the Physiological Genomics web site)1 with significant association between expression level and DSO (P < 0.05), of which 1,185 probe sets displayed no association with background (P > 0.05 for BG and P > 0.05 for DSO*BG) (Fig. 2). To estimate the false-discovery rate due to this stage of the analysis, we determined the effect of correcting for multiple hypothesis testing. Correction for multiple testing with the Benjamini-Yekutieli procedure to control the false-discovery rate reduces the number of probe sets demonstrating significant association with DSO to 483 (45). This includes 40 probe sets representing transcription factors and coregulatory proteins (Table 1). However, because our approach combined independent experimental validation techniques, we chose relaxed thresholds (P < 0.05) without correction for multiple testing to determine the overall sensitivity of the subsequent validation steps. A comparison of the contribution of each variable to the overall variability revealed that DSO contributes the most, followed by BG, and then by their interaction (Fig. 3). We note that 1,213 genes display significant association (P < 0.05) between expression level and BG, suggesting the presence of elements located distant from the affected transcript. We further note 651 genes displaying significant association (P < 0.05) between expression level and the DSO-BG interaction term. This suggests that expression of the transcript is regulated by a combined effect of genetic variants located both distant to and within proximity to the affected transcript.
To evaluate whether the results of our expression analysis corresponded to genetic differences, we determined the SNP content at selected gene loci by sequencing 1 kb of 3′ UTR for genes selected randomly from those negative and positive for DSO association. We resequenced 50 probe sets that showed association with DSO as well as 80 probe sets that showed no association with DSO. These were selected randomly from lists of probe sets that had been filtered to exclude transcripts that were technically unsuitable for evaluation in the sequence-based AI assay (see materials and methods). In total, 234 randomly selected DSO-negative probe sets and 101 randomly selected DSO-positive probe sets were evaluated to select the validation set. We do not feel that this reflects a significant difference in the structure or expression level of the genomic loci underlying the DSO-positive and DSO-negative probe sets, as none of the selection criteria discriminated between equivalent numbers of selected DSO-positive and DSO-negative probe sets and none displayed a trend with the P-value for DSO-expression association (results not shown). In addition, in contrast to a previous report showing that transcript levels measured with oligonucleotide-based expression microarrays may be reduced for RNA targets that contain SNPs within the oligonucleotide probe sequences (15), analysis of the expression data obtained in this study suggests that hybridized probe intensities displayed more variability across the data set for SNP-containing probes but that the summary expression measures showed no systematic bias in favor of one allele (see Supplemental Analysis).
On the basis of the known ancestry of the parental lines used in this study (19, 56), we hypothesized that the SNP content should be higher at loci showing positive association if their expression levels were regulated by cis-acting genetic variants. To assess this, we obtained sequences from the 3′ UTR as we expected that this region would show a higher rate of polymorphism than coding sequences. Our sequencing results showed a significantly increased occurrence of SNPs in genes positive for association between DSO and expression (Tables 2 and 3), with 50% of genes with positive DSO association vs. 28% of negative genes containing SNPs (P < 0.01, Fisher exact test). By logistic regression we observed a trend (P < 0.01) between increasing likelihood for SNP occurrence and P-value for association between gene expression association and DSO in the ANOVA model—the odds ratio associated with a 100 times smaller DSO P-value is 1.89, with a confidence interval of 1.17–3.03. This observed correlation between significant DSO-expression association and SNP occurrence concurs with previous results demonstrating cis-acting expression quantitative trait loci (QTL) within regions that are not identical by descent (15).
To determine whether cis-acting genetic variation was associated with differences in gene expression levels, we used a sequence-based assay to measure AI for genes where we had previously identified 3′ UTR SNPs. These assays were performed with RNA obtained from the lungs of five replicate F1 mice generated from an A/J × C57BL/6J cross, which places loci with A/J or C57BL/6J DSO status in a common genetic background (Fig. 4). Of the 32 positive and 23 negative genes containing SNPs, 17 (50%) positive genes displayed AI vs. 5 (22%) negative genes (P < 0.05, Fisher exact test). The presence of AI also displayed dependence on the P-value for DSO-expression association by logistic regression (P < 0.05), suggesting that the likelihood of AI increases with higher significance of association between expression and DSO in the ANOVA model—the odds ratio associated with a 100 times smaller DSO P-value is 1.92, with a confidence interval of 1.04–3.53. Results for SNP discovery and AI are summarized in Table 4.
The genome-wide identification of cis-regulated genes will provide a key resource in the search for genetic determinants of complex traits as well as a significant prioritization tool to identify candidate genes. Recent studies of gene regulation in yeast have identified networks of gene regulatory interactions (4, 50). This progress parallels studies of gene regulatory network dynamics, where the topology of the regulatory interactions is described in terms of motifs and modularity (1, 24, 33, 60), as well as chromatin immunoprecipitation studies (29, 46, 51, 54). Although these results suggest that the effects of a fixed cis-regulatory polymorphism may be significantly modified by the control hierarchy in which it is embedded, they also suggest that the link between cis-regulatory polymorphisms and disease states may only be apparent in specific genetic or physiological contexts. By employing an experimental system (RCS) and applying a relatively simple analytical strategy, this study defines a method for identifying cis-acting variation affecting gene expression with respect to background genetic context. In combination with an independent experimental validation AI method, this approach represents an opportunity to efficiently survey cis-acting regulatory variation across the genome.
The percentage of genes found to have AI among those showing an association between expression and DSO (50%) represents a 10- to 20-fold enrichment for detection compared with random screening of genes (10, 42). We note that the detection of association between expression and DSO was greatly enhanced by our ability to adjust for the effect due to the predominant (background) genome derived from the A/J and C57BL/6J parental lines. In agreement with a previous study (7), the transcription factor Runx1, known to be subject to cis-regulation, displayed the highest degree of association by our expression analysis and was confirmed by AI. We note that although our analysis did not identify all of the same candidates as the previous studies in recombinant inbred strains (21), our results confirmed 25 of 75 cis-regulated transcripts described previously (8). Because both tissue-specific effects as well as the specific inbred strains used in these studies may account for these differences, it would be interesting to validate our findings with other RCS or recombinant inbred panels. Although genes displaying AI are more likely to contain cis-acting regulatory variants (41), identification of the causative polymorphisms and mechanisms leading to AI remains to be confirmed empirically.
The RCS expression data set, combined with the F1 studies of AI, allows us to estimate that 8–11% of genes are affected by cis-acting regulatory variation in lung tissue among these strains. This percentage exceeds previous estimates across three tissues and four mouse inbred strains, where AI was seen in 3–6% of randomly selected genes (10), but agrees with a recent study comparing a cross of C57BL/6J and DBA/2J strains (15). We feel that the estimated 8–11% of genes subject to cis-acting regulation possibly represents a lower limit for a number of reasons. First, this study focused on a single tissue (lung) at one developmental stage (adult), whereas cis-regulation is known to act in a tissue-dependent fashion (10) and transcriptional changes are abundant throughout development (3, 12, 47). Second, a significant proportion of genes showed association between expression and DSO but did not demonstrate detectable AI. Although most of these may be true negatives in regard to cis-acting differences at the gene locus, weaker cis-acting effects may have remained undetected by the sequence-based AI assay; some cis-acting regulatory variants may have been masked because of complex control mechanisms happening in the F1 animals used to detect AI. Third, our design excluded genes falling within segments containing a recombination because the positions of recombination sites were not characterized. Although this affected a small percentage of results dispersed throughout the data set (6%), this exclusion together with varied levels of heterogeneity in the data set could have led to insufficient power to evaluate a number of genes on the microarray. Fourth, our analysis methods assumed independence for each locus tested; this may not be appropriate because dependencies exist among genes that are colocated in the same chromosome region (27). An analytical strategy involving more traditional techniques such as QTL mapping (30) would take into account the dependence on location as well as recombination rate throughout the genome. Finally, our results suggest that a proportion of genes with cis-acting variants do not display expression differences that are detectable across the RCS panel. We observe that five genes negative for association (P > 0.05 for DSO) display AI. This could result if the expression measurements were not sensitive enough to detect subtle changes in gene expression, which would be anticipated for tightly regulated transcripts. Alternatively, the expression results may have been confounded by one or more collinear variables, such as the contribution of ancestral haplotype variability as suggested by the differential SNP discovery rate (57, 58), or epistatic trans-interactions. Given the extent of gene interactions estimated to exist, collinear variables are more likely the case than the exception (4).
We note the substantial contribution of the background genetic composition to the overall expression variability observed across the strains; over 1,200 genes displayed significant associations with BG (P < 0.05). These observations indicate the presence of determinants affecting expression variability that are distant from the affected gene, suggestive of trans-acting regulatory mechanisms. Furthermore, a large number of genes (347) displayed significant associations of transcript abundance simultaneously with both DSO and background terms, suggesting that they are affected simultaneously by independently cis-acting and trans-acting regulatory variation. In addition, 651 genes displayed P < 0.05 for DSO*BG interaction, indicating the existence of dependencies between cis and trans effects. Examples of genes simultaneously affected by cis-acting regulation and trans-acting modifier genes are numerous (22, 31, 32, 37, 38). These results confirm the prevalence of such effects and highlight the importance of assessing the contribution of genetic context when measuring gene expression phenotypes.
This research was supported by grants from the Canadian Genetics Diseases Network, the Canadian Institutes for Health Research, the Mathematics of Information Technology and Complex Systems (Networks of Centres of Excellence Program), Genome Canada, and Genome Quebec. T. J. Hudson is supported by a Clinician-Scientist Award in Translational Research by the Burroughs Wellcome Fund and an Investigator Award from the Canadian Institutes of Health Research. The recombinant congenic strains were supported by Emerillon Therapeutics Inc., Genome Canada, and Genome Quebec.
We thank Scott Gurd and Andre Ponton in the Microarray Facility at the McGill University and Genome Quebec Innovation Centre for contributions to the project, as well as Jean-Marie Chavannes and the animal care technicians at the McGill University Health Centre.
↵1 The Supplemental Material for this article (Supplemental Table 1 and Supplemental Analysis) is available online at http://physiolgenomics.physiology.org/cgi/content/full/00168.2005/DC1.
Article published online before print. See web site for date of publication (http://physiolgenomics.physiology.org).
Address for reprint requests and other correspondence: R. Sladek, 740 Dr. Penfield Ave, Rm. 6214, Montreal, QC, Canada H3A 1A4 (e-mail:).
- Copyright © 2006 the American Physiological Society