Despite a growing number of reports of gene expression analysis from blood-derived RNA sources, there have been few systematic comparisons of various RNA sources in transcriptomic analysis or for biomarker discovery in the context of cardiovascular disease (CVD). As a pilot study of the Systems Approach to Biomarker Research (SABRe) in CVD Initiative, this investigation used Affymetrix Exon arrays to characterize gene expression of three blood-derived RNA sources: lymphoblastoid cell lines (LCL), whole blood using PAXgene tubes (PAX), and peripheral blood mononuclear cells (PBMC). Their performance was compared in relation to identifying transcript associations with sex and CVD risk factors, such as age, high-density lipoprotein, and smoking status, and the differential blood cell count. We also identified a set of exons that vary substantially between participants, but consistently in each RNA source. Such exons are thus stable phenotypes of the participant and may potentially become useful fingerprinting biomarkers. In agreement with previous studies, we found that each of the RNA sources is distinct. Unlike PAX and PBMC, LCL gene expression showed little association with the differential blood count. LCL, however, was able to detect two genes related to smoking status. PAX and PBMC identified Y-chromosome probe sets similarly and slightly better than LCL.
- system biology
- biomarker discovery
- fingerprinting genes
- data normalization
- X-linked expression
- cardiovascular disease
owing to accessibility, practicality, and minimal invasiveness, blood-derived RNA sources, such as lymphoblastoid cell lines (LCL), whole blood cells (PAXgene tubes; PAX), and peripheral blood mononuclear cells (PBMC), have been widely used in gene expression studies for biomarker identification (30, 32) and pathway profiling (10). These RNA sources have been useful for identifying biomarker signatures of lupus (9), cancer (75), and bacterial infection (14). This makes blood-derived RNA valuable even when studying diseases involving remote target tissues (66).
Each of these blood-derived RNA sources is known to have inherent characteristics that will result in a unique gene expression profile (25). PAX samples, derived from whole blood, capture RNA profiles of all cell types in whole blood, including erythrocytes, granulocytes (neutrophils, eosinophils, basophils), lymphocytes, monocytes, and platelets. PBMC samples, derived from a Ficoll-filtered lymphocyte and monocyte subset, are largely devoid of granulocytes, platelets, and reticulocytes. LCL samples, derived from lymphoblastoid cell lines [i.e., B cells infected and immortalized by Epstein-Barr virus (EBV), stored frozen and regrown several years after sample collection], represent RNA from a single cell type. In addition, gene expression differences may also arise from varying RNA isolation protocols and sample handling (6, 25, 50, 80).
Despite a growing number of reports of gene expression analysis from these RNA sources, there have been few systematic comparisons of their suitability for biomarker discovery, especially in the context of cardiovascular disease (CVD). Previous studies (50, 67) have examined gene signature differences among these RNA sources. One study examined the expression profile differences among the sources with respect to age and sex (80) in a spotted-array platform. However, none of these studies has a balanced experimental design that can eliminate certain statistical biases in the analysis.
Therefore, the primary goal of this study, which was undertaken as a feasibility study for the Systems Approach to Biomarker Research (SABRe) in CVD Initiative, was to characterize three blood-derived RNA sources, LCL, PAX, and PBMC, for quantity and quality of RNA, and expression properties using an exon-array platform in a balanced experimental design. The performance of these sources was assessed with regard to identification of differential expression of Y-chromosome probe sets with sex, which is a major CVD risk factor (35). Beyond sex associations, associations of expression with other risk factors, such as age, smoking status, and high-density lipoprotein (HDL) cholesterol level, were also explored. In addition, complete blood counts (CBC) (19) obtained at the time of blood collection allowed tests of association of expression with blood cell proportions.
The balanced experimental design permitted a secondary goal of identifying genes whose expression levels are stable across RNA sources within individuals yet highly variable across the population. Such markers may be useful in fingerprinting the samples for forensic identification or in resolving sample mix-ups, which is a common problem in gene expression studies. Last, we also identified genes that are consistently expressed across multiple RNA sources and across individuals, making them suitable for use as calibration markers.
MATERIALS AND METHODS
The first cohort of the Framingham Heart Study (FHS) included 5,209 men and women between 30 and 60 yr of age who enrolled in 1948 and have undergone biennial examinations (24, 54, 55). In 1971, 5,124 children (spouses of children) of the original cohort were recruited to the Framingham Offspring Study (27). In 2002, 4,095 participants were included in the third generation cohort (71). Blood samples were obtained from 50 consecutive participants from the third generation cohort who attended their second examination cycle clinic visit in January 2009. Immortalized cell lines for these same participants were prepared from samples taken during their initial clinic visit 1, ∼5 years earlier. To investigate for possible sample storage effects, we obtained 24 whole blood samples from the offspring cohort which were sampled in 2005–2006 and stored for 3–4 yr at −80°C prior to RNA preparation. Protocols for participant examinations and collection of genetic materials, including immortalized cell lines, were approved by the Boston University Medical Center Institutional Review Board.
Individual Trait Data
Current smoking status (defined as regularly smoking one or more cigarettes per day during the past year), systolic and diastolic blood pressure (seated, measured twice in the left arm by a physician), total and HDL cholesterol levels, fasting blood glucose level, and body mass index (BMI, weight in kg divided by height in m2) were obtained at the clinic visit. Hypertension was defined a systolic blood pressure of at least 140 mmHg or a diastolic blood pressure of at least 90 mmHg or current use of antihypertensive medication. Diabetes was defined as fasting blood glucose of at least 126 mg/dl or current use of insulin or an oral hypoglycemic medication. CBCs were obtained on samples collected from the third generation at the second examination clinic visit using a Beckman Coulter Counter (Beckman Coulter, Brea, CA).
RNA Isolation and Target Labeling
The three RNA sources collected on each of the 50 consecutive individuals included PAX and PBMC (obtained at the second clinic examination of the third generation cohort), and LCL, obtained at the first clinic examination ∼5 yr earlier.
Blood Specimens (2.5 ml) collected in PAXgene tubes from each participant were incubated at room temperature for 4 h for RNA stabilization and then stored at −80°C. RNA was extracted from whole blood using the PAXgene Blood RNA System Kit following the manufacturer's guidelines (62). In brief, samples were removed from −80°C and incubated at room temperature for 2 h to ensure complete lysis. Following lysis, the tubes were centrifuged for 10 min at 5,000 g, the supernatant was discarded, and 500 μl of RNase-free water was added to the pellet. The tube was vortexed thoroughly to resuspend the pellet, centrifuged for 10 min at 5,000 g, and the entire supernatant was discarded. The pellet was resuspended in 360 μl of buffer BR1 by vortexing and RNA was further purified with on-column DNase digestion. Quality of the purified RNA was verified on an Agilent 2100 Bioanalyzer (Agilent Technologies, Palo Alto, CA); RNA concentrations were determined using a NanoDrop ND-1000 spectrophotometer (NanoDrop Technologies, Wilmington, DE). We amplified 50 ng of total RNA using NuGEN's WT-Ovation Pico RNA Amplification System and labeled it with FL-Ovation cDNA Biotin Module V2 (NuGEN, San Carlos, CA) according to the protocol provided by the supplier.
Venous blood (8 ml) from each participant was collected into Vacutainer cell preparation tubes containing sodium citrate and Ficoll (Becton Dickinson, Franklin Lakes, NJ). Purified PBMC suspensions were resuspended in RLT buffer (700–1,000 μl per 107 cells), passed through Qiashredder columns (Qiagen, Valencia, CA), and then stored at −80°C. Total RNA (50 ng) was amplified and labeled using the Affymetrix Whole-Transcript (WT) Sense Target Labeling Protocol without rRNA reduction. Complementary DNA (cDNA) was regenerated through a random-primed reverse transcription using a dNTP mix containing dUTP. The RNA was hydrolyzed with RNase H, and the cDNA was purified. The cDNA was then fragmented by incubation with a mixture of UDG and APE1 with restriction endonucleases and end-labeled via a terminal transferase reaction incorporating a biotinylated dideoxynucleotide.
Total RNA was extracted from pelleted lymphoblastoid cells of each participant using the Qiagen RNeasy Plus extraction kit according to the manufacturer's protocol (61). The process included a column-based elimination of genomic DNA. Total RNA (50 ng) was amplified and labeled using the Affymetrix Whole-Transcript (WT) Sense Target Labeling Protocol without rRNA reduction. cDNA generation, RNA hydrolysis, fragmentation, and labeling were carried out with the same protocol as described above for PBMC samples.
We added 5.5 μg of the fragmented, biotinylated cDNA prepared from each of the whole blood, PBMC, and cell line samples to a hybridization cocktail, loaded it on an Affymetrix Human Exon 1.0 ST GeneChip, which contains ∼1.4 million probe sets in total, and hybridized it for 16 h at 45°C and 60 rpm (3). Following hybridization, the array was washed and stained according to the manufacturer's protocol. The stained array was scanned at 532 nm using an Affymetrix GeneChip Scanner 3000, generating CEL files for each array. Aside from 10 samples (4 LCL, 3 PAX, 3 PBMC) with insufficient RNA, all samples were chipped in two batches.
Expression Data Analysis
We applied the robust multichip average (RMA) method (34) to normalize expression values for the remaining 140 samples using the Affymetrix Power Tool (APT) (4) version 1.12.0. We used the following metrics (2) to determine the quality of the hybridized samples: all_probeset_mean, all_probeset_rle_mean, pm_mean, and pos_vs_neg_auc. Two LCL and five PBMC samples failed on these metrics, and were excluded, along with the other samples from the same individuals. After inspecting Y-chromosome probe sets for agreement across the samples from each individual, we additionally removed two PAX and two PBMC samples due to apparent mislabeling. This left 35 individuals with satisfactory results for all three sample types, giving 105 samples. We repeated probe set-level RMA normalization on these samples, retaining only core-level, RefSeq-annotated probe sets, giving 287,329 probe sets in all, representing 18,282 distinct genes. We also performed transcript cluster-level normalization at the “core level,” giving 17,330 RefSeq-annotated genes. The gene counts and annotations are based on Affymetrix NetAffx release 31 (1).
Having discarded samples from participants 21, 23, and 35 due to insufficient RNA, we normalized the CEL files of the remaining 140 chipable samples with the RMA method using the APT software in three runs, one per RNA source. The quality control parameters of these samples are shown in Supplementary Table S13.1 Since participant 45 did not yield sufficient RNA in its LCL sample, its PAX and PBMC samples were discarded. PBMC samples of participants 2 and 8 and PAX samples of participants 43 and 44 were found mislabeled by the inspection of Y chromosome probe sets. These four samples were discarded. LCL sample of participant 18 was identical to that of participant 17, and LCL sample of participant 42 was identical to that of participant 43. Only LCL sample from participant 18 could be restored. We select only participants with samples having all probe set RLE mean of at most 0.75. This step removed samples from participants 37, 38, 40, 41, 46, and 47. We renormalized the 105 CEL files altogether from the remaining 35 patients.
To address an apparent systematic bias in gene expression values between PAX results and either PBMC or LCL (Fig. 1A), likely arising from the differences in labeling protocol, we further normalized the data with the S10 postnormalization procedure (52), a variance-stabilizing and quantile-normalizing transform. In addition to S10, we also considered quantile postnormalization (QPN), which is a quantile-normalization transform. We will choose the transform that minimizing variance across participants.
Using QPN, we computed the mean value of each probe set per RNA source, yielding three sets of mean values. We chose PBMC as the reference distribution because its mean values correlate well with those of LCL and PAX. Such selection is aimed to minimize drastic quantile correction. After the mean value of each gene for LCL and PAX was quantile-normalized against that of PBMC, its individual expression values were shifted by the difference between the original and normalized mean values.
For S10, we computed the anti-log of the RMA expression values, calculated the normal quantiles, then computed mean and standard deviations across samples, and then fit a spline to the standard deviation as a function of the mean. A variance-stabilizing transform function is computed from this smooth function, and then applied to the data. Finally, the log base 2 was computed on the normalized data.
After postnormalization, the QPN-transformed mean densities were identical to that of PBMC, while those S10-transformed were diagonally aligned (Fig. 1B). Using two-way ANOVA with RNA source and participant as fixed factors, we determined to use S10 because it minimizes variance across participants while normalizing the quantiles.
All statistical analyses were performed both at the exon/probe-set level and at the gene/transcript cluster level using R (63) version 2.11.1 or JMP 9 (SAS, Cary, NC). The MSCL Analyst's Toolbox (freely available at http://abs.cit.nih.gov/MSCLtoolbox/) was used for initial exploratory analyses and feature discovery. We discarded 92,157 probe sets where the intensity of 52 or fewer (≤50%) of the 105 samples was significantly above background [i.e., with detected-above-background (DABG) P values ≤ 0.05], leaving 195,172 for subsequent analysis. DABG filtering was applied to exon-level data and not to gene-level data. In all cases, we calculated the false discovery rates (FDR) with Benjamini and Hochberg's method (11).
To determine the separation of expression patterns across cell types, we performed principal component analysis (PCA) on all 105 arrays on DABG-filtered exon-level data. The PCA was performed using the “prcomp” function (76) of R on centered, but unscaled data.
To determine differentially expressed genes between each pair of RNA sources, we used a two-way ANOVA with fixed factors for sample type (n = 3) and participant (n = 35). We counted probe sets and transcript clusters where mean expression differences among RNA sources were declared significant based on the sample type F-statistic (FDR ≤ 0.05). Comparison of expression between pairs of RNA sources used a post hoc t-test statistic with the same FDR threshold. To identify genes that were uniquely overexpressed in each RNA source compared with the other two sources, we computed the minimum fold-change for each paired comparison and required this to be greater than eightfold.
Stable fingerprinting genes/exons are those with expression levels strongly related to participant, irrespective of RNA source. Using the same two-way ANOVA, we selected such genes/exons with a significant participant effect at FDR ≤ 0.05. A subset of the most significant exons having participant-effect standard deviations of at least twofold change were clustered using Ward's method (53) on their expression level, after subtracting the sample-type effect. Conversely, housekeeping or calibration genes were selected as those with the smallest variation across sample type or across participant. We selected such genes with 1) P value for sample type >0.2, 2) P value for participant >0.2, 3) standard deviation of the within-participant effects less than twofold change, and 4) mean expression level greater than background threshold (4 RMA units).
Transcript profile associations with age, sex, and selected CVD risk factors.
To discover genes that are expressed differentially in men vs. women, we performed a two-sample t-test with unequal variance assumption for each RNA source. We required the exons and genes to pass the FDR ≤ 0.05 threshold.
For trait-based biomarker discovery, we regressed the RMA expression of each gene for each RNA source against the trait, adjusting for age and sex, using the linear mixed-effects model implemented in the R package “lmer” (59). Owing to small sample size, we relaxed the multiple-testing penalty by setting a P value cutoff of 0.05 and selecting only genes with a significant association in more than one RNA source. For confirmatory testing of previously identified biomarkers using our data, we regressed the RMA expression of each exon against the trait, adjusting for age and sex using P value ≤ 0.05.
Differential blood count analysis.
Because blood is a complex tissue made up of varying proportions of several cell types, each with a distinct expression profile, expression of some genes would be expected to vary proportionally to these components. To find such genes, we used a multiple regression model with factors for the absolute count per μl of each measured component: red blood cells, platelets, neutrophils, lymphocytes, monocytes, eosinophils, and basophils. We collected the significance levels (P values) for each factor and performed FDR adjustment as above. We also relaxed the FDR cutoff to 0.2, since few results were obtained at lower levels. The test was repeated for each RNA source.
Gene ontology analysis.
We performed gene ontology (GO) (7) enrichment analysis of the differentially expressed genes between each pair of RNA sources based on exon-level or gene-level data using GOrilla (26). This method determines whether the number of differentially expressed genes having a particular GO assignment is significantly higher than would be expected by chance, given the total number of genes, the total number having that assignment, and the number of differentially expressed genes, overall. We removed unannotated genes and required the remaining genes to pass 1) an FDR ≤ 0.05 threshold and 2) at least a fourfold difference in expression. We ran GOrilla using genes in the Affymetrix NetAffx core-level annotation version 31 for the Human Exon 1.0 ST GeneChip as the background set of genes.
RNA Source Comparison
The clinical characteristics of the study sample are provided in Table 1. PCA revealed striking differences in expression patterns of the three RNA sources (Fig. 2). The first two principal components, attributable to RNA source differences, accounted for 70.88% of overall variance in expression. The PCA plot of the 24 older PAX samples coincide with the newer PAX samples. Therefore, the striking differences among the three RNA sources are much larger than any possible sample storage or aging effects.
About 90% of probe sets (176,641 of the 195,172 expressed above background) were found to differ across RNA sources (FDR ≤ 0.05). Even when an exceedingly low FDR cutoff (≤1×10−8) was set, more than half the exon probe sets (105,709) differed significantly across RNA sources (Table 2). A similar percentage showed expression differences at the gene level. Most of these expression differences were seen in the PAX vs. LCL and PAX vs. PBMC comparisons. Genes that are uniquely overexpressed by ≥8-fold in each RNA source compared with the other two sources were ranked by level of overexpression, and the topmost are presented in Tables 3⇓–5. The corresponding tables based on exon-level analysis are given in Supplementary Tables S1–S3.
Seven red blood cell-related genes were overexpressed in PAX compared with the other two RNA sources (Table 3) including the top-ranked GYPB (glycophorin B) gene. Hemoglobin and hemoglobin-related genes, including HBB, HBD, ALAS2, RHD, and AHSP are seen in high-ranking positions. Many known erythrocyte-related genes, such as HEMGN, EPB42, EPB49, and SLC4A1, were also significantly higher in PAX, but were seen above the eightfold cutoff only in the exon-level analysis. These observations clearly result from the fact that PAX samples are derived from whole blood, which comprises predominantly erythrocytes and reticulocytes as well as white blood cells. Genes associated with neutrophils such as SLPI were also overexpressed in PAX compared with LCL or PBMC (Table 3). Most of these genes were also moderately expressed in PBMC but markedly lower in LCL expression. The top GO (7) categories (see Table 6) for genes most highly expressed in PAX are related to RNA splicing and processing.
The top 32 genes specific to LCL (Table 4) are rich in cell cycle-related genes. For example, CDK1 (cyclin-dependent kinase 1) and CCNB2 (cyclin B2) are 68- and 65-fold overexpressed in LCL compared with the other two RNA sources. Top GO categories for genes most highly expressed in LCL (see Table 6) are related to cell cycle and mitosis, which are indicative of a cell line undergoing rapid cell division. Several genes known to be induced by EBV were also overexpressed in LCL. Of these, EBI3 (Epstein-Barr induced 3) was the most highly differentially expressed. Fifteen other known EBV-induced genes show significant overexpression of two- to sixfold. In general, exon-level analysis was more sensitive than gene-level analysis in identifying such genes (Supplementary Table S2). Comparison of LCL with PAX expression appeared to be generally more sensitive to EBV-induced differences than the comparison with PBMC (e.g., CR2, Table 4).
PBMC overexpression (Table 5) was seen in many genes known to be platelet specific, or involved in coagulation. For example, P2RY12, THBS1, ITGB3, PTGS1 were abundantly expressed in PBMC vs. PAX and PBMC vs. LCL. Evidently, the inclusion of platelets within the PBMC fraction is sufficient to for allow detection of these genes (64). The top GO categories (Table 6) for genes most highly expressed in PBMC are related to immune response, platelet activation, and blood coagulation reflecting the primary presence of lymphocytes and monocytes, and some platelets, in this sample type.
Analysis of Differential Blood Count Data
We were able to identify the associations of numerous genes with individual blood elements in the differential blood count (Table 7). In PAX samples, most of the genes with positive associations were associated with neutrophil or lymphocyte counts, while in PBMC, the genes were generally associated with lymphocyte (36, 70, 73) and monocyte counts, as would be expected based on the cell-type composition of these sources. Some of the neutrophil-associated genes, such as SLPI and IL1R2, are also reported in Table 3. As before, the exon-level analysis often detected more genes than did the gene-level analysis. GO analysis of these genes showed overrepresentation in the categories of immune system regulation, lymphocyte, leukocyte, and T-cell activation (Supplementary Tables S4–S6). In contrast, no genes were associated with the differential blood count in RNA from LCL. This may be due to the single cell type represented in LCL or because the LCL samples were derived from whole blood obtained 5 yr prior to the differential blood counts whereas PAX and PBMC were drawn at the same time as the blood counts were performed.
Identification of Genes Associated With Major CVD Risk Factors
Not surprisingly, a search for biomarkers of sex in our study yielded many Y-chromosome genes (Table 8). All three sample types identified 128 exons within nine distinct genes residing on the Y-chromosome at FDR ≤ 0.05 level. An additional 28 exons on 11 Y-chromosome genes were detected in one or more of the RNA sources, with PAX able to detect 16 exons, PBMC 15 exons, and LCL 9 exons at this FDR level. Only 14 exons of two X-linked genes [KDM5C, KDM6A, lysine (K)-specific demethylase 5C and 6A] were differentially expressed in women vs. men in all three RNA sources. However, 142 exons in 22 genes (Table 8) showed differential expression in at least one source. Several of these genes are obvious homologs to their Y-linked counterparts (DDX3X, EIF1AX, NLGN4X, PRKX, RPS4X, ZFX). Interestingly, in LCL samples more X-linked overexpression in women was detected, with 99 additional exons beyond those detected in all three sources, compared with 28 for PAX and 25 for PBMC. The key gene responsible for X inactivation (XIST), which is ordinarily highly overexpressed in women, is less overexpressed in women in LCL samples (Supplementary Table S7), compared with PAX or PBMC (female to male fold-change of 25 in LCL, 140 in PAX, and 52 in PBMC). Furthermore, we observed that XIST expression in LCL in women is significantly correlated with 339 of 648 X-linked genes (FDR ≤ 0.2 genome wide). The majority (206) are negatively correlated with X-linked expression, further supporting the idea that XIST-mediated X inactivation is substantially and variably disrupted by EBV infection/transformation and/or culture conditions of the LCL samples.
There were 90 autosomal exons (74 genes) associated with sex at FDR ≤ 0.05, none was significant in more than one RNA source (Table 9). PBMC identified exons from 52 genes, while LCL and PAX identified 19 genes each.
Several probe sets in the CHRNA3 (cholinergic receptor, nicotinic, alpha 3) gene were downregulated in smokers in all three RNA sources in our study, though not significantly (P < 0.10). PAX and PBMC samples showed a stronger tendency toward downregulation (P = 0.06) on probe set ID 3634334. Variants of CHRNA3 have been associated with smoking behavior and susceptibility to lung cancer (5). Genetic variants in ALDH2 (aldehyde dehydrogenase 2) have been studied extensively in relation to smoking and lung cancer risk (57). In our study, LCL samples detected 1.36- to 2.84-fold lower expression of this gene in smokers, with 14 of 16 probe sets having P values < 0.05. Average expression of all 16 exons differed significantly in LCL samples (P = 0.004) was borderline for PBMC (P = 0.054) but did not differ for PAX (P = 0.247).
The relatively narrow age range of the participants hindered biomarker detection for age. Nevertheless, five genes were associated with age (P < 0.05) for each of the three RNA sources (Supplementary Table S8). One of them, TP53, has been associated with senescence (28). The magnitude of expression differences was small with only three out of five genes having the same directional difference in all three RNA sources.
HDL cholesterol levels.
Since the small sample size hindered discovery of gene expression signatures of HDL cholesterol, we sought to confirm previously observed associations with HDL. Four such genes were seen to be associated in PAX at P < 0.05, four were associated in PBMC and none in LCL (Supplementary Table S9). Two genes, FADS1 (fatty acid desaturase 1) and LDLR (low-density lipoprotein receptor), were associated with HDL levels in both PAX and PBMC, with small but consistent inverse associations of higher expression with lower HDL. These genes are known to influence circulating lipid levels and risk of coronary artery disease (78). The remaining CVD risk factors listed in Table 1, including BMI, total cholesterol, and blood pressure, were analyzed but did not reveal any significant association with gene expression.
Robust and Consistent Markers
We identified a number of exons that strongly distinguished individual participants, irrespective of RNA source. These fingerprinting exons have robust expression levels (i.e., their relative expression is independent of RNA source) and may allow for identification of individuals within a large study sample.
We selected 423 such exons drawn from 247 distinct genes having statistical significance (Table 10, Supplementary Table S10). Among the top results were several histocompatibility antigen genes (HLA-DRB1, HLA-DRB5, HLA-DPB1, HLA-B, HLA-DQA2, HLA-DQB2). HLA genes code for antigenic surface proteins used by the immune system to recognize “self” and thus are highly specific to an individual's ancestry. These genes have been suggested as biomarkers for autoimmune diseases (56, 60). These 423 selected exons were able to cluster the three samples from each participant perfectly. Indeed, a subset of only 38 autosomal exons exhibiting the largest F-ratio for participant effects together with five exons on the X- and Y-chromosomes were sufficient to cluster the participants perfectly (Fig. 3). Of note, these fingerprinting markers include one exon of the β-actin gene (ACTB), commonly used as a calibration standard or housekeeping gene. This ACTB exon exhibits a strongly bimodal expression pattern (Fig. 4), possibly due to the influence of an underlying or associated SNP. A similar bimodal pattern is also seen in other probe sets (Fig. 5), such as exons of genes GSTM1, HLA-DRB1, and OAS1. OAS1, which encodes a protein vital to immune response to viral infection, is associated with multiple diseases (40) and contains common functional variation that strongly affects exon inclusion (58). In the case of GSTM1, the bimodal pattern is evident in eight consecutive probe sets covering seven distinct exons, suggesting a true pattern of bimodal expression or extensive splice variation, rather than the direct influence of a single SNP. GSTM1 is an important drug and xenobiotic metabolizing enzyme that is known to exhibit common copy number variation that likely contributes to the observed bimodal pattern of expression (33). The complete list of fingerprinting exons is given in Supplementary Table S10.
Stable “calibration” genes.
Conversely, we also searched for genes expressed above background (>4.0 in log2 RMA scale) and that had nonsignificant expression changes (<2.0-fold change, P value >0.2) across RNA sources and across participants. These genes would be valuable for batch corrections, meta-analysis across RNA sources or platforms, and for calibrating expression levels of transcripts of other genes (17). We found 139 genes meeting these criteria (Supplementary Table S11). Most are well-known and well-annotated protein coding genes. Many are known to be expressed in whole blood. Some of the most stable genes were CLCN6, TEAD3, ART5, COX6A2, SIRT5, ACTL6B, GPR50, GPR32, and RAB8B. Although these may not commonly be used as housekeeping genes, they are likely to be quite stable as calibration standards in future analyses using this platform. At the exon level, we found 1,544 exons representing 1,355 genes that passed similar selection criteria. Of these exons 25, representing 22 distinct genes, were common to the set selected at the gene level, including CLCN6, CSNK1G3, FAM48A, and RAB8B.
Each of the three RNA sources bears distinct characteristics, evident by the clear separation in the first two principal components (Fig. 2) and the finding that most genes were differentially expressed among the different sources (Table 2). Since most genes are expressed differentially across the RNA sources, their associations with each of the traits we studied are also different, warranting careful selection of the RNA source in a gene expression experiment. For the gene expression signature of sex, all three RNA sources yielded a large common subset of Y-chromosome genes strongly linked to sex. LCL samples were able to detect expression differences in X-chromosome genes between men and women, but this may be due to reversal of X-chromosome inactivation during EBV infection, cell immortalization, and culture. PBMC were better able to detect sex-linked autosomal genes than the other two RNA sources, although apparently none of the detected genes were also detected in prior studies (39), suggesting that our observation may be unique to our sample.
As cultured cells, LCL samples are less likely than PAX or PBMC samples to reflect in vivo expression changes. For example, LCL did not detect association between lymphocyte-related genes and lymphocyte differential counts. These findings, together with the perturbation of expression attributable to the EBV transformation process itself, suggest that LCL may be of limited value in identifying expression signatures of many health related traits. Prior work has shown limitations in the use of expression signatures in LCL due to their ex vivo status (16, 21). However, the ability of LCL to detect downregulation of ALDH2 in smokers suggests that epigenetic influences conditioned by the environment may still be encoded in LCL expression profiles.
It is important to note that a proportion of the differences observed between PAXgene and the other two sample types may be due to differences in preparation kits. As noted in materials and methods, PAX require a distinct preparation kit from that used for PBMC and LCL. However, by focusing on the minimal difference observed between each type vs. the other two (see Table 3), we attempted to report differences most likely attributable to underlying biological differences rather than simply due to technical sources. For example, the comparison of LCL with PBMC (which use the same preparation kit) shows very large differences for genes involved in cell-cycle pathways, as might be expected in transformed LCL cells.
Our study has several important advantages over prior studies. A balanced study design with three blood-derived RNA sources from each of 35 participants allows investigation of biomarkers and source-invariant genes to be undertaken more thoroughly. Indeed, few population-based expression studies include replicate samples in as many participants as are included here. This study includes multiple samples from the same individual, separated in time by as much as 5 yr. Expression patterns that persist across these samples are more likely to represent true stable phenotypes of the individual, than are those based on single, one-time measurements. Genes and exons showing variation in expression across the population, yet remaining consistent within the individual over years are likely to be enriched in useful expression biomarkers of risk factors or disease, compared with other genes. Furthermore, such genes and exons may be more likely to be associated with genetic factors (such as expression single nucleotide polymorphisms), than are genes having greater within-individual variation.
We showed that some of genes or exons showing variation in expression across our study sample can be used to distinguish individuals, suggesting that microarray expression data alone provide a personally identifiable fingerprint. In our study, only a tiny fraction of all exons distinguished individuals perfectly. This finding may prompt consideration of the identifiability of individuals within public microarray databases and whether safeguards are needed to protect their privacy. Conversely, we also provided result on stable and robust markers that may help researchers to calibrate their gene expression results. Calibration has been one of the major issues in gene expression analysis. We showed that conventional calibration genes, such as ACTB, may not be reliable.
We believe fingerprinting genes are useful in two contexts. First, in quality control of high-throughput assays, the identity of samples is sometimes questioned. Estimates of sample mix-ups often range up to 18% (79). If left unaddressed, this can introduce errors in the analysis and may possibly lead to the weakened or incorrect conclusion (47). Indeed some mix-ups were detected in the current study by aligning predicted sex based on Y-chromosome expression with that recorded in the database for the subject. When multiple samples from the same individuals are assayed, analysis of fingerprinting gene expression levels can be used to further identify mislabeled samples by clustering of such genes. The second context would be in searching for eQTLs (expression quantitative trait loci). A quantitative trait should be tightly coupled to the genome and recognizable regardless of when or in what tissue the gene expression level is measured. The set of fingerprinting genes are here shown to be stable within individual (in the small number of tissues tested) and over time (since the LCL cells were derived from an earlier blood draw, compared with the PAX and PBMC samples) and are thus good candidate quantitative traits. In searching for eQTLs, i.e., loci in the genome associated with quantitative traits, the fingerprinting genes, should be an excellent place to start. It has previously been noted that some genes are expressed in a bimodal fashion in the population (e.g., ACTB) and that a disproportionately large number of such genes have associations to disease (48). Many of our fingerprinting genes appear to express bimodally. Thus, it is reasonable to hypothesize that our fingerprinting genes might also contain a large fraction of genes (e.g., the HLA genes) related to disease or disease propensities.
Our study considered only blood-derived RNA sources, because this is one source likely to be widely available in a large population-based study. Although a desired tissue, such as brain in stroke patients, may be inaccessible, one can sometimes use blood as a surrogate, provided the relevant transcripts are similarly expressed in blood and brain. In certain situations (e.g., angioplasty, heart transplant, or coronary artery bypass graft surgery), it may be possible to obtain paired blood and heart tissue samples, from which the relevant transcripts expressed similarly in both can be determined. Accumulating such information will ultimately make blood-derived expression data in population-based studies more valuable in the future.
A larger sample size would have improved our power for biomarker discovery. The relatively narrow age range in this study likely prevented detection of extensive associations with age. In addition, analysis of many complex traits influenced by multiple genes each having modest effects (29) will require larger sample sizes. Larger sample size (or combining results of many studies) would have the additional benefit of further characterizing the measurement platform. The Affymetrix Exon array has ∼1.4 million probe sets, of which only about one-fourth were analyzed here. These probe sets were used because they correspond to well-annotated transcripts and have good performance characteristics. Many of the remaining probe sets have unknown performance characteristics or correspond to unannotated regions of the genome or to weakly annotated genes. Pooling experience from the growing number of published results on this platform will allow us to more sharply focus on the better-performing probe sets, while the general improvement of genome annotation will make other probe sets more useful in the future.
Although our pilot study was small and not intended for biomarker discovery, we were able to confirm associations of expression with lipid levels in two previously implicated genes, FADS1 and LDLR. While the observed effects were small, the magnitude, direction, and significance were consistent in PAX and PBMC samples, but not in LCL. This, again, suggests that LCL samples are less appropriate for detecting signatures related to health-related traits. The ability of even a small study to confirm associations with these well-established lipid-controlling genes lends optimism that more associations would be detected in a larger study, using either PAX or PBMC. Based on the results of this pilot, the larger, population-based SABRe in CVD Initiative will be using PAX as its RNA source and the Affymetrix Exon array platform. Completion of data collection is anticipated in late 2011.
The National Heart, Lung, and Blood Institute's (NHLBI's) FHS is supported by National Institutes of Health Grant NO1-HC-25195. The SABRe CVD Initiative is funded by the Division of Intramural Research, NHLBI, Bethesda, MD.
The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.
No conflicts of interest, financial or otherwise, are declared by the author(s).
D. L and P. J. M. designed, directed, and supervised the experiment. D. L. was responsible for funding of the project. R. J. and P. J. M. drafted the manuscript. P. J. M., D. L., A. D. J., and C. J. O. revised and edited the manuscript. R. J. and P. J. M. performed the statistical analysis. J. J. B. performed S10 normalization of the data. N. R., P. L., and K. A. W. collected the data. All authors have read and approved the final version of the manuscript.
↵1 The online version of this article contains supplemental material.