Large-scale gene expression studies provide significant insight into genes differentially regulated in disease processes such as cancer. However, these investigations offer limited understanding of multisystem, multicellular diseases such as atherosclerosis. A systems biology approach that accounts for gene interactions, incorporates nontranscriptionally regulated genes, and integrates prior knowledge offers many advantages. We performed a comprehensive gene level assessment of coronary atherosclerosis using 51 coronary artery segments isolated from the explanted hearts of 22 cardiac transplant patients. After histological grading of vascular segments according to American Heart Association guidelines, isolated RNA was hybridized onto a customized 22-K oligonucleotide microarray, and significance analysis of microarrays and gene ontology analyses were performed to identify significant gene expression profiles. Our studies revealed that loss of differentiated smooth muscle cell gene expression is the primary expression signature of disease progression in atherosclerosis. Furthermore, we provide insight into the severe form of coronary artery disease associated with diabetes, reporting an overabundance of immune and inflammatory signals in diabetics. We present a novel approach to pathway development based on connectivity, determined by language parsing of the published literature, and ranking, determined by the significance of differentially regulated genes in the network. In doing this, we identify highly connected “nexus” genes that are attractive candidates for therapeutic targeting and followup studies. Our use of pathway techniques to study atherosclerosis as an integrated network of gene interactions expands on traditional microarray analysis methods and emphasizes the significant advantages of a systems-based approach to analyzing complex disease.
- systems biology
- gene expression profiling
- cardiovascular disease
- coronary arterial disease
the development of microarray technology has grown from modest beginnings to the present day, where the ability to expression profile whole genomes is routine. However, high-throughput gene expression profiling presents a unique difficulty in the need to identify and distinguish significant changes in gene expression from among the tens of thousands of genes that can be assayed simultaneously. The challenge of multiple testing has been met by many statisticians, but two approaches have become standard: significance analysis of microarrays (SAM) and hierarchical clustering. SAM determines statistically significant changes in gene expression by applying a modified t-test (54), controlling false discovery through a permutation technique. Hierarchical clustering applies statistical algorithms to group genes according to the similarity in gene expression pattern (10), where “similarity” is commonly defined by Euclidean distance or a correlation coefficient. Although these approaches are useful in identifying and distinguishing sets of statistically significant genes from the thousands of other genes on an array, they do not lend structure or integration to the results. Hierarchical clustering has been used as a pathway discovery tool (changes in expression of genes in activated networks are expected to correlate) (21), and there has been interest in extending this approach to model topological and dynamic properties of the networks that control the behavior of the cell, especially in organisms such as Escherichia coli (3, 25, 28, 32). Although successful in lower-order organisms, there has been limited application of such approaches to human disease. Furthermore, the pathway-based approach we present here can take into account prior knowledge by expanding the context beyond the genes and gene changes in the current experiment (14).
Atherosclerosis, an inflammatory disease stemming from genetic and environmental factors, is the primary disease of coronary arteries (27, 38, 41, 42). It is the number one killer in the United States (1) and every year claims more lives than the next five leading causes of death combined (1). The role of genetics in atherosclerosis pathophysiology has been recognized for some time: inheritance of risk factors was first shown in classical twin (11, 18, 20, 35) and family history studies (40, 56). Diabetes (4, 57), hypercholesterolemia (26), hypertension (57), obesity (7), smoking (57), and physical inactivity (2) are also known risk factors for the disease. Although interventional cardiology procedures such as balloon angioplasty, stenting (39), and atherectomy (44) have been successful in combating local coronary artery disease (CAD), this has not been met by equivalent success in interrupting the underlying disease at the molecular level (pleiotrophic effects of statins and aspirin notwithstanding). There is, for example, no treatment currently on the market designed to target the molecular interactions of the disease process itself.
Traditional approaches to the molecular study of atherosclerosis target one molecule or gene thought to have an important role in the development of the disease and then manipulate it through knockout, knockdown, or transgenic technology (13, 34). Although much has been learned about atherosclerosis through these studies, there remains a lack of understanding of the disease as an integrated whole. With this in mind, we undertook a comprehensive gene level assessment of coronary vascular disease taking a network, or pathway-based, approach to analysis. A customized microarray platform was designed to assay the expression profiles of vessels from human hearts explanted at the time of orthotopic heart transplant. Statistical tools mined the resulting data for differentially regulated genes, and further study was conducted with pathway generation and network analysis tools developed specifically for these data. Our findings have immediate significance for the investigation of pharmacological approaches to the prevention and interruption of atherosclerosis.
MATERIALS AND METHODS
Development of the Array Platform
In developing the array platform in an era that preceded “whole genome” arrays, our intention was to include the widest array of possible genes expressed in the cardiovascular system. To do this, we used a strategy that included sequencing clones from stimulated vascular cells in culture (to provide potentially unknown genes), searching the literature and databases, and combination with a substantial commercial clone set (Incyte).
Stanford clone set.
Human aortic smooth muscle cells (HASMCs) and human aortic endothelial cells (HAECs) (Clonetics; San Diego, CA) were serum starved and stimulated separately with 10 ng/ml TNF-α (R&D Systems; Minneapolis, MN). HASMCs were also stimulated with 3 ng/ml transforming growth factor (TGF)-β (R&D Systems) and 20 ng/ml PDGF-BB (R&D Systems). Cells collected at 30-min, 3-h, and 24-h time points were pooled for poly(A)+ RNA isolation, and suppression subtraction was performed as previously described (17). A total of 6,954 cDNAs was cloned into plasmid, miniprepped, sequenced, and matched to GenBank accession numbers, which were collapsed into UniGene clusters and annotated using RefSeq where possible.
A team of six investigators compiled lists of genes relevant to the cardiovascular system under subheadings that included “atherosclerosis,” “smooth muscle cell,” “endothelial cell,” “apoptosis,” “cytokine,” and “adhesion molecule.” Genes were drawn from PubMed and interest groups such as the Cytokine Family cDNA Database (http://cytokine.medic.kumamoto-u.ac.jp/). Genes were then identified within the National Institutes of Health (NIH) UniGene database with preference given to genes curated in RefSeq.
The Stanford clone set and search-output gene lists were then collated. All sequences were checked for quality with Phred and Phrap software and then masked for repeats and vectors (Repeat Masker: http://repeatmasker.genome.washington.edu/). Sequences were then compared using the BLAST algorithm on a local Linux server to the most up-to-date UniGene database. Hits with E scores < 20 were discarded. Hits on the curated RefSeq database were preferred. In the case of no hits, long expressed sequence tags (ESTs) with good quality were kept as sequences. After the literature and Stanford clone lists were combined, redundancy checking was carried out. Finally, an algorithm was written to check for splice variants. Our approach to splice variants was to use the consensus sequence wherever possible. To generate the 22-K feature array used for this study, 60-mer oligonucleotide sequences from the Agilent human 1A (V2) and human 1B arrays (Agilent; Palo Alto, CA) were used in cases where there was a match to sequences from the collated and filtered Stanford clone set and search-output gene list. In cases where gene sequences from the Stanford list were not represented on either of these arrays, Agilent's custom microarray probe design algorithms were applied to design custom 60-mer probes. The resulting customized array was printed using inkjet technology.
Major epicardial coronary arteries were carefully dissected from the explanted hearts of 22 patients undergoing orthotopic heart transplantation. Investigators were on call with the surgical team and collected the organ at the time of explant. After dissection, which occurred immediately after explant, vessels were dissected longitudinally to expose the endoluminal surface. Lesions were identified and scored by inspection through a dissecting microscope. Arteries were then divided into 1.0- to 2.0-cm segments macroscopically designated as disease or disease-free segments. Midportions of each segment were separated and stored both as frozen sections (Optimal Cutting Temperature) and in formalin for later embedding, sectioning, and staining. Both Institutional Review Board protocol approval and informed consent from transplant patients were obtained for this study.
Histology and Immunohistochemistry
Cryostat sections of 4 μm were fixed in acetone for 2 min, stained with hematoxylin for 30 s, stained with eosin for 5 s, dehydrated in graded alcohols, and coverslipped with permanent mounting solution after being cleared with xylene.
Immunohistochemistry was carried out using the following primary antibodies: CD14 (CBL453, Chemicon; Temecula, CA), α-actinin (MAB 1682, Chemicon), and IL-2 receptor-α (AB9496, Abcam; Cambridge MA). Frozen sections as described above were rehydrated with distilled water and then blocked with peroxidase (3%) and protein (PK-6102, Vectastain ABC Kit, Vector Labs; Burlingame, CA). Primary antibodies were titrated at 1:500 and 1:1,000 dilution at 200 μl/slide for 1 h at room temperature. After slides were washed, the secondary antibody (PK-6102, Vectastain ABC Kit) was added for 30 min at room temperature. Horseradish peroxidase (PK-6102, Vectastain ABC Kit) was added for 30 min at room temperature followed by diaminobenzidine (DAB) substrate for 5 min. Slides were counterstained with hematoxylin and dehydrated with graded alcohol.
American Heart Association Classification
In 32 samples out of our total 51 hybridized samples, we obtained histological grading according to American Heart Association (AHA) classification guidelines (50–52). Type I or initial lesions are recognized by an increase in the number of intimal macrophages and the appearance of lipid droplet-filled macrophages, or foam cells. Type III lesions are intermediate lesions that appear rich in lipid-laden cells and contain scattered collections of extracellular lipid droplets and particles that disrupt the coherence of some intimal smooth muscle cells. Type V or advanced lesions are rich in fibrous connective tissue. They usually contain a lipid core and thick layers of connective tissue. The prototypical lesion is designated as fibroatheroma or a type Va lesion. Advanced lesions with prominent calcification are designated as type Vb. Advanced lesions with little accumulated lipid, minimal or absent calcification, and a predominance of connective tissue are designated type Vc. We did not find samples with lesion types II, IV, or VI, and only two samples were classified as “no disease” (see Table 2) (51, 52).
RNA Isolation and Quantification
Total RNA was isolated from segments of coronary artery using a combination of TRIzol reagent (Invitrogen; Carlsbad, CA) and RNeasy Mini kit (Qiagen; Valencia, CA) techniques. One-centimeter-long coronary artery segments were flash frozen in liquid nitrogen and then placed into a spring-loaded hammer mechanism designed to crush frozen tissue into a fine powder by freeze fracture. The powder was collected, and TRIzol reagent was added before tissue homogenization with a 7 × 120-mm generator (Pro Scientific; Oxford, CT) attached to a fixed motor. After complete disruption of the tissue, chloroform was added, and the sample was shaken thoroughly before a brief incubation at room temperature. The samples were then centrifuged, and the supernatant (containing the RNA) was carefully removed without disturbing the cellular debris below it. The supernatant was placed into a fresh tube, and a mixture of Buffer RLT (Qiagen) and β-mercaptoethanol as well as 100% ethanol was added. The resulting solution was centrifuged through RNeasy Mini columns, allowing for binding of the RNA to the column membrane. Afterward, contaminants were removed by repeated wash spins of the membrane. The RNA was then eluted with RNase-free water into a new microcentrifuge tube.
Extensive quality control testing of all RNA was done before hybridization to the array. Samples were quantitated with a NanoDrop ND-1000 Spectrophotometer (NanoDrop; Wilmington, DE), which is able to detect small amounts of RNA with a high degree of sensitivity. RNA integrity was also assessed using Agilent's 2100 Bioanalyzer System and RNA 6000 Pico LabChip Kit. The Bioanalyzer is a microfluidics-based system that allows for rapid visualization of RNA quality and quantity even with extremely small amounts of RNA. The system detects biomolecules intercalated with fluorescent dye by laser-induced fluorescence.
Direct Labeling and Oligonucleotide Array Hybridization
Agilent's array technology involves the use of dual-dye channels, one for sample RNA and the other for reference RNA. By pilot testing for maximum yield of positive reference data, it was determined that the best reference RNA was a mixture of 80% HeLa RNA and 20% human umbilical vein endothelial cell (HUVEC) RNA. Ten micrograms of RNA for both the sample and reference were used for hybridization. Reference and sample RNA were labeled separately and then combined later during the hybridization process. A DNA primer was annealed to both sample and reference RNA during a brief incubation period. Fluorescent-labeled cDNA was then reverse transcribed from the RNA using SuperScript II (Invitrogen) and either cyanine-3-dCTP or cyanine-5-dCTP (Perkin-Elmer/NEN; Boston, MA). In our case, reference RNA was labeled with cyanine-3-dCTP, and sample RNA was labeled with cyanine-5-dCTP. After reverse transcription, any remaining RNA was degraded by adding RNase A (Amersham; Piscataway, NJ). Each labeled sample cDNA was then combined with its corresponding labeled reference cDNA and purified through QIAquick PCR Purification Kit spin columns (Qiagen) to remove unincorporated dye-labeled nucleotides. After a wash spin to remove contaminants, labeled cDNA (from both initial sample and reference RNA) was eluted into a new microcentrifuge tube. After the addition of 10× control targets (Agilent) and 2× hybridization buffer (Agilent), the resulting hybridization solution was dispensed on to a gasket slide (Agilent), and the array was carefully lowered on top. The sandwiched slides were placed into SureHyb hybridization chambers (Agilent), which were hand tightened to ensure a tight seal between the gasket slide and the array. The arrays were incubated for 16 h in a specialized 60°C hybridization oven, which kept the hybridization chambers in constant rotation so that the hybridization solution would be in continuous moving contact with the entire surface of the array. After incubation, the arrays were removed from their chambers, separated from their gasket slides, and washed in two different wash buffers before being spin dried in a centrifuge and scanned.
Scanning and Feature Extraction of Arrays
All arrays were scanned with Agilent's G2565AA Microarray Scanner System. The scanner uses two lasers, a single harmonic generator-yttrium aluminum garnet laser and a helium-neon laser, to excite and measure fluorescence from the cyanine-3- and cyanine-5-labeled cDNA.
Feature extraction software (Agilent) calculates log ratios and P values for valid features on each array and provides a confidence measure of gene differential expression by performing outlier removal, background subtraction, and dye normalization for each feature. The software filters features that are not positive and significant with respect to background or features that are saturated. It then fits a normalization curve across the array using the locally weighted linear regression curve fit (LOWESS) algorithm to detect and correct dye bias.
Feature values excluded according to the above criteria were estimated using the k-nearest neighbor (KNN) imputation method of Troyanskaya et al. (53). Imputed values were used when missing values constituted <30% of row features. Otherwise, the row mean was used.
Significance analysis of microarrays.
Our primary difference measure was SAM (54). For each gene, the algorithm computes a t statistic (di) with the denominator modified by the addition of a small positive constant to moderate independence of variance and gene expression level: where di is the SAM score for gene i (referred to as the “d score” in the text), x̄i-post is the mean expression level of gene i in group post, x̄i-pre is the mean expression level of gene i in group pre, Si is the standard deviation for the numerator calculation, and S0 is a small positive constant. Genes are ranked according to this statistic, and error is controlled by a permutation procedure. After random permutation of values between groups within each gene, resulting SAM scores are ranked across the data set, and the process is repeated several hundred times. The null distribution is then represented by the average SAM score across these permutations for a given rank position.
Heatmaps were generated by Stanford software developed by the authors and freely available to the academic community (http://mozart.stanford.edu/heatmap.htm). Heatmaps are presented as row normalized (maximum and minimum values are calculated for each gene). If xij is the expression level for gene i = 1, 2,. . . imax and sample j = 1, 2,. . . jmax, and a particular red-green-blue color, color m, is divided into kmax number of shades (m1, m2,. . . mkmax), where kmax is the user-defined number of bins, then xij is assigned a color (denoted color xij) according to the following algorithm: for k = 1, 2, . . ., kmax.
If xij lies in the interval then color xij = mk.
Gene ontology and curated pathway analysis.
Each list of differentially expressed genes was analyzed in the context of gene ontology (GO) to identify groups of genes with similar functions or processes. GO annotation for genes represented on the Stanford-Agilent cardiovascular oligonucleotide microarray was obtained using the Biomolecule Naming Service (BNS) (23). BNS is a high-speed directory service developed at Agilent, which can resolve differences between alias and official gene symbols as well as between different gene identifier schemes, and links to publicly available databases. Gene g was called linked to a GO term, term t, if GO annotation for gene g contained term t or a child of term t. For each GO term t, we tested the hypothesis that term t is over- or underrepresented in a given list of genes against the null hypothesis that distribution of terms is random. This term to gene list association was measured using hypergeometric distribution as the probability of observing k or more genes annotated by term t in the set of n genes, when there are K genes annotated by term t in the whole set of N genes. Namely, for each term t and list of genes l, the P value was calculated as follows: where N is the total number of unique genes represented in the microarray, n is the total number of unique genes represented in set l, K is the total number of entries in N genes mapping to GO term t, and k is the number of entries in n genes mapping to a specific GO term.
Also, for an ordered list of genes of length L, we computed minimal hypergeometric statistics where (n) is the number of genes annotated by term t in the top n genes in list l.
A similar algorithm was used to identify curated pathways that were significantly overrepresented in our data. In this case, significant over- or underrepresentation of differentially regulated genes was determined in a master list of genes derived from the Kyoto Encyclopedia of Genes and Genomes (KEGG) (22). The KEGG database is a collection of graphical diagrams (pathway maps) representing molecular interaction networks in various cellular processes. Each reference pathway is manually drawn and updated with a standard notation. There is a strong representation of metabolic pathways.
Connectivity analysis and visualization.
We used a metasearch tool, as described in Vailaya et al. (55), for automatically querying multiple text-based search engines to aid biologists with the task of manually searching and extracting associations among genes/proteins of interest. The tool launches user queries to multiple search engines, and the retrieved results (documents) are fetched from their respective sources. Each document is then parsed into sentences. An association is extracted for every sentence containing at least two gene names and one verb. Associations are then converted into interactions, which are further grouped into a computational network representation. Sentences and source hyperlinks for each association are further stored as attributes of the corresponding interactions.
In this way, a large association network was constructed from over 350,000 PubMed abstracts by running a query for each microarray gene and retrieving the 100 most recent articles. The process retrieved ∼150,000 PubMed articles. Automated queries were also run for 16 well-known immune diseases (such as atherosclerosis, rheumatoid arthritis, lupus, Type I diabetes, etc.). The process retrieved around 200,000 additional PubMed abstracts. We used user context files and BNS (23) for identifying interaction verbs and gene/protein names in text. The entire list of 350,000 abstracts yielded a large association network consisting of ∼5,200 concepts (genes/proteins) and 19,200 associations among them.
A set of discriminatory genes at a given false discovery rate (FDR) was constructed using SAM analysis (see above). A subnetwork was extracted for each gene, from the large association network, consisting only of genes in the discriminatory set. A cumulative SAM d score was computed by summing the d scores of all other genes in the set of interactions that contained the candidate gene as a participant. This score allows an immediate sense of the change significance of a given network within the analysis of interest. To control for the disproportionate effect of genes with very large networks, we also calculated an average score (the cumulative score divided by the number of connections). In this way, networks with high overall significance could be identified. We refer to the genes at the center of these networks as “nexus” genes as they potentially regulate a large number of genes causing the disease phenotype. We chose the term nexus to distinguish it from the concept of a “hub” gene. According to this idea (5), certain genes have highly correlated gene expression and exist in scale-free topology networks. Although our networks also conform to scale-free topology, no such assumption of correlated gene expression is made for the nexus gene and its network partners. Indeed, it is the fact that the nexus gene has an experimentally based functional relationship with its connected genes rather than a simple gene expression correlation that is the key to its distinction from a hub gene. What these concepts do share, however, is that both hub and nexus genes are attractive candidates for therapeutic targeting and followup studies.
The extracted subnets for the genes of interest were visualized using Cytoscape to display the network diagrammatically. Cytoscape is an open-source bioinformatics software platform for visualizing molecular interaction networks and integrating these interactions with gene expression profiles and other data (47). Each connection was validated manually. With the use of data overlay techniques, SAM d scores for genes in the network are superimposed, via color encoding, upon the nodes of the diagram that represent genes in the network. Additionally, for each node in the network diagram, the expression values for the gene represented by that node are superimposed as a “heatstrip” visualization beneath the node. In the heatstrip visualization, a rectangular area below the node is divided into a set of vertical strips of equal width, with each strip containing a color-coded vertical bar. The width of each bar is equal to the width of the rectangular display area, in pixels, divided by the number of columns in the corresponding heatmap. The vertical bars extend either upward or downward from an imaginary centerline that bisects the rectangular area. Upregulated values are encoded as bars that extend upward from the centerline, whereas downregulated values are encoded as bars that descend downward from the centerline. The bars are color coded to distinguish between experimental classes.
Fifty-one coronary artery samples obtained from twenty-two patients were hybridized to the array. The distribution of cardiovascular risk factors in each patient can be found in Table 1. In particular, 11 (50%) patients presented with hypercholesterolemia, 6 (27%) with diabetes, and 12 (55%) with history of hypertension. With respect to drug therapy, 10 (45%) patients were under treatment with angiotensin-converting enzyme inhibitors, 7 (32%) with angiotensin II receptor blockers, 12 (55%) with β-blockers, and 10 (45%) with statins. None of the patients was implanted with a left ventricular assist device.
Histological grading according to AHA criteria for histopathological grading of atherosclerosis was obtained for 32 of the collected coronary artery samples (Table 2). Fifteen (47%) samples were classified as advanced lesions (grades Va, Vb, and Vc), eight (25%) as intermediate lesions (grade III), and seven (22%) as initial lesions (grade I). Only two (6%) samples were classified as disease free. Representative images of the different disease states can be seen in Fig. 1, C–F.
Differential Gene Expression According to Vascular Disease
Our first set of analyses focused on identifying differential gene expression among various disease severity classes. In these analyses, the two samples classified as disease free, one grade III sample, and two grade V samples were excluded because the RNA obtained did not meet our quality standards for hybridization. The lack of entirely normal arteries, such as might be obtained from young donor hearts, serves as a limitation to our discovery of early stage candidate genes.
We first analyzed gene expression according to histological grading. A grade I (7 samples) versus grade V (13 samples) SAM analysis revealed 168 differentially regulated genes with a FDR of 0.4% (Fig. 1A). Seventy-two genes were upregulated in grade V advanced lesions. Specifically, inflammatory genes such as chemokine (C-C motif) receptor-like 2, chemokine-like factor 7, and myeloid differentiation primary response gene showed the highest d scores (d ≥ 4). We also found cell cycle regulatory genes such as TBC1 domain family member 2, the most significantly upregulated gene in advanced grade V lesions (d score ≥ 4), complement 32, and branched chain aminotransferase 1, cytosolic (d scores ≥ 3), and genes that regulate lipid metabolism such as human prostaglandin endoperoxide synthase, dolichyl-phosphate N-acetylglucosaminephosphotransferase 1, and hexosaminidase B (d scores ≥ 3). GO analysis confirmed these results by indicating “mitotic cell cycle” and “metabolism” (both overrepresented in the top 415 genes with P values ≤ 0.001) as significant processes in grade V advanced stage lesions (Table 3). In initial stage lesions, genes that regulate cytoskeleton organization such as α1-actinin, muscle-specific genes such as smoothelin, sarcoglycan, and vinculin, and genes that regulate cell growth such as TGF-3, endometrial bleeding-associated factor, fibroblast growth factor 1 (acidic), and myocardin were found to be upregulated with d scores greater than or equal to −3. Once again, GO analysis was consistent with these results revealing terms such as “morphogenesis” and “muscle development” (both overrepresented in the top 336 genes with P values ≤ 0.001) to be the most significant biological processes in grade I initial stage lesions (Table 3).
A comparison of AHA grade III and V samples revealed 169 differentially regulated genes (Fig. 1B). Almost all genes found were downregulated in the advanced lesion sample class except for tropomyosin 3 and laminin-γ2, which were downregulated in the intermediate lesion class. Of the 167 genes downregulated in the advanced lesion class, the most significant (d scores greater than or equal to −3) were genes that regulated cell proliferation and growth, the apoptotic process [integrin-linked kinase 2 and insulin-like growth factor (IGF) binding protein 3] and cell-to-cell and cell-to-matrix interactions (collagen type XVIII-α1 and thrombospondin 2). GO analysis of downregulated genes in advanced stage grade V lesions showed “regulation of cell growth” and “cell adhesion” (both overrepresented in the top 162 genes with P values ≤ 0.001) as the most significant biological processes and “structural molecule activity” and “IGF binding” (both overrepresented in the top 174 genes with P values ≤ 0.001) as the most significant molecular function terms regulated by these genes (Table 4).
In contrast, an analysis of grade I versus grade III samples yielded no significant differences in gene expression, with very small numbers of genes at a high FDR. Our findings indicate that the microscopic progression of disease severity in atherosclerosis is not linearly related with the expression profile; instead, grade I and III lesions are more similar to one another than either is alone compared with grade V lesions. This finding is consistent with the fact that the presence of lipid pools (rather than cell type changes) is the distinguishing feature of grade III disease.
Differential Gene Expression According to Clinical Variables
Comparisons of samples from patients known to suffer from clinical CAD with those from patients with no clinically manifest CAD (24 vs. 27 samples) revealed only minor significant differences in gene expression between the classes (a 5% FDR was associated with only 77 differentially regulated genes). The lack of significant expression patterns that could predict clinical manifestation in the face of ubiquitous disease may be explained by the heterogeneity of our clinical group (predominant stable flow obstructing CAD patients were not differentiated from those with predominant acute coronary syndrome presentation). Similarly, samples from patients with a history of hypertension were not different from those without such a history (10 vs. 41 samples, 11 differentially regulated genes with an FDR of 18% as the lowest possible value). Again, the lack of proximate hypertension by definition in transplant candidates (as opposed to the categorization by history) is likely explanatory, masking effects that may have been revealed at an earlier time point.
Diabetes is known to be a strong risk factor for CAD, but the disease mechanism is not well understood. We were interested in studying the diabetes signal in our data and conducted an analysis comparing samples from diabetic patients with those from patients without diabetes. Prior analysis (not shown) demonstrated strikingly similar findings across all disease severity classes, so only the combined data are presented here. Nineteen diabetic samples were compared with thirty-two nondiabetic samples. Our analysis revealed 653 upregulated genes in the no diabetes class and 37 upregulated genes in the diabetes class with an FDR of 0.08% (Fig. 2). Among the genes significantly upregulated in the diabetes class, IGF-1 was the top gene, with a d score of 5.3. IL-1 receptor (fibroblast type) and IL-2 receptor-α were also significantly upregulated (d ≥ 3) and are genes that mediate cytokine-induced immune and inflammatory responses. In addition, CD14 antigen, a mediator of the innate immune response that leads to NF-κB pathway activation and inflammation, was found to be upregulated (d ≥ 3). The presence of these genes suggests that inflammation is more prominent in diabetic CAD than in nondiabetic CAD. Importantly, GO analysis confirmed this finding with processes such as “immune response” and “humoral defense mechanism” (both overrepresented in the top 271 genes with P values ≤ 0.001) as some of the most significant terms for the diabetes class (Table 4).
Protein Validation and Cellular Origin of Transcripts of Interest
A potential limitation of expression profiling of whole tissue segments is the lack of insight into the cell type of message origin. Although techniques such as laser capture microscopy are beginning to provide one answer to this question, a more detailed answer for specific targets is provided by immunohistochemistry. This has the dual benefit of validating the presence of the ultimate product of transcription, the protein, while at the same time localizing it to cell type. We carried out immunohistochemistry for a series of targets to illustrate the utility of this approach (Fig. 3). In many cases, the cell type is predictable from previous knowledge. For example, α-actinin is a component of the thin filament of smooth muscle (31). As such, it is not surprising that to see clear staining of smooth muscle cells in the media and intima of two diseased human arteries (Fig. 3, A and B). Similarly, the role of CD14 in lipopolysaccharide-dependent macrophage activation is known (9), and the presence of CD14 in typical cells in atheroma as well as in subendothelial cells in early disease fatty streak lesions is consistent with this (Fig. 3, C and D). However, immunohistochemistry can also provide important information on cellular localization when there is little prior data to guide expectation or when expectations are confounded. The α-subunit of IL receptor 2 (CD25) might be expected to be found on T lymphocytes migrating into the vascular wall and, indeed, an example of this can be seen in Fig. 3E. However, an extension of staining to endothelial cells is also seen (Fig. 3F), which has not been previously described. Certainly, endothelial cells are increasingly recognized to play an important role in innate immune processes (15) and antigen presentation (43), making this an intriguing finding worthy of further study.
Pathway Identification by Overabundance
Looking for overabundance of differentially regulated genes in curated pathways can add structure to genomic data. We carried out overabundance analysis according to the hypergeometric distribution for disease severity and diabetes groups of genes for pathways within KEGG. This database features a large number of metabolic pathways. We identified several as overrepresented in our data. Within the disease severity group (genes differentially regulated between AHA groups I and V), the cell cycle pathway was the most significantly overrepresented (Table 5), consistent with an actively changing cell phenotype and providing further support for the idea that smooth muscle dedifferentiation is a key process in disease progression. This pathway is shown in Fig. 4 with differentially regulated genes highlighted. Within the diabetes group, ubiquitin-mediated proteolysis is the most overrepresented pathway, with components of the proteosome also highly ranked.
Pathway Discovery by Connectivity Analysis
Although the identification of overrepresented curated pathways offers structure to genomic data and some insight into the significance of differentially regulated genes, new information is limited to already well-recognized pathways. In addition, the binary nature of the hypergeometric statistical test where a gene is either “present” or “absent” in the pathway does not allow for the uncertainty present in any significance call and gives overdue prominence to the choice of the FDR used to determine presence or absence. To this end, we developed a tool that creates a master network of multiple connections drawn from language parsing of the scientific literature and then estimates the extent to which connected genes are differentially regulated in a particular data set, expressing this as a numerical score. This not only allows novel connections and networks to be recognized but also provides a sense of the significance of each network while at the same time allowing the inclusion of genes that may be regulated at a posttranslational level.
Using a SAM significant gene list derived from our disease severity analysis, we generated connectivity networks for every gene. Networks can then be ranked according to d score, cumulative network d score, or average network d score in an interactive spreadsheet. At this point, networks of interest are mapped using the alfa network viewer (55) to create a live interactive network in which all literature references can be validated. For reproduction in print, we developed a module using Cytoscape architecture (Figs. 5 and 6) (47). Nodes are colored from a gradient of bright red to black to bright green, with bright red nodes representing high d score genes in the analysis (in this case, genes upregulated in grade V lesions) and bright green nodes representing low d score genes (those downregulated in grade V lesions). Black nodes represent genes with no significant differential gene expression in either class. Beneath each node is a heatstrip corresponding to the representative row from the heatmap of the analysis. In this analysis, the blue bars represent expression of the class V samples, and the brown bars represent expression of the class I samples. The height and direction of the bars represent the magnitude of the log ratio and whether it is positive or negative. This format allows for simultaneous visualization of multiple interactions, significance levels of gene expression changes, and raw data. We labeled these genes as nexus genes to indicate their fundamental role within a generated network (and to differentiate them from hub genes, as discussed above).
Networks generated from genes differentially regulated between AHA class I and class V disease also featured cell cycle and immune connections (Table 6). For clarity, we have only shown networks where more than four connections were discovered. Networks are ranked by average d score. Two key networks are illustrated in Fig. 5. Cyclin B1 is a cell cycle protein essential for the control at the G2/M transition. It was found to be coordinately regulated with topoisomerase II-α (a DNA topoisomerase) and inverse coordinately regulated with matrix metalloproteinases (MMPs) such as MMP-1, MMP-8, and MMP-9 (33). We also show a subnetwork likely active in T lymphocytes present in the vessel wall. IL-27 (IL17d) is known to support proliferation of naive CD4 T cells and enhance interferon-γ production by activated T cells and natural killer cells (16, 30). Furthermore, it induces phosphorylation of signal transducer and activator of transduction (STAT)1 and STAT3 in both human and murine cell lines, providing an excellent example of a posttranslational relationship that could not have been appreciated from RNA expression data alone.
Diabetes, inflammation, and immunity.
A similar connectivity analysis was carried out for diabetes analysis using genes identified as significant by SAM statistics (Table 7 and Fig. 6). In this case, bright red nodes indicate genes found to be upregulated in diabetics, and bright green nodes denote downregulated genes in diabetics. In the heatstrip below each node, brown bars represent expression of the diabetes class, and blue bars represent expression of the no diabetes class. Genes with the highest cumulative d scores were interferon-γ (248 connections, nonsignificant expression change, cumulative d score = 94.7) and IL-6 (196 connections, d score = 2.34, cumulative d score = 86.1). Notably, however, the size of the network means that the average d score for these nexus genes is low. Interestingly, genes with high average d score ratings (Table 7) include a large number clearly related to diabetes (for example, insulin receptor, glucose-related protein, insulin-degrading enzyme, IGF receptor 1, IGF-2, and IGF binding protein 3).
We employed a pathway-based approach toward the study of atherosclerosis by applying network development tools to a large, complex data set derived from human coronary artery tissue hybridization experiments. Significant genes were studied in the broader context of ontology and by mapping onto known and novel pathways. Discovered connections revealed insights into biology not drawn from basic gene lists alone; instead, highly connected nexus genes were identified and placed in the integrated context of the disease whole.
Analysis of histopathology graded samples according to national guidelines revealed that many genes expressed at a relatively high level in early disease were important marker genes of smooth muscle cell differentiation, regulation, or activation (49). Furthermore, many of these genes were categorized under ontology terms such as muscle development, actin cytoskeleton, and actin filament. Overrepresentation of cell cycle pathway components (KEGG pathway) also suggested active cellular differentiation processes were present, and our connectedness ranking (which adds a weighting according to the “significance” of connected gene expression) found cell cycle and immune networks most prominently weighted. Immunohistochemistry of selected proteins confirmed that the cell of origin of prominently downregulated genes such as α-actinin was indeed the smooth muscle cell.
Although many processes are known to be involved in the development of atherosclerosis, their relative significance has not been established. Our analysis, in which key smooth muscle genes and ontologies are prominent over and above those of other cells or recognized immune signals, suggests that the key process in the progression of atherosclerosis relates to smooth muscle cell dedifferentiation. Although it is possible that this observation reflects a different cellular milieu predefined by the classification system (the array could be acting simply as a sophisticated microscope), extension of the histological classification to multiple cell types, the presence of well-characterized changes in the phenotype of smooth muscle cells in culture, the lack of expression differences between classes I and III, and the breadth of potential ontologies argues against this. Indeed, the power of this approach lies in the absence of an a priori hypothesis. The strongest change in gene expression signal across disease severity might have originated from endothelial cells, macrophages, or migrating circulating immune cells. The advantage of examining the vessel wall as a whole is apparent: our analysis suggests a renewed focus on the changes in smooth muscle phenotype as a strategy for interrupting atherosclerosis at the molecular level.
Diabetes, Inflammation, and Immunity
Although atherosclerosis has been known for some time to be a disease characterized by inflammation, diabetic CAD has until this point been viewed simply as a particularly severe variant. The clinical observation of small lumen vessels with severe disease has not yet been adequately explained at the molecular level. In dividing our samples according to the diabetic status of our patients, we searched for an explanation and found that many genes upregulated in diabetic arteries fell into immune system-related categories such as immune response, defense response, response to pest/pathogen/parasite, and humoral defense mechanism. Pathway tools allowed us to develop networks that revealed relationships between genes not previously connected in this context. Although both an inflammatory state (12, 24, 46) and the innate immune system (8, 36, 37) are known to play important roles in the development of diabetes itself, no study to date has linked these ideas with the development of CAD in diabetics; a genome-scale comparison of expression in human diabetic and nondiabetic coronaries has not been reported (29). The pathway analysis presented here provides new insight into the mechanism behind the contribution of these systems to the development of the disease.
More specific insight is gained from the generated connectivity analysis. Gene networks with high average d scores included many genes obviously related to diabetes, insulin, and IGFs. Also prominent were immune mediators such as the IL-10 receptor. The strong immunity signal was most apparent, however, for the cumulative d scores. The top 30 networks included IL-1, -2, -4, -6, and -10, interferon-γ, and TNF superfamily members-1, -2, -5, and -6. Also of interest were IGF-1 and vascular endothelial growth factor.
Nexus genes function as pivotal nodes in a complex network. We hypothesize that altering the function of a nexus gene by mutation, environmental influence, or pharmaceutical targeting will have more profound consequences than for a less connected gene. As a critical validation of this concept, IL-6 has been implicated in the pathogenesis of atherosclerosis in Type 2 diabetics (36). Several lines of evidence suggest a proatherosclerotic role for IL-6: 1) oxidized LDL enhances the expression of proinflammatory cytokines such as IL-6 (26); 2) supraphysiological concentrations of exogenous IL-6 enhance atherosclerosis in the apolipoprotein E-deficient (ApoE−/−) mouse (19); and 3) plasma levels of IL-6 are enhanced in patients with unstable angina and predict the outcome of patients with acute coronary syndromes (6). However, the ApoE−/−/IL-6−/− double-knockout mouse actually demonstrates enhanced atherosclerotic lesion formation and increased serum cholesterol levels (45), suggesting a more complex picture. In fact the finding that LDL receptor−/−/IL-6−/− double-knockout mice showed little difference from control mice indicates a complex balance of pro- and antiinflammatory factors and further validates the network model approach (48).
As a nexus gene, IL-6 would be a strong candidate for therapeutic targeting and followup study. In addition, our analysis suggests several little-studied nexus genes that would be potential candidates for further scrutiny and analysis. For example, very little is known of the role of the breakpoint cluster region gene (also highly ranked) in atherosclerosis or leukemia inhibitory factor. Similarly, PubMed literature searches for the involvement of other nexus genes in atherosclerotic disease [such as IGF-1, IL-1 receptor, cyclin-dependent kinase inhibitor 2a (CDKN2a), and Tale family homeobox-induced factor (TGIF)] revealed few studies, suggesting that these genes would be strong candidates for further study in the disease modification of diabetes.
A further benefit of our approach is the ability to account for a priori knowledge (in making connections) without prejudice in the process of significant gene discovery (transcriptional profiling is “blind,” in contrast with alternative approaches, which necessitate a prescribed narrow focus by definition). A potential drawback of the connectivity scores, however, is their reliance on published literature. Science is known to manifest a sociological component, and it could be argued that more is known about certain processes, not because these are more fundamental or biologically important but because they have been studied more (for example, NIH funding for cancer research and human immunodeficiency virus is greater than for cardiovascular medicine). Although this may hold to a limited extent in relation to the literature as a whole (connections have to have been studied at least once to be present in the literature), our approach both avoids this issue and as much as it meets it, fails to support it. By weighting our analysis by the significance of differentially regulated genes, we minimize the effect of publication bias (most highly connected genes were connected to <100 other genes). Furthermore, for discovered nexus genes, we found little relationship between the number of publications and connectedness. If publication bias was indeed a factor, a linear relationship between the nexus gene connectivity score and number of publications would result. In fact, this was not the case. IL-6, with one of the highest cumulative significance scores, was connected to 1,165 publications, whereas IGF-1, a lower ranked nexus gene, had over 17,560 publications associated with it.
In summary, insight into the true nature of a disease can only be gained by combining multiple approaches to investigation. Traditional approaches to the molecular genetics of atherosclerosis have focused on manipulating single genes with transgenic technology. Although this is ultimately revealing, it is time consuming, and choosing the pathways on which to focus this mechanistic effort is challenging. To provide an answer to this, we used transcription profiling, a technique that allows simultaneous assay of tens of thousands of genes, to investigate important pathways in atherosclerosis without the requirement for an a priori focus. We applied widely recognized techniques to control false discovery and then identified prominent signals in ontologies and known pathways, finally identifying nexus genes through the use of a connectivity analysis and visualization solution. Our approach takes advantage of the unmatched high throughput of expression profiling while accounting for its significant drawback: many genes are regulated at the posttranslational level. Because our approach connects genes whose gene expression is significantly changed but only weights with significance, overall patterns of pathway activation become apparent even if several members of the network are not regulated at the transcriptional level. By incorporating prior knowledge in the choice of nexus genes and using a comprehensive gene expression platform, new insights into disease processes can be drawn and significant power in the estimation of nexus genes harnessed.
This work was supported by the Donald W. Reynolds Cardiovascular Clinical Research Center at Stanford University.
↵* J. Y. King and R. Ferrara, and T. Quertermous and E. A. Ashley, contributed equally to this work.
Article published online before print. See web site for date of publication (http://physiolgenomics.physiology.org).
Address for reprint requests and other correspondence: T. Quertermous or E. A. Ashley, Donald W. Reynolds Cardiovascular Research Center, Div. of Cardiovascular Medicine, Falk Cardiovascular Research Center, Stanford Univ., 300 Pasteur Dr., Stanford, CA 94305 (e-mail:or ).
- Copyright © 2005 the American Physiological Society