This study creates a compendium of gene expression in normal human tissues suitable as a reference for defining basic organ systems biology. Using oligonucleotide microarrays, we analyze 59 samples representing 19 distinct tissue types. Of ∼7,000 genes analyzed, 451 genes are expressed in all tissue types and designated as housekeeping genes. These genes display significant variation in expression levels among tissues and are sufficient for discerning tissue-specific expression signatures, indicative of fundamental differences in biochemical processes. In addition, subsets of tissue-selective genes are identified that define key biological processes characterizing each organ. This compendium highlights similarities and differences among organ systems and different individuals and also provides a publicly available resource (Human Gene Expression Index, the HuGE Index, http://www.hugeindex.org) for future studies of pathophysiology.
- human tissues
- gene expression
With the recently announced completion of the human genome project (8, 37a), greater attention is now focused on defining the biological significance and functional properties of the ∼30,000 human genes. Toward this end, a fundamental and primary objective is to define global patterns of gene expression that characterize human tissues in normal and disease states. DNA microarrays, along with other high-throughput approaches, can successfully elucidate expression patterns that distinguish disease states such as different types of cancers (2, 3, 6, 11, 15, 28). Individually, these distinguished genes are potential molecular markers or potential therapeutic targets for a disease process (5, 13, 14, 18, 23, 28, 38, 41). Establishment of baseline expression patterns in normal tissues is an essential element in accurate interpretation of those changes associated with pathological states.
In the present study we use oligonucleotide microarrays (GeneChip HuGeneFL) to analyze expression of 7,070 unique sequences in 59 tissue samples representing 19 healthy human tissue types. The purpose is to create a database that can serve as a reference or compendium of expression profiles for studies of human disease. Using a variety of statistical approaches, we identify gene expression patterns that characterize different tissue types. The results reveal striking quantitative similarities and differences among tissues, even for those genes expressed constitutively.
METHODS AND MATERIALS
We obtained 59 human samples of 19 different tissue types, from 49 human individuals including 24 males and 25 females with median age of 63 and 50, respectively [for a table of demographic information (Supplement 1), please refer to the Supplementary Material (footnote 1) for this article, published online at the Physiological Genomics web site; this information is also available at the Human Gene Expression Index web site (the “HuGE Index”) at http://www.hugeindex.org]. These were provided by tissue banks, surgical procedures or autopsies (Massachusetts General Hospital and Brigham and Women’s Hospital) with appropriate Institutional Research Board consent. The specimens were immediately immersed in liquid nitrogen upon isolation. Each tissue was divided into matched fractions for RNA isolation and histological examination. Only those with normal histological examination were included in this study, and medical histories were not a criterion for exclusion.
Samples were fixed at room temperature in neutral pH phosphate-buffered 10% formalin, dehydrated in graded alcohols, and embedded in paraffin using an automated tissue processor. Four-micrometer-thick paraffin sections were rehydrated and stained routinely with hematoxylin and eosin. Light microscope examination was performed to confirm normal tissue morphology. Histological sections of the tissues are available at http://www.hugeindex.org.
RNA preparation for hybridization.
Total RNA was isolated using Trizol solution (GIBCO-BRL, Life Technologies, Rockville, MD). Seven micrograms of total RNA was used for amplification, and the amplified product was labeled with biotin following a procedure described previously (7, 25, 39). Briefly, double-stranded cDNA was synthesized using the SuperScript Choice System (GIBCO-BRL) and a T7-(dT)-24 primer (Geneset Oligos, La Jolla, CA). The cDNA was purified by phenol/chloroform/isoamyl alcohol extraction with Phase Lock Gel (5Prime → 3Prime, Boulder, CO) and concentrated by ethanol precipitation. In vitro transcription was performed to produce biotin-labeled cRNA using a BioArray HighYield RNA Transcript Labeling Kit (Affymetrix) according to the manufacturer’s instructions. cRNA was linearly amplified with T7 polymerase. The biotinylated RNA was cleaned with RNeasy Mini kit (Qiagen, Valencia, CA).
Labeled cRNA, 20 μg, was fragmented and hybridized using the protocol described previously (25). Briefly, the hybridization mixture was incubated at 99°C for 5 min. followed by incubation at 45°C for 5 min. The hybridization was then carried out at 45°C for 16–18 h. After being washed, the array was stained with streptavidin-phycoerythrin (Molecular Probes, Eugene, OR), amplified by biotinylated anti-streptavidin (Vector Laboratories, Burlingame, CA), and then scanned on an HP Gene Array scanner. The intensity for each feature of the array was captured with Affymetrix GeneChip Software, according to standard Affymetrix procedures (25) by performing typical scaling (with target intensity of 100) and normalization for all probe sets.
Quality control of samples.
Approximately 50% of the total RNA collected from tissues was discarded because of unsatisfactory quality on a 1% agarose gel. Each probe array contains several prokaryotic genes (e.g., bioB, bioC, and bioD are genes of the biotin synthesis pathway from the bacterium Escherichia coli; Cre is the recombinase gene from P1 bacteriophage), which serve as hybridization controls. In addition, the 3′-to-5′ expression ratios for both β-actin and glyceraldehyde-3-phosphate dehydrogenase (GAPDH) were evaluated; the 3′/5′ ratio should be less than 3 according to the manufacturer’s instructions. Data that failed to meet this criterion were excluded from analysis.
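The 3′/5′ criterion amounts to a simple per-sample filter. The following is a minimal sketch of that rule, not the software used in the study: the function name, the dictionary layout, and the sample values are hypothetical, and only the threshold of 3 comes from the text.

```python
def passes_qc(ratios, max_ratio=3.0):
    """Return True when every control-gene 3'/5' ratio is below max_ratio.

    ratios maps a control gene name (e.g. "beta-actin", "GAPDH") to its
    3'/5' signal ratio. This layout is a hypothetical one for illustration.
    """
    return all(ratio < max_ratio for ratio in ratios.values())


# Invented values, for illustration only.
samples = {
    "kidney_01": {"beta-actin": 1.4, "GAPDH": 1.1},
    "lung_02": {"beta-actin": 3.8, "GAPDH": 1.2},
}

# Keep only the samples whose control genes all satisfy the criterion.
kept = [name for name, ratios in samples.items() if passes_qc(ratios)]
```

A sample is retained only if both control genes pass, which mirrors the wording "for both β-actin and GAPDH".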
The present (P) or absent (A) calls of the Affymetrix GeneChip 3.1 Expression Analysis Algorithm were used to identify maintenance/housekeeping genes. All genes with a present call in at least one sample of each tissue type were included in the maintenance/housekeeping set [marginal (M) calls were conservatively treated as absent]. A hierarchical clustering algorithm (AGNES) (22) in the statistical analysis package SPLUS (37) was used to group the tissue samples using only the housekeeping genes. Using the “Manhattan” distance metric on standardized variables and the “Ward” linkage algorithm, we found that the 451 housekeeping genes alone were sufficient to clearly group the different tissue types.
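The selection rule for the maintenance/housekeeping set can be sketched as a set intersection across tissue types. This is an illustrative reading of the rule stated above, not the authors' code. The data layout, the function name, and the toy calls are assumptions, and the clustering step (AGNES in SPLUS) is not reproduced here.

```python
from collections import defaultdict


def housekeeping_genes(calls, tissue_of):
    """Genes called present ("P") in at least one sample of every tissue type.

    calls:     {sample_id: {gene_id: "P" | "M" | "A"}}  (hypothetical layout)
    tissue_of: {sample_id: tissue_type}
    Marginal ("M") calls are treated as absent, as described in the text.
    """
    present_by_tissue = defaultdict(set)
    for sample, gene_calls in calls.items():
        present_by_tissue[tissue_of[sample]].update(
            gene for gene, call in gene_calls.items() if call == "P"
        )
    # A gene qualifies only if it appears in the "present" set of every tissue.
    per_tissue = [present_by_tissue[t] for t in set(tissue_of.values())]
    return set.intersection(*per_tissue) if per_tissue else set()


# Invented calls, for illustration only.
calls = {
    "brain_1": {"g1": "P", "g2": "A", "g3": "P"},
    "brain_2": {"g1": "A", "g2": "P", "g3": "M"},
    "liver_1": {"g1": "P", "g2": "M", "g3": "P"},
}
tissue_of = {"brain_1": "brain", "brain_2": "brain", "liver_1": "liver"}
selected = housekeeping_genes(calls, tissue_of)
```

In the toy data, g2 is dropped because its only liver call is marginal, which shows the effect of treating M as absent.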
To identify tissue-selective genes, we used a two-tailed t-test to distinguish the gene expression levels in each tissue type from all other tissue samples at a 99.99% confidence level. The two-tailed t-test makes underlying assumptions about the distribution of the data, and this high confidence level was chosen to ensure that the list of tissue-selective genes obtained would still be reasonable, even though the assumptions may be met only in part (34). The tissue-selective genes obtained were ranked by their significance value, which determines the probability of observing a given level of discrimination for a gene by random chance. The lower the P value, the better the tissue-selective nature of the gene. A subset of 98 genes with the lowest P values, 14 from each tissue-selective subset from the brain, kidney, liver, lung, muscle, prostate, and vulva, were then used in a principal component analysis (PCA) to separate the tissue samples in PC space. Before performing the PCA, the data were autoscaled such that each gene had a mean of zero and unit standard deviation. This analysis was done using MATLAB. Finally, the coefficient of variation (CV = standard deviation/mean) for each tissue type was calculated to identify the tissue-variant genes.
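Two of the steps above, the per-gene t statistic and the autoscaling applied before PCA, can be sketched with the Python standard library. The text does not say whether a pooled-variance or unequal-variance t-test was used; the pooled form below is an assumption. The P value itself (compared against 0.0001 in the study) would come from a t distribution with na + nb − 2 degrees of freedom and is not computed here.

```python
import math
from statistics import mean, variance


def pooled_t_statistic(group_a, group_b):
    """Two-sample Student's t statistic with pooled variance (assumed form).

    group_a: expression levels of one gene in the tissue of interest.
    group_b: expression levels of the same gene in all other samples.
    """
    na, nb = len(group_a), len(group_b)
    pooled = ((na - 1) * variance(group_a) + (nb - 1) * variance(group_b)) / (
        na + nb - 2
    )
    return (mean(group_a) - mean(group_b)) / math.sqrt(pooled * (1 / na + 1 / nb))


def autoscale(values):
    """Shift one gene's values to zero mean and scale to unit sample SD."""
    m = mean(values)
    s = math.sqrt(variance(values))
    return [(v - m) / s for v in values]


# Invented expression levels, for illustration only.
t_value = pooled_t_statistic([10, 12, 14], [1, 2, 3])
scaled = autoscale([1, 2, 3])
```

Autoscaling puts every gene on the same footing, so that highly expressed genes do not dominate the principal components simply because of their magnitude.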
RESULTS
Housekeeping or maintenance genes and their tissue-specific expression patterns.
The Affymetrix GeneChip 3.1 Expression Analysis program uses a conservative call system to identify gene expression as “present” or “absent” among 7,070 unique sequences on each microarray. All 19 tissues in this study were analyzed, and a set of 451 genes with unique GenBank accession numbers (Supplement 2, published online at the Physiological Genomics web site and at http://www.hugeindex.org), 6.4% of 7,070 unique sequences, were identified as “present” in all 19 tissue types. This set of “housekeeping” or “maintenance” genes encodes proteins mediating a variety of basic cellular functions including intermediary metabolism, gene transcription, protein translation, cell signaling/communication, structure/motility, and other unclassified functions (1) (Fig. 1). Most of the ribosomal protein genes are included in this set. Overall, these housekeeping genes define basic cellular processes and could be used as a reference standard when comparing gene expression studies.
A subset of 535 “housekeeping genes” was previously reported from 11 fetal and adult human tissue samples (39), and 358 genes are common to both lists. An important result in both studies is that the majority of the genes commonly considered to have a housekeeping function (e.g., β-actin and GAPDH) exhibit considerably variable expression levels from one tissue type to another. In fact, we found that quantitative expression profiles for the maintenance/housekeeping genes alone exhibit unique patterns for each specific tissue type. In particular, using a hierarchical clustering analysis of the 451 housekeeping genes, we successfully clustered different tissues according to tissue type (Fig. 2). We observed that all clusters were derived from two major branches: one branch contains the hematopoietic, reproductive, urinary tract, and gastrointestinal systems, while the other major branch contains the nervous, muscular, and liver systems.
We also identified the 15 most highly expressed genes among the 451 (Table 1). They are primarily ribosomal proteins along with β-actin, GAPDH, and genes associated with defense/cell death (clusterin, metallothionein 2A). In addition, the CV for each of the maintenance/housekeeping genes was calculated, and the 15 genes with the highest and lowest CV were designated as most variable genes and most constant genes, respectively (Tables 2 and 3). Of note, both β-actin and GAPDH, commonly assumed to have constant expression levels, were among the most variable genes. The 15 most constant genes among the 451 maintenance/housekeeping genes could provide new standards for quantitative controls in all gene expression studies.
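The ranking of genes by CV can be sketched as follows. This is an illustration under stated assumptions: the sample (n − 1) standard deviation is used because the text does not specify which form was applied, and the function names and toy values are hypothetical.

```python
from statistics import mean, stdev


def coefficient_of_variation(values):
    """CV = standard deviation / mean (sample standard deviation assumed)."""
    return stdev(values) / mean(values)


def rank_by_cv(expression, k):
    """Return (most_variable, most_constant), each a list of k gene ids.

    expression maps gene_id -> expression levels across samples.
    most_variable is ordered from highest CV down; most_constant from
    lowest CV up.
    """
    ranked = sorted(
        expression,
        key=lambda gene: coefficient_of_variation(expression[gene]),
        reverse=True,
    )
    return ranked[:k], ranked[-k:][::-1]


# Invented expression levels, for illustration only.
toy = {
    "geneA": [100, 100, 101],
    "geneB": [10, 50, 200],
    "geneC": [40, 50, 60],
}
most_variable, most_constant = rank_by_cv(toy, k=1)
```

In the study k would be 15 and the input would be the 451 maintenance/housekeeping genes.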
Identification of tissue-selective genes and class prediction.
A two-tailed t-test was used to identify genes that are statistically highly expressed in a specific tissue type (P < 0.0001) using all 7,070 unique sequences from 59 samples. Our results reveal subsets of genes with unique GenBank accession numbers that are highly expressed only in brain (618 genes), kidney (91 genes), liver (277 genes), lung (75 genes), muscle (317 genes), prostate (46 genes), or vulva (101 genes). These are labeled “tissue-selective genes” (Supplement 3, published online at the Physiological Genomics web site and at http://www.hugeindex.org) as they are predominantly, not exclusively, expressed in one tissue type. Using 98 most selective genes (Supplement 4, published online at the Physiological Genomics web site and at http://www.hugeindex.org), we performed PCA to examine whether it was possible to discriminate among tissue types in three-dimensional expression space. As shown in Fig. 3, these “tissue-selective genes” can serve as templates for class prediction. For example, in brain the tissue-selective genes include those associated with myelin structure (e.g., myelin basic protein), with astrocytic differentiation (e.g., glial fibrillary acidic protein), with synaptic reorganization (e.g., calcium channel, voltage-dependent β2, calmodulin 3, and GAP-43), and with neurotransmission (e.g., GABA receptor and glial high-affinity glutamate transporter). Among the tissue-selective genes (Supplement 3) we also identify smaller subsets of “tissue-specific” genes, defined statistically as P = 0 and having no overlap in expression level with any other tissue.
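The projection of samples into three-dimensional PC space can be illustrated with a small, dependency-free sketch. The study performed PCA in MATLAB. The version below swaps in power iteration with deflation on the covariance matrix of autoscaled data, purely for illustration, and all function names and toy values are hypothetical.

```python
import math
import random


def autoscale_columns(rows):
    """Give each column (gene) zero mean and unit sample standard deviation.

    Assumes no column is constant (a zero SD would cause division by zero).
    """
    n = len(rows)
    columns = list(zip(*rows))
    means = [sum(col) / n for col in columns]
    sds = [
        math.sqrt(sum((v - m) ** 2 for v in col) / (n - 1))
        for col, m in zip(columns, means)
    ]
    return [[(v - m) / s for v, m, s in zip(row, means, sds)] for row in rows]


def covariance(centered_rows):
    """Gene-by-gene covariance matrix of centered rows (samples x genes)."""
    n, p = len(centered_rows), len(centered_rows[0])
    return [
        [sum(r[i] * r[j] for r in centered_rows) / (n - 1) for j in range(p)]
        for i in range(p)
    ]


def dominant_eigenpair(matrix, iterations=500):
    """Largest-magnitude eigenpair of a symmetric matrix by power iteration."""
    p = len(matrix)
    rng = random.Random(0)  # fixed seed keeps the sketch reproducible
    vec = [rng.random() for _ in range(p)]
    norm = math.sqrt(sum(x * x for x in vec))
    vec = [x / norm for x in vec]
    for _ in range(iterations):
        nxt = [sum(matrix[i][j] * vec[j] for j in range(p)) for i in range(p)]
        norm = math.sqrt(sum(x * x for x in nxt))
        if norm == 0.0:  # nothing left to extract
            break
        vec = [x / norm for x in nxt]
    value = sum(
        vec[i] * sum(matrix[i][j] * vec[j] for j in range(p)) for i in range(p)
    )
    return vec, value


def pca(rows, n_components=3):
    """Return (scores, eigenvalues) for samples x genes input."""
    scaled = autoscale_columns(rows)
    cov = covariance(scaled)
    p = len(cov)
    components, eigenvalues = [], []
    for _ in range(n_components):
        vec, value = dominant_eigenpair(cov)
        components.append(vec)
        eigenvalues.append(value)
        # Deflation: remove the component just found before the next pass.
        cov = [[cov[i][j] - value * vec[i] * vec[j] for j in range(p)] for i in range(p)]
    scores = [[sum(a * b for a, b in zip(row, c)) for c in components] for row in scaled]
    return scores, eigenvalues


# Invented data: 4 samples x 3 genes, two samples per "tissue".
rows = [[1, 2, 10], [2, 4, 9], [9, 18, 2], [10, 20, 1]]
scores, eigenvalues = pca(rows, n_components=2)
```

Each row of scores gives one sample's coordinates along the retained components. With the 98 tissue-selective genes as columns and n_components=3, these coordinates would correspond to the three-dimensional expression space described above. For real data a tested numerical library would be preferable to this hand-rolled routine.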
Kidney-selective genes include those known to be highly expressed in this organ such as uromodulin (Tamm-Horsfall glycoprotein), α-enolase, and ion transporters (e.g., β1-subunit of Na+-K+-ATPase, Na-Cl electroneutral thiazide-sensitive cotransporter, K-inwardly-rectifying channel, bumetanide-sensitive Na-K-2Cl cotransporter, amiloride-sensitive epithelial sodium channel, and amiloride binding protein 1). In addition, hydroxysteroid (11-β) dehydrogenase 2 (11β-HSD2), a gene whose product inactivates glucocorticoids and prevents them from binding to the nonselective mineralocorticoid receptor, is also highly expressed. In the kidney, this NAD-dependent high-affinity isoform is thought to confer specificity on the mineralocorticoid receptor by acting as an autocrine protector of the receptor and to play an important role in cardiovascular homeostatic mechanisms (24).
As anticipated, the liver-selective genes include those associated with the coagulation pathway (e.g., factors II, V, VII, IX–XII, fibrinogen, plasminogen, protein S, and antithrombin III), complement pathway (e.g., C2, C4, C5, C8, C9), alcohol metabolism (e.g., alcohol dehydrogenase), lipid processing (e.g., apolipoproteins), bile metabolism (e.g., bile acid CoA:amino acid N-acyltransferase), antitrypsin member 8, and xenobiotic metabolism (e.g., cytochrome P-450). Also highly expressed in liver are serum amyloid A1 and A4 (amyloid fibril formation); angiogenin, ribonuclease, RNase (angiogenesis); α1-glycine amidinotransferase (creatine biosynthesis); cysteine dioxygenase, type I (cysteine metabolism); and genes associated with growth, such as insulin-like growth factor, growth hormone receptor, and hepatocyte growth factor (HGF) activator. Unlike 11β-HSD2, which is selective for kidney, 11β-HSD1 is the isoform in liver involved in steroid metabolism.
The lung-selective genes include those associated with extracellular matrix (e.g., pulmonary surfactant associated protein), HLA/cytokine (e.g., MHC II, γ-interferon inducible protein 30) and others (von Willebrand factor, claudin 5, palmitoyl-protein thioesterase 2, mannose receptor, lung cytochrome P-450). The muscle-selective genes include those associated with the cytoskeleton (e.g., actin, α1, actinin α2–3), contraction (e.g., tropomyosin, troponin, myosin), mitochondria (e.g., cytochrome C-1, ubiquitin, creatine kinase), and metabolism of glucose, glycogen, and lipids (e.g., lactate dehydrogenase, phosphoglucomutase 1, carnitine palmitoyltransferase). Furthermore, carbonic anhydrase III for CO2 metabolism; creatine kinase, mitochondrial 2 (sarcomeric) for energy transduction; and a gene for thermal regulation (neurotrophic tyrosine kinase, receptor, type 1) are also highly expressed in muscle.
The prostate-selective genes include those that are associated with hormones (e.g., prostate secretory protein), redox pathways (e.g., prostatic acid phosphatase, aldehyde dehydrogenase 6), cytoskeleton (actin-binding protein-278), and others (e.g., prostate-specific antigen, T-cell receptor-γ, TGF-β3, estrogen regulated LIV-1 protein). The vulva-selective genes include those associated with the cytoskeleton (e.g., keratin, ladinin, loricrin), extracellular matrix (e.g., desmoplakin 1, profilaggrin, epican, connexin 26, galectin 7, desmocollin), and hair follicle-related protein (e.g., basic/acidic hair keratin).
Identification of variant genes within a tissue.
Another question of significant interest is whether there are genes whose tissue-specific expression is highly variable between different individuals. To address this, we calculated the CV for genes called “present” in all samples. Figure 4 shows a histogram depicting the distribution of CV among different kidney specimens. The mean CV for the distribution was 0.31 with a standard deviation of 0.25. The genes with a CV more than two standard deviations away from the mean are highlighted, indicating those that are most variable in kidney. These transcripts include several known to be associated with disease phenotypes. For example, the Na-Cl electroneutral thiazide-sensitive cotransporter is the target of a major antihypertensive diuretic, and mutations in this gene can cause Gitelman’s syndrome, an autosomal recessive disease characterized by diverse abnormalities in electrolyte homeostasis (7, 36). In addition, aldose reductase plays a key role in the diabetic complications of kidney, nerve, and retina (10, 19, 29–31). We also observed similar distribution patterns of CV for brain, liver, lung, muscle, and vulva and identified small subsets (<2%) of genes that are highly variable (Supplement 5, published online at the Physiological Genomics web site and at http://www.hugeindex.org). In lung, the most variant genes include integrin-β2, which has been shown to predispose individuals to recurrent bacterial infections (16, 20), and antileukoproteinase, which is involved in several chronic and acute diseases of the respiratory tract (4, 35). In liver, the most variant genes include insulin-like growth factor 2, a putative susceptibility factor for obesity (9, 12, 32); fibrinogen-γ, defects of which are a cause of thrombophilia (26, 27); and hepatic lipase, a complete deficiency of which causes coronary atherosclerosis and premature dyslipidemia (17, 33). Each of the other 13 tissue types contains samples from fewer than 3 different individuals; therefore, CVs were not calculated for those tissues.
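The "two standard deviations from the mean" rule for flagging tissue-variant genes can be sketched as a threshold on the per-gene CV values. The sketch assumes that only the upper tail is of interest, since a CV cannot fall below zero, and that the sample standard deviation is used. The function name and the toy values are hypothetical.

```python
from statistics import mean, stdev


def highly_variable(cv_by_gene, n_sd=2.0):
    """Genes whose CV exceeds mean(CV) + n_sd * SD(CV), taken across genes.

    cv_by_gene maps gene_id -> coefficient of variation within one tissue.
    Only the upper tail is flagged (an assumption of this sketch).
    """
    values = list(cv_by_gene.values())
    cutoff = mean(values) + n_sd * stdev(values)
    return sorted(gene for gene, cv in cv_by_gene.items() if cv > cutoff)


# Invented CVs: nine stable genes and one highly variable gene.
toy_cvs = {f"g{i}": 0.25 for i in range(1, 10)}
toy_cvs["gX"] = 1.5
flagged = highly_variable(toy_cvs)
```

With the kidney values reported above (mean 0.31, standard deviation 0.25), the same rule would place the cutoff at 0.31 + 2 × 0.25 = 0.81.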
DISCUSSION
In 1965, Watson et al. (40) defined the housekeeping genes as those genes that are “always expressed” in every tissue to maintain cellular functions. This study is the largest quantitative survey, evaluating ∼7,000 expressed sequences from 19 normal adult human tissue types. We identified a subset of 451 genes expressed in all normal adult human tissue types. This result supplements a previous report of 535 genes that were expressed in 11 fetal and adult tissues (39). Also, 358 of these maintenance/housekeeping genes are common to both lists. Functional annotation revealed that these genes participate in many active cellular processes.
We also found that expression of many of these maintenance/housekeeping genes is highly variable. In particular, we report here that these maintenance/housekeeping genes alone contain “tissue-specific” expression patterns (Fig. 2), which may be used to distinguish an individual tissue type. These results suggest that the gene expression patterns of maintenance/housekeeping genes reflect intrinsic differences among the individual tissues, most likely related to differences in metabolic activity and cytoarchitecture. The ability of housekeeping genes to define different biological states suggests that they may be suitable for distinguishing different disease states as well. In a practical sense, these genes could be used as standard controls in all gene expression studies to facilitate data comparison among laboratories and across platforms.
We identified subsets of genes that are highly expressed in one tissue type but not in others. We labeled these “tissue-selective genes” rather than “tissue-specific genes,” since very few were expressed in only a single tissue type and a number of human tissues were not included in our analysis. These tissue-selective genes, ranging from 75 to 621 genes per tissue, have enough power to provide class prediction using a three-dimensional PCA. Additionally, these subsets of genes are found to be closely associated with the major functions carried out by each specific tissue type, e.g., genes related to myelin proteins and glial differentiation in brain; genes involved in coagulation and complement pathways in liver; genes associated with channels and transporters in kidney; and genes for pulmonary surfactant proteins in lung. We suggest that these genes may represent potential “signature” genes for the specific tissues with the important caveat that more tissues need to be sampled (e.g., endocrine system) to refine the “tissue-selective” fingerprints. Furthermore, ongoing efforts to deduce the functions of orphan genes will benefit from defining those that have tissue-selective expression patterns, as this will highlight a limited number of biological processes that should be considered.
Although all samples used in this study are normal tissues from a histological perspective, one important observation from our study is the demonstration that, for a given tissue, different individuals have a small set of genes with highly variable expression. Logically, tissues from either biopsy or autopsy often contain multiple cell types, which are in various states. It is conceivable that the variations are due to differences in the cell types and their states when the tissues were collected. In addition, the differences of age, gender, underlying health, and medications may also play roles in the variation. Further studies will be needed to address these issues. Even so, the presence of tissue-variant genes is consistent with the notion that the genotypes and the inherent plasticity of human tissues of an individual may contribute to gene-specific expression.
We thank Yangxi Wang, Mani Khounviengsay, Nathan Best, and Ken Auerbach for web site design. We also thank Frank Haluska and Mohammed Miri for assistance with tissue acquisition. We acknowledge Lynn Mills, a high school biology teacher, for inspiring the name of the HuGE Index.
This work was supported by the Merck Genome Research Institute. In addition, support was provided by National Institutes of Health Grants DK-36031 and DK-58849 (to S. R. Gullans), DK-09987 (to L.-L. Hsiao), CA-80084 (to F. Dangond), NS-16367 (to M. Mahadevappa), and DK-58533 (to Gregory Stephanopoulos). This work was also partially supported by Grants DE-FG02-94ER-14487 and DE-FG02-99ER-15015 (to Gregory Stephanopoulos) from the Engineering Research Program of the Office of Basic Energy Science at the Dept. of Energy, by the BSG foundation (to R. Bueno), and by Integrative Graduate Education and Research Traineeship 9870710 (to P. Haverty).
1 Supplementary Material (supplements 1–5) to this article is available online at http://physiolgenomics.physiology.org/cgi/content/full/7/2/97/DC1.
Article published online before print. See web site for date of publication (http://physiolgenomics.physiology.org).
Address for reprint requests and other correspondence: S. R. Gullans, 65 Landsdowne, Cambridge, MA 02140.
Copyright © 2001 the American Physiological Society