In functional genomics, the high-throughput methods such as microarrays 1) allow analysis of the relationships between genes considering them as elements of a network and 2) lead to biological interpretations thanks to Gene Ontology. But up to now it has not been possible to find relationships between the functions and the connectivity of the genes in coexpression networks. To achieve this aim, we have defined a double connectivity for each gene by the numbers of its significant negative and positive correlations with the other genes within a given biological condition, or group. Here, based on the analysis of 1,260 DNA microarrays, we show that this double connectivity clearly separates two types of genes, those with a predominantly strong negative connectivity, hub− genes, and those with a predominantly strong positive connectivity, hub+ genes. Interestingly, the hub+ genes concerned transcription factors more often than by chance and, similarly, for the hub− genes concerning miRNA predicted targets. Furthermore, a meta-analysis of GO annotations carried out on 67 groups in humans and rats shows that these two types of genes correspond to a functional biological duality. The hub− genes were mainly involved in basic functions common to all eukaryote cells, whereas the hub+ genes were mainly involved in specialized functions related to cell differentiation and communication. The separation and the biological role of these hub− and hub+ genes provide a powerful new tool for a better understanding of the control and regulation of the key genes involved in cellular differentiation and physiopathological conditions.
- Pearson correlation
- systems biology
- transcription regulation
mathematical network modeling of complex biological systems, observed at the neuronal (37), protein (6, 39), metabolic (29, 32), or gene expression (5, 12, 13, 19, 36) level, has become an important step in understanding the cell's functional organization (for review, see Refs. 3, 25). In a general way, the connectivity of a node within a network allows the degree of involvement of the node in the network to be appreciated. The most highly connected nodes are called “hub.” It has been already shown that they may play an important role in the network (6). Currently, there is a growing interest in defining gene networks from microarray expression data (1, 2). Numerous sophisticated machine-learning algorithms coupled to various statistical methods have been proposed to achieve this aim (13, 16, 26, 40). Genes are usually said to be coexpressed or connected when their expression levels are linearly linked within a group. The connectivity of each gene for this group is then defined by its number of connections. The Pearson correlation remains the basic method for assessing pairwise expression gene linear links. It is widely used both for clustering gene expression profiles (18) and for calculating connectivity in gene coexpression networks (40). Surprisingly, little use has been made of the algebraic (negative or positive) sign of this coefficient. Most of the studies deal with the absolute value of the correlation coefficient, which allows the pairwise gene links to be represented in a binary way. In this study, we define a double connectivity, that is either positive or negative, for each gene in a group and for highly connected hubs.
Sharing genomic knowledge and assessing gene functions were such complex tasks that by the late nineties it became necessary to collate a database of gene properties expressed in a standardized manner (31). Functional gene annotations using Gene Ontology (GO) (33, 34) summarizes knowledge about the role of each gene in a dynamic controlled vocabulary and is very helpful for carrying out functional interpretations, particularly when interpreting the functions of hub genes (17, 22, 30).
In addition, the existence of publicly available microarray datasets allows meta-analyses (23, 30), or analyses over pooled data (35). Such analyses are essential for comparing the way genes operate in various observation conditions. However, due to the high cost of microarray experiments, few studies have been performed in samples with more than five observations, which is the minimum size for computing our correlation.
In this work, we made two main hypotheses: 1) the double connectivity discloses the existence of hub− or hub+ genes, having a predominantly strong negative or positive connectivity, 2) these hub− and hub+ genes have different functions. Our objectives were then 1) to analyze double connectivity of genes from a large number of public microarray data sets obtained in various tissues and biological conditions, and 2) to give a general interpretation using functional annotations of genes with GO terms.
MATERIALS AND METHODS
We first analyzed microarray data obtained in our laboratory from left ventricular hypertrophy in three models of hypertensive rats. (15). Then, microarray data sets were obtained from public repositories for microarray gene expression data, ArrayExpress (European Bioinformatics Institute, Cambridge, UK) (28) and Gene Expression Omnibus (National Center for Biotechnology Information, National Library of Medicine, Bethesda, MD) (4). Data were obtained from various mammalian tissues, excluding cultured cells. We selected the data from Affymetrix GeneChips (Affymetrix, Santa Clara, CA) (RAE230A in rats; HG-U133A, HG-U133B, and HG-U133 Plus 2.0 in humans; and U74A in mice). We selected, from 35 datasets, 130 groups that were made up of at least five individuals showing the same controlled biological conditions (e.g., experimental, clinical, etc.), 48 groups in rats, 81 in humans, and only 1 in mice because of the inadequate sizes of the group samples available in this species (Supplementary Table S1).1 Various tissues had been analyzed in these groups: namely skeletal muscle, heart, kidney, bladder, prostate, breast, reproduction organs, lung, bone marrow, or brain in normal or pathological states. In humans, more than half of the groups were composed of cancer tissues. For a large majority of the groups, the microarray data had been preprocessed using Affymetrix software tools (MicroArray Suite 5.0 or GeneChip Operating Software) (see supplementary information p. 8). No prior filtering of the data was found to be necessary for the purposes of this study.
Double connectivity computation.
The computations were made separately for each group. The microarray gene expression data were represented in an N × n matrix, where N is the number of probe sets (taken as roughly equivalent to genes) and n the number of individuals in each group. We took log2-transformed gene expression values. The distribution of the log2-transformed expression data was found to be normal for >90% of the genes in >90% of the 130 groups using a one-tailed Shapiro-Wilk test at the significance level of 1%. We included all the probes of the microarrays in the computations irrespective of their expression level, their degree of normality, and their accuracy (values called “absent”). We also included in the computation the expression values that were below the detection threshold because although they give correlations that may be of interest, they could not be highlighted when missing some values. Thus, we estimated the linear link between two genes within a group by the Pearson correlation coefficient.
Within a group, two genes were considered to be connected if the absolute value of the Pearson correlation coefficient r between their log2-transformed expression values was significantly greater than the threshold value rth. We chose the threshold rth from the usual two-tailed Student's t-test at the significance level of 5% with (n − 2) degrees of freedom. The sign of the correlation coefficient gave the sign of the link. The number of negative (k−) or positive (k+) links of one gene with all the other genes of the microarray defined its double connectivity (k−, k+) in the coexpression network. For each gene, double connectivity can be viewed as computed by multiple testing. But since our purpose was not to determine the family of genes connected to one gene, we did not need to control the false positive rate using an adjusted P value. We only aimed at sorting the genes according to their double connectivity, and it was then sufficient to compute the connectivity in a similar way for all the genes.
When the group size was <20, a jackknifed Pearson correlation was computed to reduce the effect of extreme values. To test whether double connectivity values could be obtained by chance, we performed the double connectivity computation after randomization of the data by bootstrap. For each gene within a group, we conducted a resampling with replacement of its expression values from the observed values in the group. Then we calculated the double connectivity of each gene for the group. We repeated this procedure 1,000 times and computed the average of the negative and positive connectivities.
Defining hub− and hub+ genes.
Hub genes are highly connected genes, and the sign of one hub gene is given by the sign of its strongest connectivity, negative or positive. Mathematically, hub− genes were defined by: 1) k− > k+; and 2) k− > kth with kth = <k−> + 2SDk−, where <k−> indicates the average of the distribution of k− and SDk− the standard deviation of k− for the group. Similarly, hub+ genes were such as: 1) k+ > k−; and 2) k+ > kth. A study of the robustness of the hubs determination is given in the supplementary information (see Supplementary Fig. S1).
Involvement of hub genes in transcriptional regulation.
We investigated the relationship between hub genes and regulation of gene expression by transcription factors or microRNAs (miRNAs) in humans. 1) We obtained transcription factors from the KEGG website (http:/www.genome.ad.jp/, last updated on 10/31/2007). We collected 1,148 gene symbols of transcription factors, which matched with 1,096 probes of the HG-U133A GeneChip. 2) We obtained miRNA target predictions from the TargetScan site (http://www.targetscan.org/, Whitehead Institute for Biomedical Research, Cambridge, MA) (20). This database collects miRNAs families that were conserved during evolution across human, mouse, rat, dog, and chicken and predicts their conserved target sites. We selected the 454 human miRNAs having the most reliable associations between one miRNA and one predicted target (context score percentile >50). We obtained 11,670 miRNA target genes on the HG-U133A GeneChip.
We tested whether the proportions of hub− or hub+ genes within the transcription factors or the predicted targets of each miRNA differed from those obtained by chance using a two-sided Fisher exact test at the significance level of 5%. We used the Benjamini and Hochberg adjusted P values for multiple testing (9). It enabled us to detect whether there were significant enrichments or impoverishments of transcription factors in hub− genes, as well as to detect the miRNAs that significantly targeted more or less hub− genes than by chance. The same analysis was carried out for the hub+ genes. We performed all these computations over the 31 human data groups exhibiting a clear separation of hub− and hub+ genes, taking into account only the probes associated with a gene symbol.
We performed the same analysis in experimentally validated miRNA targets, obtained from the public database TarBase v4.0 (http://www.diana.pcbi.upenn.edu/tarbase.html). These miRNA targets corresponded to 1,048 genes-probes on the HG-U133A GeneChip.
To give a functional interpretation to the hub− or hub+ potentiality of genes, we performed a functional meta-analysis using GO annotations. For that, we examined the public data from microarrays for which we had enough groups: HG-U133A in the human (43 groups) and RAE230A in the rat (48 groups). We retained those groups in which the separation of hub− and hub+ genes was evident (31 out of 43 groups in the human and 36 out of 48 in the rat). These groups are marked with OK in Supplementary Table S2.
Defining top− genes and top+ genes in the meta-analysis from mean standard double connectivity.
For the functional meta-analysis, carried out in the human and in the rat, we used the annotations of genes in the GO category “biological process” that were supplied by Affymetrix on 11/5/2007. We found 16,848 genes annotated in humans and 9,107 genes annotated in rats in this GO category. To accommodate the different group sizes, the negative and positive connectivities of each gene were standardized each to have a mean of 0 and an SD of 1. Then, over all the selected groups of humans or rats, we computed the mean standard double connectivity (kms−, kms+) for each gene. We retained the top 1,000 annotated genes that exhibited the best mean negative or positive standard connectivity in humans, and the top 540 annotated genes in rats, i.e., ∼ 6% of the number of the annotated genes for both. We named them top− or top+ genes. Homology between human and rat top genes was tested using the HomoloGene database (http://www.ncbi.nlm.nih.gov/sites/entrez?db=homologene).
GO terms significantly enriched in top− or top+ genes.
We performed preliminary studies using bioinformatics Web tools such as GOstat (7) (http://gostat.wehi.edu.au/) and eGOn (8) (http://www.genetools.microarray.ntnu.no/egon/index.php). To analyze a large number of gene lists, we developed our own software in C language. We counted the number of top− genes and top+ genes annotated by each GO term. We selected the GO terms that significantly annotated more numerous top− (or top+) than the other genes, using a two-sided Fisher exact test at the significance level of 10%. We used the Benjamini and Yekutieli adjusted P values for multiple testing in case of dependent data (10). We called them GO terms enriched in top− (or top+) genes. So, four cases could be expected for a GO term, whether it was enriched or depleted in top− genes and whether it was enriched or depleted in top+ genes.
RESULTS AND DISCUSSION
Double connectivity and hub genes.
For the different species and tissues, each scatter plot of the genes as a function of their double connectivity (k−, k+) shows a very characteristic shape. It is formed by two separated “wings,” symmetrical with respect to the principal diagonal (Fig. 1A and Supplementary Fig. S2). Figure 1A shows the wings distinguishing three types of genes, namely those with a strong negative connectivity (hub−: green dots), those with a strong positive connectivity (hub+: red dots) and the others (gray dots). Over 70% of the 130 groups tested exhibit this wing-shape gene scatter plot. The remaining 30% exhibit the same symmetry, but poorly separated wings. We studied the distributions of the gene expression variation coefficient according to the quality of wing separation (Supplementary Table S3). In cases of clear wing separation, variation coefficients are higher than in cases of bad wing separation. When the dispersion is weak, the range of regulation is small. Many physiological regulations cannot then be evidenced by correlation, which results in a poor separation of hub− and hub+ genes when plotted.
As shown in Fig. 1A, randomizing the data by bootstrap 1) considerably reduced the range of the connectivities, 2) suppressed the wing shape of the scatter plot, and 3) maintained the symmetry with respect to the principal diagonal. These observations indicate that 1) the symmetry around the diagonal is due to a mathematical property, the transitivity, of the correlation, and that 2) the separation (wing shape) of hub− and hub+ genes is due to a biological property of regulation, shown by the marked imbalance between the negative and positive correlations of some genes.
Focusing on a given gene of interest G, Fig. 1B shows, in each example, that the genes most strongly correlated to G exhibit a double connectivity with specific features. When their correlation with G is positive, they make a condensed neighborhood around G on the scatter plot. In contrast, in case of negative correlations with G, they make a condensed neighborhood that is symmetrical with the former. When G is a hub, the best correlations are seen with two sets of genes that are hubs as well. This neighboring feature is partly related to the correlation's transitivity. Indeed, if two genes are perfectly positively correlated (correlation coefficient = 1), they necessarily have similar double connectivity values. Conversely, if two genes are perfectly negatively correlated (correlation coefficient = −1), the negative connectivity of one gene is equal to the positive connectivity of the other gene.
Double connectivity and expression level.
The importance of a gene is usually examined from its RNA expression level, yet its involvement in the gene network is rarely taken into account. The computation of the Pearson correlation between the negative or positive connectivity and the mean expression level did not reveal any clear, consistent correlation across the different groups (Supplementary Table S4). This major result clearly confirms the absence of a clear linear link between the gene expression value and the unsigned connectivity that has already been shown (14). It indicates that the double connectivity does not supply the same information as the expression level, and so offers new ways to analyze and interpret transcriptional data.
General statistics of hub genes properties.
The properties of hub− and hub+ genes are illustrated for one group of rats in Fig. 2 and for each group in Supplementary Table S2. According to our definition criteria, the hub− genes on average represented ∼4% of the genes on the microarray for all the groups. There were fewer hub− genes than hub+ genes (∼11%). Hub− genes were expressed at significantly higher levels in 91% of the groups.
Up to now, the concept of hub genes or hub proteins has never been associated with the sign of the pairwise interactions. Using either unsigned gene coexpressions or unsigned protein interactions, several authors pointed out the importance of hub proteins and showed the modular structure of the whole network (6, 13, 16). In addition, the authors stressed the slower probable evolution of hub proteins than less connected ones (6, 13). In this work we separated hub genes into two types. The hub− genes, with the most numerous negative correlations, could mainly be involved in repressive processes, whereas the hub+ genes could mainly be involved in activator processes. In addition, we showed that 1) the genes, correlated or anticorrelated with a given hub gene, were hub genes as well, and that 2) these two sets of hub genes were strongly anticorrelated to each other (data not shown). Therefore, taking the sign of the correlations into account uncovers the clear balance between coactivated or coactivator genes, and coinhibited or coinhibitor genes, that is involved in a general regulation process.
Transcription factors and microRNAs.
Our main finding was evidence of the statistical tendency for hub+ genes to include transcription factors and for hub− genes to include miRNA targets. Table 1 shows that, in 55% of the groups, hub+ genes were found more often in the transcription factors than would happen by chance, whereas hub− genes were underrepresented in 74% of the groups (see Supplementary Table S5). Conversely, more hub− genes than by chance were found in miRNA predicted targets, for 51% (233 of 454) of miRNAs, on average over all the human groups, independently of their pathological state (see Table 1), whereas significantly less hub+ genes than by chance were observed in miRNA targets, for 53% (242 out of 454) of miRNAs (see Supplementary Table S6 for details). These results are also confirmed for experimentally validated targets (Supplementary Table S7).
Hub genes and top genes.
As the double connectivity of genes varied a lot between groups, the status of hub was variable (Fig. 3A). Only 25% of the genes were hub genes in more than three different groups. The genes that were frequently hub, however, were hub with the same sign. This variability of hub status complicated the meta-analysis, and we preferred to use mean standard double connectivity when defining top− and top+ genes. Figure 3B shows the clear separation between top− and top+ genes. These top genes were hub genes with the same sign in an average of ∼7 groups. Thus, the top genes corresponded to those hub genes particularly highly connected in some groups.
The functional meta-analysis used data from a range of tissues studied in various conditions. A stability in the functions of the top− and top+ genes in both humans and rats clearly emerges from this diversity. Strikingly, the GO terms corresponding to similar biological functions appeared to be significantly enriched in top genes with the same sign. Table 2 gives some examples obtained from human data: the GO terms concerning “protein degradation” were exclusively enriched in top− genes (asterisk), whereas those concerning “cell communication and social life” were mainly enriched in top+ genes (dagger). So, we grouped all the enriched GO terms that shared similar biological functions into GO groups, detailed in Supplementary Table S8. These we arranged into eight broader functional groups that are enumerated in Table 3. GO terms concerning nontranscriptional processes, protein processes, cell cycle, energy conversion, and cytoskeleton were mainly enriched in top− genes (italics). In contrast, GO terms related to ion transport, cell communication and signal transduction were mainly enriched in top+ genes (underlining).
These points are illustrated in the supplementary lists of the top− and top+ genes that are annotated by some of the GO terms in Table 2 (Supplementary Tables S9–S11). The top− genes annotated by the GO terms concerning “protein degradation,” whatever their enrichment, are largely related to ubiquitin-proteasome and proteolysis systems (∼100 distinct genes). Inversely, the top+ genes referred instead to proteins with various peptidase or ligase activities that were often specialized and related to cell cycle or signal transduction (see Supplementary Table S9). In the group “cell communication,” the GO term (GO:0007264) called “small GTPase-mediated signal transduction” annotated mainly top− genes such as Rab and Rho GTPases, or members of the RAS oncogene family, involved in membrane budding and trafficking (Supplementary Table S10). The GO term (GO:0007186) called “G protein-coupled receptor protein signaling pathway” annotated numerous top+ genes (89 genes), which are associated to various receptors concerned with the activities of the cellular specialization (Supplementary Table S11).
All the above results came from statistical analyses comparing observed proportions of genes to proportions obtained by chance. Therefore, they revealed only general tendencies, in terms of enrichment or impoverishment. These tendencies did not exclude opposite or complementary features. In summary, top− genes enriched the GO terms that essentially characterize processes relative to essential functions common to all cells (protein machinery, cell cycle, and energy conversion), whereas top+ genes enriched the GO terms that essentially characterize specific processes of pluricellular organisms (social life, cell communication, development, and specialized activities specific to cellular differentiation). The results obtained for the other GO categories (“cellular component” and “molecular function”) confirmed the coherence of the above results (data not shown).
An identical study was performed in the rat (Supplementary Tables S12 and S13). While 85% of the probe sets from the rat RAE230A GeneChip referred to genes homologous to those of the human HG-U133A GeneChip, only 37% of the top− genes and 12% of the top+ genes identified in rats were homologous to human top− genes and top+ genes, respectively. This result extends again 1) the variability of the top status by species and 2) the clear distinction between top− and top+ genes, since the former appear to be more conserved across the two mammalian species studied. Although a much larger number of GO terms were found enriched in top− or top+ genes in humans, these GO terms shared the same functional biological criteria, and 88% of those found in rats were present in humans. These results confirmed the functional duality of the most negatively and positively connected genes, therefore providing strong evidence for a general organizing principle that might be linked to evolutionary processes.
In fact, the genes involved in the general cell functioning are likely to be more conserved across species than the genes involved in more specialized cellular functions. Such an observation has already been made for the ubiquitin-proteasome system, which represents the major pathway for intracellular protein degradation in eukaryotes. In particular, the 26S proteasome, responsible for the proteolysis process, has been shown to be highly conserved from yeast to mammals (21, 38). We found that numerous genes of this system were top− genes, suggesting that top− genes are likely to be conserved genes. This point is coherent with the fact that the top− genes were predominantly involved in general cellular functions. Finally, based on these observations, we could ask whether miRNA predicted targets are mainly concerned with general cellular functions.
The double connectivity, computed within a coexpression network for a group of biological conditions, provides a sense of the balance between the positive and negative regulations of the genes. As in the case of gene expression level, the connectivity is quite variable according to the tissue and the conditions of observation. However, these two parameters bring information of a different kind. Because the general knowledge of gene expression regulation is still very incomplete, it is impossible to have gold standard data that would conclusively validate the information shown by double connectivity. It should be borne in mind that, as with all methods based on network modeling, any results obtained from double connectivity require corroboration in subsequent analyses of biological networks to be definitively validated (25). The highly significant GO enrichments indicated by our meta-analysis are a powerful incentive to improve the interpreting methods that combine gene expression and connectivity. To achieve this aim, there are some conditions to be fulfilled: 1) sufficient size of groups, 2) well-controlled experimental conditions to allow differential analyses, and 3) a focus on fields already well studied and widely referenced. Under these conditions, the Pearson coefficient will have more statistical meaning, thereby reinforcing the accuracy of the analysis of differential connectivity. Our method works as a filter to select genes of interest, and the analyses of differential double gene connectivity must be combined with other methods such as those using differential gene expression. Furthermore, the interest in genes that are strongly connected but weakly expressed could provide evidence of regulations that have often been overlooked. From this perspective, it would be interesting to make use of new microarray technology that permits detection of low expression level (11). Such studies would improve the analysis and further interpretation of negative transcriptional regulation.
It is particularly exciting to see that hub genes are present in different proportions according to their sign, when they are transcription factors or miRNA targets. Concerning the transcription factors, the enrichment in hub+ genes in a large number of human data groups reveals a predominant involvement of transcription factors in activator processes. Furthermore, the enrichment of miRNA targets in hub− genes suggests that some hub− genes are involved in transcriptional inhibitory processes, which are induced by miRNAs (24). So our results reinforce the challenging recent findings about posttranscriptional gene expression regulation by noncoding RNAs (20, 27). They thus provide a framework to analyze in detail the involvement of genes in transcriptional regulation processes, as much activator as inhibitor processes.
Here we have introduced for the first time the concept of double connectivity using the sign of the correlation. It turns out to be a powerful tool for analyzing both the transcriptional regulatory properties of the genes and the functional organization of the coexpression network. This double connectivity reveals, in mammals, a very basic gene regulation rule between genes involved in specialized function and genes involved in general cellular function. This simple and general rule links a mathematical topological property (double connectivity) to a biological functional property of the genes (general or specialized cellular function). It might result from general mechanisms of the eukaryote evolution involving transcription factors and miRNAs. Thus, the analysis of double connectivity decisively enriches the analysis and interpretation of microarray data and should be useful for interpreting future data in a new way.
The Affymetrix experiments (6 groups out of 130) carried out in our laboratory were supported by grants from La Fondation de France and grants from Le Ministère de la Recherche.
We thank Robert Grant for the translation and editing of the paper, Dr. Michel Beylot for validating our functional biological criteria, Prof. Alain Puisieux, Prof. Charles Dumontet, Dr. Lilian McGregor, and Dr. John McGregor for their permanent interest in our work. We are also grateful to all the researchers that made their microarray data publicly available, to bioinformatics web resource developers, and to the GO consortium. Finally, we thank the two anonymous reviewers for their pertinent comments, which aided us to improve the manuscript.
↵1 The online version of this article contains supplementary material, which includes the list of selected data, the detailed results, the list of the GO functional groups carried out during the meta-analysis, and some lists of top genes according to their function.
Address for reprint requests and other correspondence: M.-P. Gustin, Université Lyon 1, Faculté de Pharmacie, EA4173 - Inserm ERI 22, 8 Ave. Rockefeller, 69373 Lyon Cedex 08, France (e-mail:).
The costs of publication of this article were defrayed in part by the payment of page charges. The article must therefore be hereby marked “advertisement” in accordance with 18 U.S.C. Section 1734 solely to indicate this fact.
- Copyright © 2008 the American Physiological Society