1 Bioinformatics Division, The Walter and Eliza Hall Institute of Medical Research, Melbourne, Australia
2 School of Engineering and Information Technology, Deakin University, Melbourne, Australia
3 Centre for Cancer Research, Monash Institute of Medical Research, Monash University, Melbourne, Australia
| ABSTRACT |
|---|
transcription factor
| INTRODUCTION |
|---|
In recent years, many computational approaches have been proposed to find putative cis-regulatory elements using diverse algorithms. To evaluate the accuracy of the algorithms, computationally discovered cis-regulatory elements are compared with the known TFBSs from public or proprietary databases or the published literature. This type of evaluation, however, does not validate or associate the putative cis-regulatory elements with biological functions. In this study we take advantage of gene ontology (GO) annotations to validate the putative cis-regulatory elements. GO categories of transcription factors are compared with the GO categories of genes whose promoters contain cis-regulatory elements putatively bound by these transcription factors. Transcription factors and their regulated genes are assumed to share some common GO categories. Another limitation of previous studies is that they did not investigate the relationships between the putative cis-regulatory elements. The expression of a gene is usually not regulated by a single transcription factor, but by clusters of transcription factors that might bind to different cis-regulatory elements. Therefore, by exploring the combinatorial regulation of gene expression, one can obtain a better understanding of the complex gene regulation machinery. This study will endeavor to determine the coexistence of putative cis-regulatory elements and will also take advantage of phylogenetic footprinting to improve the prediction accuracy in the search for cis-regulatory elements.
Phylogenetic footprinting has been demonstrated to be a very useful tool in the discovery of evolutionarily conserved cis-regulatory elements (13, 22, 29, 30). Three species, humans, mice, and Drosophila, will be compared to examine the conservation of putative cis-regulatory elements. In addition, this study will limit the search for putative cis-regulatory elements to the promoter regions immediately 5' of transcription start sites (TSSs) because most known cis-regulatory elements are located in the proximal promoter region of genes (2, 25).
Due to the combinatorial nature of transcription regulation, the elucidation of coexistence of cis-regulatory elements is important to the understanding of the mechanisms regulating gene expression (3, 20, 28). This will shed light on how transcription factors are combined to regulate gene expression and will also contribute to the reconstruction of gene regulatory networks. In the past, there have been few studies conducted on the identification of coexisting cis-regulatory elements using computational approaches (20). This is partly due to the complexity of this problem and the lack of enough data. In this study we take advantage of the UCSC genome database to analyze a large set of gene promoters to extract the correlation information among clusters of putative cis-regulatory elements.
Past studies using GO have focused on the GO annotation for individual genes (7, 29, 31). Many genes have been annotated with gene ontology categories in the GO database (10). However, biological processes are usually implemented by a set of genes that interact with each other. Annotation of individual genes, therefore, cannot be used to interpret the complex interactions among them correctly. This study correlates the GO categories of transcription factors with the GO categories of putatively regulated genes by taking advantage of gene annotation in the GO database. To our knowledge, this is the first study to report the matching of the biological functions of transcription factors with the biological functions of putatively regulated genes. This also serves as a validation of the discovered putative cis-regulatory elements because transcription factors are correlated to the putatively regulated genes through these elements in this study.
| METHODS |
|---|
Calculation of cis-regulatory element factor.
For each species, all gene promoters were aligned to their TSSs. These aligned promoters were then divided into 50 bins, each of which contained 20 bp from each promoter. The search for the putative cis-regulatory elements was allowed to cross bin boundaries.
All possible 8-mer sequences (65,536 sequences) were investigated for the gene promoter set from each species because only single-stranded DNA was available for each gene in the dataset. To calculate the statistical significance of an 8-mer sequence s, we first calculated its number of occurrences in each bin. This number was calculated as the sum of the occurrence numbers of s in each promoter at that bin. For a bin b in a promoter p, an occurrence of s in b meant that s was a substring of p and the first letter of s was located in bin b. So if s fell on a bin boundary, the position of s's first letter would determine which bin it belonged to. A promoter was counted multiple times if s appeared multiple times in that promoter. The cis-regulatory element factor (CREf) of s was then calculated as:
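$$\mathrm{CREf}(s) = \frac{x_{\max} - \tilde{x}}{x_3 - x_1} \tag{1}$$

where $x_{\max}$ was the maximal occurrence number of s in any bin, $\tilde{x}$ was the median value of all occurrence numbers of s in all bins, and $x_3$ and $x_1$ were the third quartile and first quartile of occurrence numbers of s in all bins, respectively.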
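The counting rule and the CREf statistic can be illustrated with a minimal Python sketch. The function names `bin_counts` and `cref` are hypothetical, and the exact form of Equation 1 is an assumption here: the sketch takes CREf to be the peak bin count minus the median, divided by the interquartile range, which is consistent with the median and quartile definitions in the text but is not confirmed by it.

```python
from statistics import median, quantiles

BIN_SIZE = 20   # bp per bin, as stated in the text
N_BINS = 50     # 50 bins cover a 1,000-bp promoter

def bin_counts(promoters, kmer, bin_size=BIN_SIZE, n_bins=N_BINS):
    """Count occurrences of `kmer` per bin over all promoters.

    An occurrence is assigned to the bin holding its first base, so a match
    may cross a bin boundary. Promoters are assumed to be at most
    n_bins * bin_size bases long and aligned at their TSSs.
    """
    counts = [0] * n_bins
    for seq in promoters:
        start = seq.find(kmer)
        while start != -1:
            counts[start // bin_size] += 1
            start = seq.find(kmer, start + 1)  # overlapping matches are counted
    return counts

def cref(counts):
    """ASSUMED form of CREf: (peak - median) / (Q3 - Q1)."""
    q1, _, q3 = quantiles(counts, n=4, method="inclusive")
    peak_height = max(counts) - median(counts)
    iqr = q3 - q1
    if iqr == 0:  # flat background: avoid division by zero
        return float("inf") if peak_height > 0 else 0.0
    return peak_height / iqr
```

For a real data set, `bin_counts` would be called once per 8-mer with the full list of aligned promoters, and `cref` applied to the resulting 50 counts.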
Calculation of Z-score.
For each species, 1,000 random sets of gene promoters were generated to evaluate the likelihood that an 8-mer sequence obtained its statistical significance (CREf) by chance. We used the seventh-order Markov model to generate random data sets (9). Each random set for humans contained 17,407 gene promoters of length 1,000. Each promoter started with a random 7-mer sequence, and the next base was determined randomly under the condition that frequencies of 8-mer sequences in the real human data set were maintained. This process was repeated until the entire promoter sequence was completed. The same procedure was applied to gene promoters from mice and Drosophila to generate their background sets of promoters. A CREf value of an 8-mer sequence was calculated on each random set of gene promoters. The Z-score of an 8-mer sequence, representing the level of confidence that the statistical significance of this 8-mer sequence was not obtained randomly, was then calculated based on its CREf value on the real set of promoters and CREf values on 1,000 random sets of promoters using the following equation:
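$$Z(s) = \frac{\mathrm{CREf}_{\mathrm{real}}(s) - \mu_{\mathrm{random}}(s)}{\sigma_{\mathrm{random}}(s)} \tag{2}$$

where $\mu_{\mathrm{random}}(s)$ and $\sigma_{\mathrm{random}}(s)$ were the mean and standard deviation of the CREf values of s on the 1,000 random sets of promoters.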
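The background model and Z-score can be sketched as follows. This is a hedged illustration, not the authors' implementation: the function names are hypothetical, the starting 7-mer is drawn from contexts observed in the real data, a uniform base is used when a context has no observed successor, and the sample standard deviation is used in the Z-score. All of these are assumptions.

```python
import random
from collections import Counter, defaultdict
from statistics import mean, stdev

BASES = "ACGT"

def train_markov(promoters, order=7):
    """Tabulate next-base counts for every `order`-mer context in the real promoters."""
    table = defaultdict(Counter)
    for seq in promoters:
        for i in range(len(seq) - order):
            table[seq[i:i + order]][seq[i + order]] += 1
    return table

def random_promoter(table, length=1000, order=7, rng=random):
    """Generate one background promoter from the trained context table."""
    seq = rng.choice(sorted(table))  # assumption: start from an observed context
    while len(seq) < length:
        successors = table.get(seq[-order:])
        if successors:
            bases, weights = zip(*successors.items())
            seq += rng.choices(bases, weights=weights)[0]
        else:
            seq += rng.choice(BASES)  # assumption: uniform fallback for unseen contexts
    return seq[:length]

def z_score(real_cref, random_crefs):
    """Standard Z-score of the real CREf against CREf values from random sets."""
    return (real_cref - mean(random_crefs)) / stdev(random_crefs)
```

Sampling the next base in proportion to the observed successor counts of the preceding 7-mer is what preserves the 8-mer frequencies of the real data in expectation.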
Calculation of P value.
To test the statistical significance of the number of common GO categories found between GO categories for transcription factors and GO categories for putatively regulated genes in Table 3, we calculated the probability of having m members in the intersection of two subsets taken from the entire set of GO categories.
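For two subsets containing s1 and s2 members, respectively, taken from a set of N GO categories, the number of possible combinations in which they have m members in common was

$$\binom{N}{m}\binom{N-m}{s_1-m}\binom{N-s_1}{s_2-m}$$

The total number of possible combinations of the two subsets was

$$\binom{N}{s_1}\binom{N}{s_2}$$

The probability of having m members in common was therefore

$$P(m) = \frac{\binom{N}{m}\binom{N-m}{s_1-m}\binom{N-s_1}{s_2-m}}{\binom{N}{s_1}\binom{N}{s_2}} \tag{3}$$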
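The overlap probability can be computed directly with exact integer binomials. The sketch below assumes the standard counting argument for two subsets drawn independently from N categories (choose the m shared members, then the rest of the first subset, then the rest of the second); the function name `overlap_probability` and the parameter `n_total` are hypothetical, and the published form of Equation 3 was not recoverable from the source.

```python
from math import comb

def overlap_probability(n_total, s1, s2, m):
    """Probability that two random subsets of sizes s1 and s2, drawn from
    n_total categories, share exactly m members (assumed reading of Eq. 3)."""
    if m < 0 or m > min(s1, s2):
        return 0.0
    favourable = (comb(n_total, m)
                  * comb(n_total - m, s1 - m)
                  * comb(n_total - s1, s2 - m))
    total = comb(n_total, s1) * comb(n_total, s2)
    return favourable / total
```

This expression simplifies to the hypergeometric probability, so summing it over m from an observed overlap upward would give a one-sided tail probability if that is the quantity of interest.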
| RESULTS |
|---|
Figure 1 outlined the process of identifying cis-regulatory elements using a combination of statistical analysis and phylogenetic footprinting techniques. A simple statistical algorithm was first applied to the set of promoters for each species. The CREf and Z-score, which represented the statistical significance of an 8-mer sequence and the likelihood of attaining that significance by chance, were calculated for each 8-mer sequence for each species (see METHODS for details). The 8-mer sequences meeting the selection criteria for each species were then selected for the next step of processing.
The selection criteria were 1) CREf ≥ 3, 2) Z ≥ 2, and 3) MaxBin ≥ 20; MaxBin is the maximal occurrence number of any 8-mer sequence in all bins in the real promoter set. The CREf value describes the strength of a peak in the frequency distribution of an 8-mer sequence (see METHODS). If an 8-mer sequence has a high abundance at a particular bin (e.g., bin 48 for TATA box) and low abundance at other bins, there would be a strong peak shown at that particular bin. It was our assumption that an exceptional peak at a particular location implied some biological function. This assumption was supported by known sites, an example of which is the binding of TATA box binding protein (TBP) to the core promoter regions of TATA-containing genes. Here, the binding site is located at –25 to –30 bp upstream from TSSs. Strong peaks were detected in bin 48 (–20 to –40 bp upstream from TSSs) for TATA box elements in this study (see Table 1). A Z-score was used to calculate the likelihood of obtaining a CREf value by chance for an 8-mer sequence. The calculation of Z-scores was based on 1,000 randomly generated promoter sets for each species (see METHODS). Any 8-mer sequences that had very low abundance in all bins were filtered out with MaxBin values. This is because these sequences were found in a very small number of genes, and it would be hard to tell if such 8-mer sequences were statistically or biologically significant in such a high-throughput analysis.
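The selection step lends itself to a short sketch. The thresholds 3, 2, and 20 come from the text; the comparison operators (taken here as "greater than or equal to"), the `KmerStats` container, and the reading of conservation as "selected in every species" are all assumptions made for illustration.

```python
from typing import NamedTuple

class KmerStats(NamedTuple):
    cref: float    # peak strength of the 8-mer
    z: float       # Z-score against random promoter sets
    max_bin: int   # largest per-bin occurrence count

def passes(stats, min_cref=3.0, min_z=2.0, min_max_bin=20):
    """Apply the three selection criteria (operators assumed to be >=)."""
    return (stats.cref >= min_cref
            and stats.z >= min_z
            and stats.max_bin >= min_max_bin)

def conserved_candidates(per_species):
    """per_species maps species name -> {kmer: KmerStats}.

    Returns 8-mers that pass the criteria in every species, an assumed
    reading of the phylogenetic footprinting step.
    """
    selected = [{k for k, s in table.items() if passes(s)}
                for table in per_species.values()]
    return set.intersection(*selected) if selected else set()
```

Filtering on statistics first and intersecting afterwards keeps the conservation step cheap, since only a small fraction of the 65,536 8-mers survive the per-species thresholds.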
These elements accounted for only 0.2% of all studied 8-mer sequences; however, they appear in the majority of gene promoters in each species (89.4% in humans, 86.6% in mice, and 82.1% in Drosophila). The calculation of the statistical significance of these 20 clusters in the area of 1,000–2,000 bp upstream from TSSs in each species revealed that up to 80% of 124 elements did not show statistical significance at all, and the remaining 20% showed much lower significance than in the area of 1,000 bp from TSSs (data not shown).
Coexistence of clusters of cis-regulatory elements.
Investigations into the coexistence of cis-regulatory elements in gene promoters will shed light on the understanding of the combinatorial regulation mechanism of gene expression. Table 2 shows the statistical results of coexistence of up to five clusters of cis-regulatory elements in human gene promoters. For each cluster (e.g., TATA), we have shown combinations of clusters (e.g., TATA, SP1) appearing in >20% of gene promoters that contain one or more putative cis-regulatory elements from that seeding cluster (TATA). Results were obtained by first fixing a seeding cluster and then looking for the clusters that could be combined with it. For example, cluster C would be combined with the seeding cluster S if at least 20% of human gene promoters that contained one or more elements from S also contained one or more elements from C. A third cluster, T, would be combined with C and S, if at least 20% of human gene promoters containing one or more elements from S contained not only one or more elements from C but also one or more elements from T.
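The seeding procedure described above can be sketched as a level-wise search. The function name `coexisting_combinations` and the input layout (a mapping from promoter identifier to the set of clusters found in that promoter) are hypothetical, and the threshold is applied as "at least 20%", following the worked example in the text.

```python
def coexisting_combinations(promoter_clusters, seed, threshold=0.20, max_size=5):
    """Return {combination: fraction} for combinations that start with `seed`.

    The fraction is always relative to the promoters containing the seed
    cluster, as in the text. Members after the seed are kept in sorted
    order so that each combination is reported once.
    """
    seeded = [present for present in promoter_clusters.values() if seed in present]
    if not seeded:
        return {}
    others = sorted(set().union(*seeded) - {seed})
    results = {}
    frontier = [(seed,)]
    while frontier:
        next_frontier = []
        for combo in frontier:
            if len(combo) >= max_size:
                continue
            last = combo[-1] if len(combo) > 1 else ""
            for cluster in others:
                if cluster <= last:
                    continue  # already covered by an earlier ordering
                candidate = combo + (cluster,)
                hits = sum(set(candidate) <= present for present in seeded)
                fraction = hits / len(seeded)
                if fraction >= threshold:
                    results[candidate] = fraction
                    next_frontier.append(candidate)
        frontier = next_frontier
    return results
```

Because the fraction can only shrink when another cluster is added, extending only the combinations that already pass the threshold finds every qualifying combination of up to five clusters.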
GO analysis.
We examined the relationship between GO categories of transcription factors binding to the putative cis-regulatory elements from a cluster and GO categories of genes (regulated genes) whose promoters contained at least one of these elements from the same cluster. We hypothesized that if the discovered cis-regulatory elements and the clusters they were grouped into were correct, the GO categories of transcription factors binding to the cluster of elements should match the GO categories of the corresponding regulated genes (7, 10, 29, 31).
For each cluster of elements, we first determined RefSeq names of human genes whose promoters contained one or more elements from that cluster, and then converted RefSeq names to HUGO symbols using MatchMiner (5). HUGO symbols of genes were then submitted to High-Throughput GOMiner to mine the GO categories (32). The false discovery rate threshold was set to be 0.1. Two files were analyzed by High-Throughput GOMiner: one was the total genes file including the list of all the genes, the other was the changed genes file, which was a zip file including 20 lists of genes for all the clusters. Names of genes in these two files were HUGO symbols. The result from High-Throughput GOMiner was then clustered by Genesis using hierarchical clustering and Pearson correlation (23). The result of clustering is shown in Fig. 2. The number of GO categories for the regulated genes for each cluster is shown in Table 3. Details of these GO categories can be seen in Supplemental Table S1 (Additional File 1).1
Results showed matches in 12 out of 15 clusters that have known transcription factors. No matches were found in three clusters with known transcription factors: TATA, Y box, and LSF. For TATA, TBP has the main function of initiating gene expression, which in fact bears no relevance to the functions of regulated genes. For Y box and LSF, only a small number of GO categories resulted for either transcription factors or regulated genes. This led to no correlation between them. This was probably brought about by the inaccuracy of the software (High-Throughput GOMiner and Genesis) or by the lack of knowledge on the function of transcription factors or regulated genes.
Five new clusters of putative cis-regulatory elements appeared in a large number of human gene promoters; however, only a few GO categories could be assigned to them. Also, no GO categories were found for clus16, clus18, and clus19. This suggests that these elements, which have high statistical significance, could have some biological functions that have not yet been recognized.
In summary, common GO categories between transcription factors and regulated genes have been shown in the majority of clusters with known transcription factors (see Table 3). This demonstrates the putative cis-regulatory elements discovered in this study and their groupings are biologically meaningful. Details of the common GO categories can be seen in Supplemental Table S3 (Additional File 3).
Comparison with other lists of putative cis-regulatory elements or motifs.
Of the 124 putative cis-regulatory elements found in this study, only 24 sequences (19.4%) were also identified by FitzGerald et al. (9). Our statistical algorithm is similar to FitzGerald's algorithm from the point of view of splitting promoters into bins and calculating the statistical significance of 8-mer sequences according to their occurrence frequencies in different bins. However, three key differences led to very different results. The first is the median value used in this method rather than the mean value used in the research by FitzGerald et al. The median is less sensitive than the mean to outlying values, so an unusually high bin count stands out more clearly against it. Thus putative cis-regulatory elements discovered in this study were potentially more specific. The second is a larger set of human promoters used in this study (17,407 human promoters in this study and only 13,010 human promoters in FitzGerald et al.'s study). Investigating more genes should improve prediction accuracy. The last and the most important difference is the phylogenetic footprinting technique used in this study. This technique helped to eliminate 8-mer sequences that had high statistical significance in human promoters but were not conserved across the three species of humans, mice, and Drosophila. Conservation is an inherent feature of cis-regulatory elements, and we believe that conservation analysis could greatly reduce false positives in the discovery of putative cis-regulatory elements.
Out of 124 elements, 83 (66.9%) matched the motifs found by Xie et al. (30). In their study, Xie et al. searched motifs of diverse lengths. A cis-regulatory element identified in our study would be deemed as a match with a motif identified by Xie et al. if the cis-regulatory element is one instance of the motif or an instance of the motif is part of the element. Xie et al. conducted a comparative analysis of the human, mouse, rat, and dog genomes, which are more closely related in the course of evolution compared with the species selected by this study (humans vs. Drosophila). Comparison between more distantly related species could increase the chance of finding cis-regulatory elements having more fundamental biological roles and of eliminating false positives. The combination of comparisons between both closely related species and distantly related species will not only provide more candidates for examination but also set a more stringent criterion for eliminating those 8-mer sequences only having the statistical significance, which were difficult to filter out by comparing only closely related species. Thus the putative cis-regulatory elements discovered in this study should have more fundamental biological functions and be more accurate than Xie et al.'s finding, even though the number of motifs here was less than that found by Xie et al. (30).
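The matching rule used for this comparison can be expressed compactly, assuming the motifs are written as IUPAC consensus strings (an assumption; the text does not state the motif representation). The names `motif_regex` and `matches_motif` are hypothetical, and reverse-complement matching is deliberately left out because the text does not mention it.

```python
import re

# IUPAC nucleotide codes mapped to regular-expression character classes
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "[AG]", "Y": "[CT]", "S": "[CG]", "W": "[AT]",
    "K": "[GT]", "M": "[AC]", "B": "[CGT]", "D": "[AGT]",
    "H": "[ACT]", "V": "[ACG]", "N": "[ACGT]",
}

def motif_regex(motif):
    """Compile an IUPAC consensus motif into a regular expression."""
    return re.compile("".join(IUPAC[base] for base in motif.upper()))

def matches_motif(element, motif):
    """True if the element is an instance of the motif, or if an instance
    of the motif occurs inside the element. A single search covers both
    cases, since a full-length match is also a substring match."""
    return motif_regex(motif).search(element.upper()) is not None
```

A motif longer than the 8-mer element can never match under this sketch; whether such cases counted as matches in the original comparison is not stated.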
Eighteen out of 124 elements (14.5%) matched the TRANSFAC motifs inferred by Xie et al. (30). Only a small portion of the putative cis-regulatory elements we found matched the TRANSFAC motifs. One of the reasons is perhaps the inaccuracy of these TRANSFAC motifs assembled by Xie et al., which might not match well with the TRANSFAC database. Another reason is possibly the stringent criteria for selecting candidate cis-regulatory elements in this study, which might filter out some real cis-regulatory elements.
Put together, 73.4% of 124 elements discovered in this study were also found by other studies. Four putative cis-regulatory elements in the five new clusters matched the motifs identified in Xie et al.'s (30) study, which also found no known transcription factors.
| DISCUSSION |
|---|
The putative cis-regulatory elements discovered in this study show high abundance at particular sites of human promoters. For example, the overwhelming majority of the elements in the cluster "TATA" show high abundance at bin 48, which is around –20 bp to –40 bp upstream from TSSs. This site is consistent with the assumed location of TATA box in promoters of human genes. The high CREf values these elements have obtained indicate they do not show such high abundance at other sites of the 1-kb-long promoters. Rather, they only show strong peaks at particular sites. Supplemental Figure S1 shows frequency distributions of representative putative cis-regulatory elements that have the biggest CREf values in each cluster. The specificity of these elements is clearly observed in that figure.
When the approach of comparative genomics is used to search DNA sequences that are conserved during the course of evolution, there is always a chance that the identified DNA sequences are in fact general nonspecific sequences that have little or no relevance to the target of the discovery process. However, because the general nonspecific sequences were assumed to be randomly distributed across the genome, they should have little or no statistical significance, and proper statistical algorithms should be able to filter them out. In this study we conducted statistical analysis to filter out those 8-mer sequences that have low statistical significance (CREf values) before carrying out the conservation analysis. Comparison between distantly related organisms (humans/mice with Drosophila) will further remove the nonspecific sequences from the resulting list of putative cis-regulatory elements. We are confident that the putative cis-regulatory elements discovered in this study stand out from a large set of candidate cis-regulatory elements because they are subject to selection pressure in the course of evolution.
In this study, GO classification was applied to validate the putative cis-regulatory elements we discovered and the clusters these elements have formed. There are good matches between GO categories of the transcription factors binding to these elements and GO categories of genes whose promoters contain these elements. Some biological evidence has been found that supports these matches. For example, Adnane et al. (1) suggested that GGTI-298-mediated upregulation of p21WAF1/CIP1 involved an increase in the amount of DNA-bound Sp1–Sp3 and enhancement of Sp1 transcriptional activity. The dominant negative mutant of the small GTPase RhoA was able to activate p21WAF1/CIP1 and constitutively active RhoA repressed p21WAF1/CIP1. In this study, the small GTPase RhoA was also found in the list of genes putatively regulated by Sp1, and they share the same GO category "small GTPase mediated signal_transduction" (GO:0007264). Additionally the location of the Sp1 binding site Adnane et al. found is consistent with the location of Sp1 binding sites determined by this study.
The GO category "peripheral nervous system development" is one of the common categories found for EGR and its putatively regulated genes (cluster 12). Nerve growth factor (NGF) plays a critical role in the development and survival of neurons in the peripheral nervous system (21). Warner et al. (26) described that stable expression of Egr2 is specifically associated with the onset of myelination in the peripheral nervous system. Egr-2 or Krox-20 has also been observed in the developing mammalian hindbrain (16).
NRF-1 is a nuclear-encoded gene product that has been shown to be important for the transcriptional regulation of multiple mitochondrial genes involved in organelle biogenesis and cellular respiration. Deletion or mutation of the sequences containing the NRF-1 site at positions –61 bp to –49 bp upstream from TSSs essentially abolished CXCR4 promoter activity. CXCR4 is both a chemokine receptor and entry coreceptor for T-cell line-adapted human immunodeficiency virus type 1 (27). In this study, NRF-1 is found to share the GO category "organelle organization and biogenesis" (GO:0006996) with its putatively regulated genes. Again, the determined NRF-1 binding site is consistent with the binding sites found by Wegner et al. (27).
G/C accounts for 53.4% of promoter regions investigated in this study, which is 6.8 percentage points higher than the A/T content (46.6%) in human promoter regions. However, some G/C-rich putative cis-regulatory elements (e.g., some elements in the SP1 cluster) are 10 times more abundant than A/T-rich putative cis-regulatory elements (e.g., elements in the TATA box cluster). Therefore, the biased G/C content in human promoter regions only makes a very small contribution to the abundance of G/C-rich cis-regulatory elements. We believe that the binding requirements of transcription factors (e.g., SP1, EGR, NRF-1, etc.) in the promoter regions are one of the major factors that could explain the number of G/C-rich elements in these regions. Additionally, the peaks shown at particular sites by these elements can also support their assumed role in the binding of transcription factors to the promoter regions of genes.
Positions of cis-regulatory elements in the promoter region are important for them to be identified and bound by transcription factors to regulate the expression of genes. Table 1 shows the position (bin number) of each discovered putative cis-regulatory element where it has the maximum occurrence number in all bins. The TATA box is well known to be located –25 bp to –30 bp upstream from the TSSs. This study shows that the overwhelming majority of regulatory elements in the TATA cluster are located at bin 48, which is –20 bp to –40 bp upstream from TSSs. This supports the known TATA box location.
Most of the elements in the CCAAT cluster lie in bins 45 and 46, i.e., –100 bp to –60 bp upstream from TSSs. Analysis of 5'-deletion and substitution mutants in HeLa nuclear extracts has shown that the basal activity of the promoter depends primarily on a CCAAT box sequence located at –65 (17). CCAAT box, located between –72 and –77 relative to TSSs, is one of the three regions where mutations would result in a significant decrease in the level of transcription (18). AP-1 (HeLa cell-activating protein 1) sites residing within two promoter elements of the osteocalcin gene bind the Fos-Jun protein complex: the osteocalcin box (OC box; nucleotides –99 to –76), which contains a CCAAT motif as a central element and influences tissue-specific basal levels of osteocalcin gene transcription (19).
SP1 binding sites were found in bins 46, 47, and 48, i.e., –80 bp to –20 bp upstream from TSSs. A conserved Sp1 site was found at –43 bp to –38 bp upstream from TSSs, which is associated with maximum reporter gene activity (8). A sequence at –60 bp binds the transcription factor Sp1 in vitro and in vivo and is essential for CD11b promoter activity (6). Also, the upstream promoter region of the AIDS virus LTR lies between –45 and –77 and contains three tandem, closely spaced SP1 binding sites of variable affinity (14).
NRF-1 binding sites were found to be mainly located in bins 47 and 48, which are –60 bp to –20 bp upstream from TSSs. This supports the known NRF-1 binding sites, i.e., –61 bp to –49 bp upstream from TSSs (27).
Elements in the CRE cluster appear in bins 47 and 48, i.e., –60 bp to –20 bp upstream from TSSs. Cotransfection experiments showed that the cyclin D1 promoter is inducible by c-Jun and that this induction is mediated predominantly through the protected putative CRE at –52 (12). One of the two identified putative cAMP-response elements appeared at position –38. Functional analysis showed that this element is necessary for complete PKA induction (11).
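The bin-to-coordinate correspondence used in the preceding paragraphs can be checked with a small helper. The sketch assumes bins are numbered from 0 at the distal end of a 1,000-bp promoter, so that bin 48 spans –40 to –20; this numbering is inferred from the ranges quoted in the text rather than stated explicitly, and the handling of the exact bin edges is likewise an assumption. The function names are hypothetical.

```python
BIN_SIZE = 20   # bp per bin
N_BINS = 50     # bins per 1,000-bp promoter

def bin_to_range(bin_index):
    """Return (distal, proximal) coordinates relative to the TSS for a bin,
    assuming 0-based bin numbering that starts 1,000 bp upstream."""
    distal = -(N_BINS - bin_index) * BIN_SIZE
    return distal, distal + BIN_SIZE

def position_to_bin(position):
    """Map an upstream position (e.g. -30) to its assumed bin number."""
    if not -N_BINS * BIN_SIZE <= position < 0:
        raise ValueError("position must lie within the 1,000 bp upstream of the TSS")
    return (N_BINS * BIN_SIZE + position) // BIN_SIZE
```

Under this numbering, the literature positions cited above fall in the reported bins: a TATA box at –30 maps to bin 48, a CCAAT box at –65 to bin 46, and a CRE at –52 to bin 47.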
Conclusions
By combining statistical analysis and the phylogenetic footprinting technique, this study yielded 124 putative cis-regulatory elements that not only had high statistical significance but also were well conserved across the human, mouse, and Drosophila species. These elements were grouped into 20 clusters, of which 15 clusters had known transcription factors. Examination of the coexistence of these clusters found that SP1, EGR, and NRF-1 were the dominant clusters that appeared most frequently in combinations of up to five clusters, implying that the CpG island is an important part of human gene promoters. GO analysis revealed that in most clusters GO categories of transcription factors matched GO categories of regulated genes. However, only a few GO categories have been found for the genes whose promoters contain cis-regulatory elements from the new clusters despite their high statistical significance and conservation across the three species. These elements could potentially represent good candidates for further systematic experimental evaluations.
| ACKNOWLEDGMENTS |
|---|
We thank Barry R. Zeeberg for the help on using MatchMiner, GOMiner, and Genesis; Andrey Shlyakhtenko for the detailed explanation of his TFBS discovery algorithm; Fuchun Huang for the helpful discussion on the statistical algorithm; and Aneta Dowsing, Shamith Samarjiwa, and Lingdi Zhou for improving the English of this manuscript. We are also grateful to the anonymous referees whose comments and suggestions allowed a significant improvement of this work. This study used high-performance computing facilities in APAC and VPAC, Australia.
| FOOTNOTES |
|---|
Article published online before print. See web site for date of publication (http://physiolgenomics.physiology.org).
1 The online version of this article contains supplemental material.
| REFERENCES |
|---|