Physiol. Genomics 25: 1-8, 2006. First published January 3, 2006; doi:10.1152/physiolgenomics.00166.2005
1094-8341/06 $8.00
Received 11 July 2005; accepted in final form 29 November 2005.
© 2006 American Physiological Society

Perspectives

From sequences to a functional unit

Ashwin Sivakumar, Christopher Wilton and Liisa Holm

Institute of Biotechnology and Department of Genetics, University of Helsinki, Helsinki, Finland


    ABSTRACT
 TOP
 ABSTRACT
 INTRODUCTION
 Glossary
 THE MODULE SPACE OF...
 THE IDEA OF SEQUENCE-LEVEL...
 SOME MODULES ARE MOBILE
 HIERARCHY IN FUNCTIONAL...
 THE ESCHERICHIA COLI STORY
 DISCUSSION
 GENE FUSION EVENTS AS...
 CONCLUSIONS
 GRANTS
 REFERENCES
 
Functional insights at the gene product level would help the drug discovery industry to effectively tap targets for therapeutics and biomedical applications. A complete functional unit can be multidomain, and it is the co-occurrence and interaction of these multiple domains that determine the function and functional diversity of their gene products. With at least 10% of genes from complete genomes existing in fused form, identifying gene fusion events helps us categorize the protein universe into distinct functional units with only sequence information.

drug discovery; component proteins; composite proteins; superfamily; functional annotation; family; subfamily; gene fusion


    INTRODUCTION
The speed of the drug discovery process is often influenced by computational insights into protein function. The process would be helped by information on the roles played by complete functional units, the reactions they catalyze, and the residues responsible for their specificity. There are a number of computational domain classification schemes (2, 20), which decompose proteins into families of domains. The goal of domain classifications such as Pfam and ADDA is maximal unification of structurally conserved domains. Domain boundaries are placed around recurrent blocks as seen in multiple alignments of the sequences. Ideally, these domain boundaries coincide with the boundaries of structural domains. As families are represented by multiple alignments, the alignment implies comparative models whenever there is a member of known structure in the family. These domain family classifications are thus useful for structure prediction. Members of a domain family typically share a generic molecular function such as involvement in protein-protein interactions. A domain family may contain domains that are related by gene duplication. Duplicated genes often have divergent functions because the duplicated copy might evolve a new function or affect the function of its interacting partner. Moreover, the physiological function of a gene product can be more than the sum of its constituent domains. We claim that a complete functional unit can be multidomain, and it is the co-occurrence and interaction of these multiple domains that determine the function and functional diversity of their gene products. We propose a classification of complete functional units, which we call modules.

A fundamental principle is that each module must occur as a whole gene product in some genome. The novel scheme cuts sequences in fewer places compared with domain classification algorithms, which place domain boundaries strictly around segments that can be multiple aligned and either leave flanking segments unassigned or classify them into "singleton" families. Modules include the flanking segments, unless the segment occurs as a complete gene product elsewhere. Modules are basically complete proteins by themselves but occur as components of larger fused proteins (composite proteins). Thus we only undo gene fusion events where there is a clearly traceable origin of one contemporary gene product as the sum of two (or more) complete contemporary gene products. Consequently, modules can be made of one or more domains. We propose that within each module superfamily there are sequence patterns that specifically determine the molecular function of functionally divergent families.

A reliable functional classification schema of protein sequences at clearly defined levels of hierarchy would be an important tool in functional genomics. This tool would screen potential functional information for any given protein. Functional annotations available through whole genome analysis and traditional homology-based methods such as BLAST can be erroneous for a number of reasons (4, 16). Moreover, some annotators annotate the domain while others annotate the gene product, which allows errors to propagate when putative function is assigned through sequence similarity. The target protein might be multidomain; thus the gene product's function may differ from a domain's function. Multidomain proteins are especially tricky to annotate because some have been formed by the fusion of two or more unrelated component proteins that have different ancestors (type I) (33, 43), whereas others are made of component proteins that share the same evolutionary history (type II; these are a product of internal gene duplication, retroposition, lateral gene transfer, or de novo extension of an existing gene). The mechanisms of gene structure evolution are described in detail by Long et al. (35). The nature and function of type I proteins are not clearly understood. Experimental reports (10, 15, 24, 28, 45, 50) have cited such fusions as introducing a novel function in the fused protein. To produce a relevant functional classification system of gene products, it would be necessary to confidently group sequences into superfamilies and families that are each the product of a common evolutionary event, from which point on their members evolve under common functional selection pressure. This can be done precisely if we can deal with gene fusion events and represent the proteins as complete functional units.

A pivotal resource for the challenges just addressed would be a hierarchical functional classification schema of gene products or complete functional units (modules). Here we introduce a concept with which we could assign patterns at a family level to distinctive functional categories. For example, Gene Ontology (GO) (6, 31) catalogues relevant functional categories. The proposed system is likely to generate a "useful" evolutionary scenario in which the number of superfamilies would theoretically correspond to the number of ancestral proteins responsible for the current set of contemporary modules. The estimated number of contemporary superfamilies is around 1,000, within which the homologs are likely to have a variety of cellular and molecular functions (5, 9). This system is aimed at systematically reconstructing the much-required functional scenario by an automated sequence-based approach.


    Glossary
Superfamily
Homologous group of proteins having a common ancestor

Family
Smaller group of sequences within a superfamily, which share a function specific to this group

Sequence pattern/motifs
Regular expressions that can be represented as PROSITE-like patterns

Specificity motifs
Motifs or sequence patterns specific to a family that usually contain residues involved in the functional specificity of the family. Patterns are considered specific if the pattern, when searched against any nonredundant database of proteins such as Uniprot, gives the same subset with which we started.

Paralogous groups/gene families
Groups of sequences within a genome thought to have a common evolutionary origin or ancestor. These genes have arisen by duplication of an ancestral gene.

Multimodular protein/composite protein
Protein that exists as a fused gene in one or more genomes while occurring as two or more separate homologous genes in the same/another genome

Module/component protein
Complete functional protein that may exist as one component of a fused gene in the same/another genome

Gene fusion event
Occurs through the merging of two or more contemporary sequences. These genes need not be adjacent to each other on the chromosome; gene proximity is not a requirement. This differs from another usage of the term gene fusion, in which genes are required to be adjacent to each other.
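
The glossary entries for sequence patterns and specificity motifs can be made concrete with a small sketch. The code below is illustrative only: it assumes the common PROSITE pattern conventions (`x` for any residue, `[...]` for allowed residues, `{...}` for excluded residues, `(n)` or `(n,m)` for repeats), and the function names `prosite_to_regex` and `is_specific`, the toy sequences, and the example pattern are hypothetical and not taken from our pipeline. A pattern counts as specific here if searching a database with it returns exactly the family we started with.

```python
import re


def prosite_to_regex(pattern: str) -> str:
    """Translate a PROSITE-like pattern, e.g. 'C-x(2,4)-[ST]-{P}', into a Python regex."""
    parts = []
    for element in pattern.rstrip(".").split("-"):
        prefix = suffix = ""
        if element.startswith("<"):      # anchored at the NH2 terminus
            prefix, element = "^", element[1:]
        if element.endswith(">"):        # anchored at the COOH terminus
            suffix, element = "$", element[:-1]
        # Separate the residue specification from an optional repeat such as (3) or (2,4).
        core, low, high = re.fullmatch(r"(.+?)(?:\((\d+)(?:,(\d+))?\))?", element).groups()
        if core == "x":
            core = "."                   # any residue
        elif core.startswith("{") and core.endswith("}"):
            core = "[^" + core[1:-1] + "]"   # excluded residues
        repeat = ""
        if low is not None:
            repeat = "{" + low + ("," + high if high else "") + "}"
        parts.append(prefix + core + repeat + suffix)
    return "".join(parts)


def is_specific(pattern: str, family_ids, database) -> bool:
    """True if the pattern hits exactly the family members in the database.

    `database` maps sequence identifiers to sequences (a stand-in for a
    nonredundant protein database).
    """
    regex = re.compile(prosite_to_regex(pattern))
    hits = {seq_id for seq_id, seq in database.items() if regex.search(seq)}
    return hits == set(family_ids)


# Toy data: two family members and two unrelated sequences (hypothetical).
toy_db = {
    "famA_1": "MKCAASTLLG",
    "famA_2": "GGCWWWSAKK",
    "other_1": "MKCAASPLLG",
    "other_2": "MKLAASTLLG",
}
print(is_specific("C-x(2,4)-[ST]-{P}", {"famA_1", "famA_2"}, toy_db))  # True
print(is_specific("C-x(2,4)-[ST]", {"famA_1", "famA_2"}, toy_db))      # False: also hits other_1
```

In this toy case the shorter pattern is not specific, because it also matches a sequence outside the family; adding the excluded proline restores specificity.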


    THE MODULE SPACE OF "COMPONENT" PROTEINS
The aim of the module concept is to identify and separate fused proteins. A protein is multimodular ("composite") if two or more of its nonoverlapping regions align with two or more homologous proteins ("component"). Serres and Riley (47) have discussed in detail how to identify modules by comparing a small data set of bacterial and archaeal protein sequences. They also used experimental information to identify modules. An example of a module is shown in Fig. 1. The identification of modules can be automated by performing an all-against-all pairwise alignment of all sequences available in nonredundant databases where alignment regions are not allowed to overlap (Fig. 2A).


Fig. 1. FADb. The Escherichia coli FADb α-subunit has 4 experimentally verified enzymatic activities associated with it. Two of the activities, enoyl-CoA hydratase (EC 4.2.1.17) and 3-hydroxyacyl-CoA epimerase (EC 5.1.2.3), are carried out by the same NH2-terminal active site. This complex has 2 modules. There is 1 structural domain associated with module I, which is enoyl-CoA hydratase (crotonases), whereas module II comprises 3 structural domains. The other 2 known activities (3-hydroxyacyl-CoA dehydrogenase and Δ3-cis-Δ2-trans-enoyl-CoA isomerase) are carried out by module II.

 

Fig. 2. A: identification of modules. A "nonoverlapping" alignment between 4 proteins is shown. Protein A is a fused protein, as 2 contemporary proteins (proteins C and D) align to distinct nonoverlapping regions within protein A. Now all the component proteins (proteins B, C, D) can be referred to as modules, whereas protein A comprises two separate modules (module I and module II). B: clustering of module space to superfamilies. The green oval represents the module space. Here the 4 proteins (A–D) from the example in A are represented as 5 modules. Protein A has been cut into 2 separate modules; ∞ represents other modules making up the module space. This module space can be clustered into n distinct superfamilies, each having modules sharing a common ancestor. The orange oval represents 1 such superfamily into which proteins from the example in A are clustered. There are 2 possible scenarios. If modules I and II from protein A were homologous, they would be clustered into the same superfamily (case I), whereas if module II of protein A was not homologous, only module I of protein A would be a part of the superfamily, along with proteins B and C, which were unimodular. This scenario is represented by case II. In this case, the modules are divided into 2 distinct superfamilies. C: modularization of protein space. Modules are defined for protein A based on n pairwise alignments with other proteins at a given log e-value cutoff. The first 2 alignments give rise to 2 distinct module definitions; subsequently, as the remaining alignments overlap with existing modules, the module definitions are extended according to these alignments on a progressive basis. In a particular alignment case (striped bar), there is a conflict with more than 1 existing module, so the alignment is ignored. M1 and M2 are the final module definitions, and hence protein A is bimodular.
We defined an input set of 237,873 nonredundant protein sequences obtained from all genomes available in NCBI genomes by FTP (ftp://ftp.ncbi.nlm.nih.gov/genomes). Modules were computed for three different e-value cutoffs (10⁻², 10⁻⁵, 10⁻⁹): 10⁻⁵ is a safe default commonly used in database searches, whereas 10⁻² and 10⁻⁹ were also tested to note the trends in module definition. We also experimented with terminal cutoffs of 40 and 100 residues: using both the complete data set and just the E. coli K12 genome, we observed that varying the terminal cutoff has a negligible effect on module definitions. We also observed that the total numbers of modules and fused genes were considerably higher at the e-value cutoff of 10⁻⁵ than at 10⁻⁹; this is primarily due to the larger set of alignments available at the less stringent cutoff of 10⁻⁵.

 
We use a precomputed internal database of nonredundant sequences aligned all against all. Starting with the shortest sequence in the database, we apply a set of rules to define the modules. 1) All sequences <80 residues long are ignored to minimize the chances of fragments being mistaken for functional proteins. 2) We choose an e-value cutoff of 10⁻⁵ for the alignments [other cutoffs were also tested (Tables 1 and 2)], and the alignments with e-values above this cutoff are ignored. This step helps us obtain clear boundaries between component proteins aligning to composite proteins, which helps reduce noise (Fig. 2A). 3) For each alignment, if either of the unaligned terminal ends of the subject is >40 residues long, the alignment is discarded. The change in terminal cutoffs has very little effect on the module definitions (Tables 1 and 2). 4) The remaining alignments are then sorted from shortest to longest. For each alignment there are three possible scenarios. If there is no overlap with any existing module definition, we define a new module for the query based on this alignment (Fig. 2C). If there is an overlap with one existing module definition, the module boundaries may be extended according to the new alignment. If there is an overlap with more than one existing module, we discard the alignment and proceed to the next.
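
The four rules above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the code used to produce Tables 1 and 2: the `Alignment` record and the `define_modules` function are hypothetical names, coordinates are assumed to be 1-based and inclusive, the length filter of rule 1 is applied to both query and subject, alignment length is measured on the query, and an overlapping alignment always extends the single module it touches.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Alignment:
    q_start: int    # first aligned residue on the query (1-based, inclusive)
    q_end: int      # last aligned residue on the query
    s_start: int    # first aligned residue on the subject
    s_end: int      # last aligned residue on the subject
    s_length: int   # full length of the subject sequence
    evalue: float


def define_modules(query_length, alignments, max_evalue=1e-5,
                   terminal_cutoff=40, min_length=80):
    """Return module boundaries (start, end) on the query protein."""
    if query_length < min_length:                        # rule 1
        return []
    kept = [
        a for a in alignments
        if a.s_length >= min_length                      # rule 1, applied to the subject
        and a.evalue <= max_evalue                       # rule 2
        and a.s_start - 1 <= terminal_cutoff             # rule 3: NH2-terminal overhang
        and a.s_length - a.s_end <= terminal_cutoff      # rule 3: COOH-terminal overhang
    ]
    kept.sort(key=lambda a: a.q_end - a.q_start)         # rule 4: shortest first
    modules = []                                         # mutable [start, end] pairs
    for a in kept:
        overlapping = [m for m in modules
                       if a.q_start <= m[1] and m[0] <= a.q_end]
        if not overlapping:
            modules.append([a.q_start, a.q_end])         # new module
        elif len(overlapping) == 1:
            module = overlapping[0]                      # extend the one module touched
            module[0] = min(module[0], a.q_start)
            module[1] = max(module[1], a.q_end)
        # An alignment overlapping more than one module is ignored.
    return sorted(tuple(m) for m in modules)


# Toy version of protein A from Fig. 2C (all numbers are made up).
toy_alignments = [
    Alignment(10, 200, 1, 191, 195, 1e-40),    # protein C: defines module 1
    Alignment(260, 480, 5, 225, 230, 1e-35),   # protein D: defines module 2
    Alignment(5, 230, 1, 226, 230, 1e-20),     # protein B: extends module 1
    Alignment(150, 420, 1, 271, 280, 1e-15),   # spans both modules: ignored
    Alignment(300, 400, 1, 101, 110, 1e-3),    # e-value above cutoff: ignored
    Alignment(235, 255, 60, 80, 200, 1e-20),   # long unaligned subject end: ignored
]
print(define_modules(500, toy_alignments))     # [(5, 230), (260, 480)]
```

Because an alignment is only used for extension when it touches a single module, the extended module can never run into a neighboring one, so the final definitions stay nonoverlapping.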


Table 1. Summary of module space

 

Table 2. Summary of modules from Escherichia coli K12 genome

 
A case of amino-transferases is discussed below where a larger protein marked in red (see Supplemental Fig. IV; available at the Physiological Genomics web site) aligns in part to sequences (marked as yellow and blue) in a distinct region. Nevertheless, the larger protein is still a single module, as it does not align to a separate protein in the remaining nonoverlapping region. We propose to build a so-called module space after dealing with gene fusions with this protocol. This would help us group not just homologous domains but also homologous sequences. Homologous sequences are proteins that have evolved new functions but share a common ancestor. Consequently, homologous sequences would share a common domain. New domains (if any) contained within such sequences are recruited by various natural evolutionary events (35).


    THE IDEA OF SEQUENCE-LEVEL SUPERFAMILY
The sequence space of component proteins (module space) has first to be clustered into distinct superfamilies. The superfamily in this context would comprise a set of proteins having a common ancestor (sharing a common domain). These sequence-level superfamilies can be categorized with many approaches. An approach we are using is an exhaustive all-against-all pairwise alignment of the module space followed by a clustering of statistically significant pairs into groups or superfamilies by making sure that there is a common motif holding all the members together. For the significant pairs, we make sure that the alignment length extends to at least 80 residues.
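
As a rough illustration of the clustering step, the sketch below groups modules by linking every significant pair whose alignment extends to at least 80 residues. Several things here are assumptions rather than a description of our procedure: the single-linkage (union-find) grouping, the e-value threshold of 10⁻⁵ for calling a pair significant, and the function name `cluster_superfamilies`. The additional requirement that a common motif holds all members together is not implemented.

```python
def cluster_superfamilies(modules, pairs, max_evalue=1e-5, min_aligned=80):
    """Group modules into candidate superfamilies by single linkage.

    `pairs` holds tuples (module_a, module_b, evalue, aligned_length).
    """
    parent = {m: m for m in modules}

    def find(x):
        # Follow parent links to the cluster representative, compressing the path.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b, evalue, aligned_length in pairs:
        if evalue <= max_evalue and aligned_length >= min_aligned:
            parent[find(a)] = find(b)          # merge the two clusters

    clusters = {}
    for m in modules:
        clusters.setdefault(find(m), []).append(m)
    return sorted(sorted(members) for members in clusters.values())


# Toy module space from Fig. 2B, case II (scores are made up).
toy_modules = ["A_I", "A_II", "B", "C", "D"]
toy_pairs = [
    ("A_I", "B", 1e-30, 150),
    ("A_I", "C", 1e-12, 120),
    ("A_II", "D", 1e-40, 200),
    ("B", "D", 1e-6, 45),      # too short: ignored
    ("C", "D", 1e-2, 150),     # not significant: ignored
]
print(cluster_superfamilies(toy_modules, toy_pairs))
# [['A_I', 'B', 'C'], ['A_II', 'D']]
```

Single linkage is permissive: one spurious pair can merge two unrelated groups, which is one reason a shared motif is required in addition to pairwise significance.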

Riley and colleagues (46) used a transitive alignment approach in a different context to generate "paralogous groups" in the Escherichia coli K12 genome. They used a pairwise similarity of 175 PAM units as a cutoff for producing significant pairs with a condition that they were aligned over 83 amino acids or more. This approach is useful, but it would miss remote homologs especially when applied to the complete module space comprising modules spanning genomes from various kingdoms.

The proposed sequence-based superfamily would include component proteins with a common evolutionary origin but divergent functions (Fig. 2B). If we speak in terms of enzymes, members of a superfamily classification may belong to different Enzyme Commission (EC) classes; at the top level these are oxidoreductases, transferases, hydrolases, lyases, isomerases, and ligases. But members of a given family within a superfamily would belong to the same class and catalyze the same reaction. An interesting case of related amino-transferases is shown in the Supplemental Material (Supplemental Figs. I–III) as an example of a superfamily membership in the proposed classification scenario. We see that there is a perfect classification of these related amino-transferases into functionally distinct families.


    SOME MODULES ARE MOBILE
Multidomain proteins may have multiple functions or at least multiple interaction partners. Multidomain proteins also often evolve into new gene products. Hegyi and Gerstein (23) discuss these issues with multidomain proteins in detail. It is also known that most proteins encoded in eukaryotic genomes are multidomain. The "modular classification" helps us to efficiently deal with such proteins. Because modules are complete proteins by themselves (those proteins that are not a product of a gene fusion event), they would theoretically encode both the catalytic binding site and the smaller substrate, cofactor, and regulator binding pockets. We require that one domain is common throughout the superfamily set, which would mean that modules belonging to a superfamily evolved from a common ancestor. Many modules are mobile in nature. Evolutionarily mobile domains ("mobile modules" in Ref. 13) are promiscuous, and thus in the context of the proposed modular classification such modules having mobile domains may be members of more than one superfamily. One such example would be modules that contain Src homology (SH)3 (27, 40) and SH2 (27, 40, 41) domain types, whose binding specificity with other domains is responsible for the molecular function of such domains as well as the gene product function of the protein encoding them. Under the proposed classification, SH3 and SH2 domains occur in more than one superfamily. For instance, the kinases under the proposed classification would comprise the Btk, Src, and Fps modules, each containing a number of other domains along with the common kinase catalytic domain (Fig. 3).


Fig. 3. How does the system deal with mobile modules? There are mobile modules in the proposed classification system. For instance, genes (modules) having an SH2 domain can belong to more than 1 superfamily. A superfamily subset comprising genes sharing the SH2 domain is shown. The patterns common to the superfamily hold the superfamily together. The specificity in patterns coming from constituent domains in members of this superfamily helps us characterize specific gene product functions. For instance, the kinases would have patterns specific to their class. Furthermore, the various kinase subtypes would have patterns specific to subtypes.

 

    HIERARCHY IN FUNCTIONAL CLASSIFICATION
The specificity in the gene product function of a complete contemporary sequence is represented at the family level. Clustering of homologous sequences after dealing with gene duplication and fusion events unifies diverse sequences that share sparse signature residues representative of functional diversity among the superfamily members. Although superfamilies and families would be the two generic levels of hierarchy in this system, motif-based subtrees within families might represent further specificities in the biological function of gene products categorized in a family. We expect to see sharp boundaries between families when motifs are examined (as shown in Supplemental Fig. IV). Motifs are sequence patterns that represent the key functional residues.

The superfamilies always share a common fold. A number of approaches (7, 21, 34, 48, 52, 53) have been tested to infer functionally specific families; most of these are based on a multiple alignment of the superfamily. Moreover, many of them are tree-partitioning algorithms (34, 48) usually carried out on distance matrix-based trees. A phylogenetic tree is a model to study evolution of homologous sequences or domains, and it is at the gene product level where homology of complete sequences can be inferred. Tree-based methods to partition domain families are restricted to generating subfamilies of homologous domains, besides which the subfamilies need not be functional, as we associate a protein's function with its gene product. Many of these tree-based approaches to defining functionally specific families are likely to be more useful under the proposed superfamily classification, for instance, the case of amino-transferases discussed above.


    THE ESCHERICHIA COLI STORY
The E. coli K12 genome is a widely studied model system. We compared our definitions (at a log e-value cutoff of −5 and a terminal cutoff of 40) with the manually defined and annotated data available from GenProtEC (46). Eighty-four percent (77 genes) of the fused genes reported at GenProtEC were successfully recovered by our automated protocol. Interestingly, according to our study there are around 632 fused genes (1,427 modules) in the E. coli K12 genome, whereas Serres et al. (46) noted that there are 107 fused genes in E. coli, which are made up of 221 component proteins (modules). Even among the 84% of fusions that overlap with the Riley data set, we predicted new modules in many fused genes that could not be predicted in the earlier manual study. These new observations are likely due to the fact that our study was based on a much larger data set spanning all complete genomes, which helps define new modules, whereas Serres et al. (46) defined their modules by comparing prokaryotic genomes.
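
For readers who want to put the two E. coli counts side by side, the following short calculation uses only the figures quoted above; the variable names are ours and nothing else is assumed.

```python
# Counts quoted in the text for E. coli K12.
automated = {"fused_genes": 632, "modules": 1427}   # this study
manual = {"fused_genes": 107, "modules": 221}       # Serres et al. (46)

# Average number of modules per fused gene in each data set.
modules_per_fusion_automated = automated["modules"] / automated["fused_genes"]
modules_per_fusion_manual = manual["modules"] / manual["fused_genes"]

# How many times more fused genes the automated protocol reports.
fold_more_fusions = automated["fused_genes"] / manual["fused_genes"]

print(round(modules_per_fusion_automated, 2))   # 2.26
print(round(modules_per_fusion_manual, 2))      # 2.07
print(round(fold_more_fusions, 2))              # 5.91
```

Both data sets average slightly more than two modules per fused gene, so the difference between the studies lies mainly in how many fused genes are detected rather than in how finely each one is divided.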

The complete set of modules including the E. coli K12 modules is available for download at http://ekhidna.biocenter.helsinki.fi/sms/downloads/.


    DISCUSSION
Thomas and colleagues (37, 49) implemented a gene product classification system (Panther) in which in-house curators manually define HMM libraries. We propose an automatic hierarchical classification system, which would present specificity of gene product function with only sequence information. Our families can be very diverse in sequence, as long as they retain the functional motif. The proposed pipeline approach leading to a hierarchical functional classification system would help address the following key issues for the research community.

Improved functional annotation system with levels of hierarchy.
There are various sources of annotation errors as discussed above. The proposed functional classification system would be able to group proteins having the same gene product function at the family level.

This classification system for component proteins can be effectively integrated with expert information from genome-specific databases such as GenProtEC (46) or Ecocyc (26) to assign generic names to superfamilies, wherever information is available. Gene product annotations at family levels could be assigned to those families having sequences whose gene product annotations are available through expert genome-specific annotation sources. Functional predictions made with this system could be validated with structural data wherever available. The robustness in the proposed system is elucidated by the fact that the basis of this functional classification comes from group-specific motifs and residues at the superfamily and family levels of hierarchy. All modules belonging to a superfamily would share a common fold, and all members of a family would have gene products performing identical function under GO (6, 31). This will also aid in automatic annotation of data coming from new genome projects.

A system for studying evolution of superfamilies.
During the course of evolution, a set of single-domain proteins recruited new functions to form the contemporary set of proteins, which are both single- and multidomain proteins. The contemporary proteins have a modular architecture. Thus this necessitates the study of homology at the module level rather than the domain level. Sequences belonging to a domain family are homologous domains but are not necessarily homologous sequences, whereas the members of a superfamily under the proposed system are expected to be homologous sequences with diverse functional roles. Given that evolution preserves modules encoding specific biological functions (51), this would be a system that would be able to characterize these modules, which define a complete function. The family level represents functional specificity of gene products, whereas the superfamilies are representative of folds supporting contemporary functions. Thus, under the proposed classification system, the number of families might represent the number of essential contemporary functions. Theoretically the system could be a good starting point in the characterization of last universal common ancestor (LUCA) because the number of superfamilies could correspond with the number of folds supporting the variety of functional repertoire in contemporary genomes.

Functional residues: the complete set.
Here we define functional families as a subset of sequences within a superfamily likely to have a distinct function compared with the rest of the superfamily. Domain families coincide with functional families only in the case of module families comprising single-domain proteins. Moreover, with domain family approaches, in the drug discovery context it would be difficult to give the complete set of functional residues likely to be responsible for a protein's function at the gene product level. In contrast, with a superfamily of homologous sequences as in the proposed system, we can expect the families to be functional.

For instance, SH2 domains (discussed above), which are usually present in proteins involved in signal transduction, specifically recognize the phosphorylated state of tyrosine residues. This helps in localization of SH2 domain-containing proteins to tyrosine-phosphorylated sites. The interaction of other constituent domains with the SH2 domains in such proteins elucidates the functional distinctiveness of such proteins. In the proposed system this functional distinctiveness can be characterized in terms of family-specific sequence patterns. Another interesting case is that of the amino-transferases (Supplemental Figs. II and III). Here, the proteins involved in porphyrin synthesis do align in part with a complete protein from another genome (for example, the proteins involved in the threonine biosynthesis). But the proteins involved in porphyrin synthesis are still unimodular because there is no other contemporary protein aligning to the other nonoverlapping region. This scenario is likely to be a case of domain shuffling where a new function has been recruited by the amino-transferase subclasses involved in porphyrin synthesis. This is an example of a superfamily including both single- and multidomain proteins. Clustering of such superfamilies into families will give functional families with the complete set of residues, which are likely to be of functional importance.

The conserved residues within a family are the complete set of putatively functional residues responsible for the specificity of function at the gene product level. Functional residues are far less susceptible to mutation events than nonfunctional residues during evolution. Given a superfamily, patterns or motifs specific to a given family are likely to represent functional patterns encoding functional residues. We refer to these as specificity patterns or motifs. Specificity-determining residues (SDR; also referred to as specificity-determining positions) are positions that are well conserved within a group but differ between groups. Given a superfamily, SDR for constituent families can be mined with many approaches (19, 38). Those SDR found within the specificity motifs are likely to be the functional residues.
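
The definition of specificity-determining positions given above can be illustrated with a deliberately strict toy criterion: a column qualifies if it is invariant within every family and the invariant residue differs between at least two families. This is a simplification for illustration, not one of the published methods (19, 38); the function name and the aligned toy sequences are hypothetical.

```python
def specificity_determining_positions(families):
    """Return 0-based alignment columns conserved within each family but differing between families.

    `families` maps a family name to a list of aligned sequences of equal length.
    """
    first_family = next(iter(families.values()))
    n_columns = len(first_family[0])
    positions = []
    for i in range(n_columns):
        residues = []
        for sequences in families.values():
            column = {seq[i] for seq in sequences}
            if len(column) != 1 or "-" in column:
                break                      # not invariant (or gapped) in this family
            residues.append(next(iter(column)))
        else:
            # Reached only if every family is invariant at column i.
            if len(set(residues)) > 1:
                positions.append(i)
    return positions


# Toy alignment of two families within one superfamily (made-up sequences).
toy_families = {
    "family_1": ["MKDLG", "MKDIG", "MKDLG"],
    "family_2": ["MRELG", "MREVG", "MRELG"],
}
print(specificity_determining_positions(toy_families))  # [1, 2]
```

Columns 0 and 4 are conserved across the whole superfamily and column 3 varies within each family, so only columns 1 and 2 separate the two families in this example.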


    GENE FUSION EVENTS AS AN APPROACH TO STUDY FUNCTIONAL SPECIFICITY AND INTERACTIONS
Pairs of proteins are generally observed to be interacting and/or functionally associated if they are found as separate functional proteins in one genome while existing as components of a larger fused protein in another genome (14, 36). Both Marcotte et al. (36) and Ouzounis and colleagues (14) performed earlier studies on domain and gene fusion events, respectively, that were aimed at predicting new interacting and functionally associated proteins. Their studies were done during a period when the pace of the genome projects was just beginning to pick up. They reported a list of predicted interacting pairs of proteins based on observations of stand-alone genes in one genome found as unique components of a fused gene (Rosetta stone sequence) in another genome. In our study, for the proposed classification system, we break up the protein universe into the so-called module space after decomposing fused genes into their respective components (modules), each of which exists as an individual protein in its own right. Consequently, these Rosetta stone sequences are basically a subset of the module space.
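
Once composite proteins have been decomposed into modules, candidate interacting pairs can be read off directly. The sketch below is one possible reading of the Rosetta stone idea and rests on assumptions: the function name `rosetta_stone_pairs` and the input layout are hypothetical, and a pair is reported only when both stand-alone proteins come from the same genome and that genome differs from the one encoding the fused protein.

```python
from itertools import combinations


def rosetta_stone_pairs(composites, genome_of):
    """Predict functionally linked pairs of stand-alone proteins.

    `composites` maps a fused protein to a list of modules, each module being
    the set of stand-alone proteins that align to it.
    `genome_of` maps every protein identifier to its genome.
    """
    predictions = set()
    for composite, modules in composites.items():
        for module_a, module_b in combinations(modules, 2):
            for p in module_a:
                for q in module_b:
                    same_genome = genome_of[p] == genome_of[q]
                    other_genome = genome_of[p] != genome_of[composite]
                    if p != q and same_genome and other_genome:
                        predictions.add((composite, *sorted((p, q))))
    return sorted(predictions)


# Toy example (identifiers and genomes are made up).
toy_composites = {"fused_X": [{"a1", "b1"}, {"a2", "b2"}]}
toy_genomes = {
    "fused_X": "genome_C",
    "a1": "genome_A", "a2": "genome_A",
    "b1": "genome_B", "b2": "genome_B",
}
print(rosetta_stone_pairs(toy_composites, toy_genomes))
# [('fused_X', 'a1', 'a2'), ('fused_X', 'b1', 'b2')]
```

Each reported triple names the fused protein that serves as the Rosetta stone and the two stand-alone proteins it links.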

Interestingly, our study, using an E-value threshold of 10⁻⁵, shows that 10% (Table 1) of the genes arise from gene fusion (or fission) events. That is a significant number! Even allowing for the possibility that a fraction of the sequences in public databases are products of sequence errors, this gives us an altogether new perspective on the functional repertoire of the protein universe. It will be interesting to study Rosetta stone sequences from the module space, which should be a rich source of new putative interacting proteins.


    CONCLUSIONS
 
It is possible to reconstruct a natural evolutionary scenario of contemporary proteins by constructing superfamilies of homologous sequences. Moving beyond molecular function (GO definitions), the proposed approach for constructing superfamilies would give important insights at the level of gene products (6, 31), which would be a key development in functional genomics. In contrast to existing sequence-space classifications of homologous domains, the evolutionary scenario presented by the proposed approach could aid the drug discovery process by suggesting better model systems and better targets. Another welcome development would be a systematic annotation protocol, made possible by the hierarchical nature of the classification. As remote homology detection (22, 25, 32, 44) continues to improve, this superfamily classification could aid research on ancestral proteins and contribute to the search for LUCA, as well as to the enumeration of theoretically essential functions in the present genomic diversity.


    GRANTS
 
This work was supported by the Academy of Finland (grant number 1105182).


    ACKNOWLEDGMENTS
 
We thank Margrethe Serres (Bay Paul Center for Molecular Biology and Evolution) and Swapan Mallick and other members of the Structural Genomics Group, Institute of Biotechnology, University of Helsinki for useful discussions.


    FOOTNOTES
 
Article published online before print. See web site for date of publication (http://physiolgenomics.physiology.org).

Address for reprint requests and other correspondence: A. Sivakumar and L. Holm, Institute of Biotechnology and Dept. of Genetics, Univ. of Helsinki, PO Box 56 (Viikinkaari 5), 00014 Helsinki, Finland (e-mail: ashwin.sivakumar{at}helsinki.fi, liisa.holm{at}helsinki.fi).

1 The Supplemental Material for this article (Supplemental Figs. I–IV) is available online at http://physiolgenomics.physiology.org/cgi/content/full/00166.2005/DC1.


    REFERENCES
 

  1. Apweiler R, Bairoch A, Wu CH, Barker WC, Boeckmann B, Ferro S, Gasteiger E, Huang H, Lopez R, Magrane M, Martin MJ, Natale DA, O'Donovan C, Redaschi N, and Yeh LS. UniProt: the universal protein knowledgebase. Nucleic Acids Res 32: D115–D119, 2004.
  2. Bateman A, Coin L, Durbin R, Finn RD, Hollich V, Griffiths-Jones S, Khanna A, Marshall M, Moxon S, Sonnhammer EL, Studholme DJ, Yeats C, and Eddy SR. The Pfam protein families database. Nucleic Acids Res 32: D138–D141, 2004.
  3. Benson DA, Karsch-Mizrachi I, Lipman DJ, Ostell J, and Wheeler DL. GenBank: update. Nucleic Acids Res 32: D23–D26, 2004.
  4. Brenner SE. Errors in genome annotation. Trends Genet 15: 132–133, 1999.
  5. Brenner SE, Chothia C, and Hubbard TJ. Population statistics of protein structures: lessons from structural classifications. Curr Opin Struct Biol 7: 369–376, 1997.
  6. Camon E, Barrell D, Lee V, Dimmer E, and Apweiler R. Gene Ontology Annotation Database: an integrated resource of GO annotations to the UniProt Knowledgebase. In Silico Biol 4: 5–6, 2004.
  7. Casari G, Sander C, and Valencia A. A method to predict functional residues in proteins. Nat Struct Biol 2: 171–178, 1995.
  8. Chothia C. Proteins. One thousand families for the molecular biologist. Nature 357: 543–544, 1992.
  9. Chumanevich AA, Davies C, and Krupenko SA. Crystallization and preliminary X-ray diffraction analysis of recombinant hydrolase domain of 10-formyltetrahydrofolate dehydrogenase. Acta Crystallogr D58: 1841–1842, 2002.
  10. De la Cruz F and Davies J. Horizontal gene transfer and the origin of species: lessons from bacteria. Trends Microbiol 8: 128–133, 2000.
  11. Doolittle RF and Bork P. Evolutionarily mobile modules in proteins. Sci Am 269: 50–56, 1993.
  12. Enright AJ, Iliopoulos I, Kyrpides NC, and Ouzounis CA. Protein interaction maps for complete genomes based on gene fusion events. Nature 402: 86–90, 1999.
  13. Fioretos T, Panagopoulos I, Lassen C, Swedin A, Billstrom R, Isaksson M, Strombeck B, Olofsson T, Mitelman F, and Johansson B. Fusion of the BCR and the fibroblast growth factor receptor-1 (FGFR1) genes as a result of t(8;22)(p11;q11) in a myeloproliferative disorder: the first fusion gene involving BCR but not ABL. Genes Chromosomes Cancer 32: 302–310, 2001.
  14. Galperin MY and Koonin EV. Sources of systematic error in functional annotation of genomes: domain rearrangement, non-orthologous gene displacement and operon disruption. In Silico Biol 1: 55–67, 1998.
  15. Gerlt JA and Babbitt PC. Divergent evolution of enzymatic function: mechanistically diverse superfamilies and functionally distinct suprafamilies. Annu Rev Biochem 70: 209–246, 2001.
  16. Gogarten JP, Doolittle WF, and Lawrence JG. Prokaryotic evolution in light of gene transfer. Mol Biol Evol 19: 2226–2238, 2002.
  17. Gu X. Functional divergence in protein (family) sequence evolution. Genetica 118: 133–141, 2003.
  18. Heger A, Wilton CA, Sivakumar A, and Holm L. ADDA: a domain database with global coverage of the protein universe. Nucleic Acids Res 33: D188–D191, 2005.
  19. Heger A and Holm L. Sensitive pattern discovery with "fuzzy" alignments of distantly related proteins. Bioinformatics 19, Suppl 1: I130–I137, 2003.
  20. Heger A, Lappe M, and Holm L. Accurate detection of very sparse sequence motifs. J Comput Biol 11: 843–857, 2004.
  21. Hegyi H and Gerstein M. Annotation transfer for genomics: measuring functional divergence in multi-domain proteins. Genome Res 11: 1632–1640, 2001.
  22. Horiuchi Y, Kawaguchi H, Figueroa F, O'hUigin C, and Klein J. Dating the primigenial C4-CYP21 duplication in primates. Genetics 134: 331–339, 1993.
  23. Ben-Hur A and Brutlag D. Remote homology detection: a motif based approach. Bioinformatics 19, Suppl 1: 26–33, 2003.
  24. Keseler IM, Collado-Vides J, Gama-Castro S, Ingraham J, Paley S, Paulsen IT, Peralta-Gil M, and Karp PD. EcoCyc: a comprehensive database resource for Escherichia coli. Nucleic Acids Res 33: D334–D337, 2005.
  25. Koch CA, Anderson D, Moran MF, Ellis C, and Pawson T. SH2 and SH3 domains: elements that control interactions of cytoplasmic signaling proteins. Science 252: 668–674, 1991.
  26. Krauss V and Reuter G. Two genes become one: the genes encoding heterochromatin protein SU(VAR)3–9 and translation initiation factor subunit eIF-2 are joined to a dicistronic unit in holometabolic insects. Genetics 156: 1157–1167, 2000.
  27. Kumar S, Tamura K, and Nei M. MEGA3: integrated software for Molecular Evolutionary Genetics Analysis and sequence alignment. Brief Bioinform 5: 150–163, 2004.
  28. Lawrence JG. Gene transfer in bacteria: speciation without species? Theor Popul Biol 61: 449–460, 2002.
  29. Lee V, Camon E, Dimmer E, Barrell D, and Apweiler R. Who tangos with GOA? Use of Gene Ontology Annotation (GOA) for biological interpretation of "-omics" data and for validation of automatic annotation tools. In Silico Biol 5: 5–8, 2005.
  30. Li W, Jaroszewski L, and Godzik A. Sequence clustering strategies improve homology recognitions while reducing search times. Protein Eng 15: 643–649, 2002.
  31. Liang P, Labedan B, and Riley M. Physiological genomics of Escherichia coli protein families. Physiol Genomics 9: 15–26, 2002.
  32. Lichtarge O, Yao H, Kristensen DM, Madabushi S, and Mihalek I. Accurate and scalable identification of functional sites by evolutionary tracing. J Struct Funct Genomics 4: 159–166, 2003.
  33. Long M, Betran E, Thornton K, and Wang W. The origin of new genes: glimpses from the young and old. Nat Rev Genet 4: 865–875, 2003.
  34. Marcotte EM, Pellegrini M, Ng HL, Rice DW, Yeates TO, and Eisenberg D. Detecting protein function and protein-protein interactions from genome sequences. Science 285: 751–753, 1999.
  35. Mi H, Lazareva-Ulitsky B, Loo R, Kejariwal A, Vandergriff J, Rabkin S, Guo N, Muruganujan A, Doremieux O, Campbell MJ, Kitano H, and Thomas PD. The Panther database of protein families, sub-families, function and pathways. Nucleic Acids Res 33: D284–D288, 2005.
  36. Mirny LA and Gelfand MS. Using orthologous and paralogous proteins to identify specificity-determining residues in bacterial transcription factors. J Mol Biol 321: 7–20, 2002.
  37. Nahum LA and Riley M. Divergence of function in sequence-related groups of Escherichia coli proteins. Genome Res 11: 1375–1381, 2001.
  38. Pawson T and Schlessinger J. SH2 and SH3 domains. Curr Biol 3: 434–442, 1993.
  39. Pawson T. Specificity in signal transduction: from phosphotyrosine-SH2 domain interactions to complex cellular systems. Cell 116: 191–203, 2004.
  40. Rigoutsos I and Floratos A. Combinatorial pattern discovery in biological sequences: the TEIRESIAS algorithm. Bioinformatics 14: 55–67, 1998.
  41. Riley M and Labedan B. Protein evolution viewed through Escherichia coli protein sequences: introducing the notion of a structural segment of homology, the module. J Mol Biol 268: 857–868, 1997.
  42. Schaffer AA, Aravind L, Madden TL, Shavirin S, Spouge JL, Wolf YI, Koonin EV, and Altschul SF. Improving the accuracy of PSI-BLAST protein database searches with composition-based statistics and other refinements. Nucleic Acids Res 29: 2994–3005, 2001.
  43. Schindler T, Bornmann W, Pellicena P, Miller TW, Clarkson B, and Kuriyan J. Structural mechanism for STI-571 inhibition of Abelson tyrosine kinase. Science 289: 1938–1942, 2000.
  44. Serres MH, Goswami S, and Riley M. GenProtEC: an updated and improved analysis of functions of Escherichia coli K-12 proteins. Nucleic Acids Res 32: D300–D302, 2004.
  45. Serres M and Riley M. Structural domains, protein modules, and sequence similarities enrich our understanding of the Shewanella oneidensis MR-1 proteome. OMICS 8: 306–321, 2004.
  46. Sjolander K. Phylogenetic inference in protein superfamilies: analysis of the SH2 domains. Proc Int Conf Intell Syst Mol Biol 6: 165–174, 1998.
  47. Thomas PD, Kejariwal A, Campbell MJ, Mi H, Diemer K, Guo N, Ladunga I, Ulitsky-Lazareva B, Muruganujan A, Rabkin S, Vandergriff JA, and Doremieux O. Panther: a browsable database of gene product organized by biological function, using curated protein family and sub-family classification. Nucleic Acids Res 31: 334–341, 2003.
  48. Thomson TM, Lozano JJ, Loukili N, Carrio R, Serras F, Cormand B, Valeri M, Diaz VM, Abril J, Burset M, Merino J, Macaya A, Corominas M, and Guigo R. Fusion of the human gene for the polyubiquitination coeffector UEV1 with Kua, a newly identified gene. Genome Res 10: 1655–1657, 2000.
  49. Vespignani A. Evolution thinks modular. Nat Genet 35: 118–119, 2003.
  50. Wicker N, Dembele D, Raffelsberger W, and Poch O. Density of points clustering, application to transcriptomic data analysis. Nucleic Acids Res 30: 3992–4000, 2002.
  51. Wicker N, Perrin GR, Thierry JC, and Poch O. Secator: a program for inferring protein sub-families from phylogenetic trees. Mol Biol Evol 18: 1435–1441, 2001.
  52. Wu CH, Yeh LSL, Huang H, Arminski L, Castro-Alvear J, Chen Y, Hu ZZ, Kourtesis P, Ledley RS, Suzek BE, Vinayaka CR, Zhang J, and Barker WC. The Protein Information Resource. Nucleic Acids Res 31: 345–347, 2003.


