Both the small GTPase superfamily and the superfamily of its activating proteins (GAPs) exist in various eukaryotes. The small GTPases regulate a wide variety of cellular processes by cycling between active GTP-bound and inactive GDP-bound conformations. The GAPs promote GTPase inactivation by stimulating GTP hydrolysis. In this study, we identified 111 small GTPases and 85 GAPs in rice, 65 GAPs in Arabidopsis, 90 small GTPases in Drosophila melanogaster, and 35 GAPs in Saccharomyces cerevisiae by genome-wide analysis. We then analyzed and compared a total of 498 small GTPases and 422 GAPs from these four eukaryotic genomes and the human genome. Both animal and yeast genomes contained five families of small GTPases and their GAPs. However, plants had only four of these five families because they lack the Ras and RasGAP genes. Small GTPases were conserved with common motifs, but GAPs exhibited higher and much more rapid divergence. On the basis of phylogenetic analysis of all small GTPases and GAPs in five eukaryotic organisms, we estimated that their ancestors had small numbers of small GTPases and GAPs and that large-scale expansions occurred after the divergence from their ancestors. Further investigation showed that genome duplications represented the major mechanism for such expansions. Analyses of nonsynonymous substitutions per site (Ka) and synonymous substitutions per site (Ks) showed that most of the divergence due to positive selection occurred in common ancestors, suggesting a major functional divergence in an ancient era.
- gene expansion
- genome-wide analysis
Small GTPases are GTP-binding proteins with molecular masses of 20–40 kDa (10). They exist ubiquitously in eukaryotes and constitute a superfamily (70). These proteins have been implicated in nearly all cellular processes (9). The members of this superfamily are structurally and functionally classified into at least five families: Ras, Rho, Rab, Arf, and Ran. The Ras (rat sarcoma) genes were first identified as oncogenes in retroviruses (32). Ras GTPases are now regarded as regulators of gene expression (70). Most Ras proteins are localized to the plasma membrane, and some of them have been shown to contain fatty acid acylation signals (16) and to act as functional signal transducers in the endoplasmic reticulum-Golgi complex (14). The Rho (Ras homolog) GTPases are most closely related to members of the Ras family. These proteins have been shown to be involved in cytoskeletal reorganization, cell polarity maintenance, and gene expression regulation (24, 33). The Rab (rat brain) proteins were first identified as Ras-related genes in rats (74). They were shown to play important roles in vesicle trafficking (63). The Arf GTPases were first identified as ADP ribosylation factors (39). Another subgroup of this family is Sar, isolated as a multiple copy suppressor of the SEC12 mutant (51). Both Arf and Sar proteins regulate the trafficking of intracellular proteins and membranes and are also involved in remodeling of the actin cytoskeleton (57, 59). The Ran (Ras-related nuclear) protein was originally isolated as a homolog of Ras proteins (17). This protein plays important roles in nucleocytoplasmic transport and microtubule organization (49, 79). Aside from the five families of small GTPases mentioned above, additional proteins with GTPase functions have been reported. However, they fall outside these families in structure and function (16).
Those “distant” GTPases are generally larger in molecular weight and also do not include the lipid modification signals detected in most of the members of these families (16). For example, the Gtr1_RagA family proteins show some similarity to the Arf proteins, but they fall outside the superfamily; nucleostemin family members have GTP-binding activity but no apparent GTPase enzymatic function (16). Therefore, only five families of small GTPases were analyzed in this study.
Despite their divergence in biological functions, all members of these five small GTPase families share a basic biochemical activity: GTP binding (active) and hydrolysis of GTP to GDP (inactive). The interchange between these two forms is regulated by other cellular factors. Among them, the GTPase activating protein (GAP) is one of the key regulators. The GAPs promote GTPase inactivation by facilitating GTP hydrolysis, thereby acting as a negative regulator. Corresponding to small GTPase families, GAPs were also classified into five families: RasGAP, RhoGAP, RabGAP, ArfGAP, and RanGAP. They all contain their own catalytic domains which share no obvious similarities at the amino acid level. Furthermore, most of the GAPs contain numerous other functional domains and potential phosphorylation sites, indicating the complexity in the regulation of GTPase activity.
The evolution of gene families may significantly contribute to the understanding of a genome or species. The small GTPase and its GAP superfamilies contain ∼0.7% of human genes, based on the numbers of protein-coding genes in the human genome (36), indicating the contribution of these two superfamilies to the whole genome evolution. Although analyses of the Rab GTPase family evolution have been carried out (56, 69), little is known about the evolution of the small GTPase superfamily. As for the GAP superfamily, data showed high divergence among five types of GAPs. For example, both RhoGAP and RasGAP showed no sequence homology at their amino acid level; however, they possessed a structural similarity, suggesting that they had evolved from a common ancestor (5, 61). To our knowledge, no data are reported on the evolutionary analysis of the GAP superfamily in plants. In addition, differences in the family size of eukaryotes raise questions about the evolution of the small GTPase and GAP superfamilies.
Here, we report the identification of five families of small GTPase and their GAP genes in five eukaryotic organisms based on their complete genome sequence analysis. We then classified them by phylogenetic analysis using their domain amino acid sequences and assigned possible functions to uncharacterized genes based on known functions of their homologs from within or in other species. We also estimated the numbers of these two superfamilies before their divergence among multiple organisms. In addition to this, we evaluated the contribution of genome duplications to the evolution of these two superfamilies. Finally, we analyzed the substitution rates of the domain regions in an attempt to uncover the selection pressures that shaped these two large superfamilies during evolution.
MATERIALS AND METHODS
We used several approaches to search for and predict small GTPase and GAP genes. The representative amino acid sequences were obtained from the Pfam database (http://www.sanger.ac.uk/Software/Pfam/), and their corresponding domain sequences were used as query sequences. For all small GTPase and GAP genes from different organisms, both TBLASTN and BLASTP searches were conducted on their corresponding databases. For rice, the following databases were used: Rice Genome Research Program (RGP; http://rgp.dna.affrc.go.jp/), The Institute for Genomic Research (TIGR; http://tigrblast.tigr.org/), Gramene (http://www.gramene.org/), Oryzabase (http://www.shigen.nig.ac.jp/rice/oryzabase/top/top.jsp), National Center for Biotechnology Information (NCBI; http://www.ncbi.nlm.nih.gov), and DNA Data Bank of Japan (DDBJ; http://www.ddbj.nig.ac.jp). Rice Genome Automated Annotation System (RiceGAAS; http://ricegaas.dna.affrc.go.jp) and the New GENSCAN Web Server at MIT (http://genes.mit.edu/GENSCAN.html) were used for predicting coding regions and their proteins. For Arabidopsis, all predicted genes were obtained by searching TIGR, The Arabidopsis Information Resource (TAIR; http://www.arabidopsis.org/), and Munich Information Center for Protein Sequences (MIPS) Arabidopsis thaliana Database (MAtDB; http://mips.gsf.de/proj/thal/) databases. For Drosophila melanogaster, both FlyBase (http://www.flybase.org/) and NCBI databases were used for BLAST searches. For Saccharomyces cerevisiae, both the Saccharomyces Genome Database (http://www.yeastgenome.org/) and NCBI were used for the genome-wide searches. All predicted genes were then used as queries for a further round of similarity searches to confirm the predictions and to detect new candidates. In addition, the above databases were searched using “small GTPase” and “activating protein” as keywords to retrieve additional genes.
Identification of domains in the predicted genes.
The Pfam program was used with E-value = 0.01 as the cutoff to confirm the presence of conserved domains in the predicted genes and to obtain the domain sequences used for phylogenetic analysis. BLASTP at NCBI and the InterPro database (http://www.ebi.ac.uk/interpro/) were also employed to detect conserved domains. Proteins confirmed by the domain searches were regarded as putative small GTPases or GAPs (referred to as small GTPases or GAPs hereafter for convenience); all others were excluded from our data set.
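The domain-confirmation step can be pictured as a simple filter on domain-search hits. The following is a minimal sketch only: the `DomainHit` record, the gene names, and the choice to treat the cutoff as inclusive (E-value ≤ 0.01) are illustrative assumptions, not details given in the text.

```python
from typing import NamedTuple


class DomainHit(NamedTuple):
    gene: str      # hypothetical gene identifier
    domain: str    # hypothetical domain name
    evalue: float  # E-value reported by the domain search


E_VALUE_CUTOFF = 0.01  # cutoff stated in the text


def confirmed_genes(hits, cutoff=E_VALUE_CUTOFF):
    """Return genes with at least one domain hit at or below the cutoff.

    Treating the cutoff as inclusive is an assumption of this sketch.
    """
    return {h.gene for h in hits if h.evalue <= cutoff}


# Toy hits (invented for illustration).
hits = [
    DomainHit("geneA", "Ras", 1e-40),
    DomainHit("geneB", "RhoGAP", 0.5),
    DomainHit("geneC", "ArfGap", 0.009),
]
kept = confirmed_genes(hits)  # geneB is excluded from the data set
```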
Chromosomal mapping and detection of duplicated genes.
For rice small GTPase and GAP genes, the chromosomal distribution of the predicted sequences was determined by searching the map positions of the corresponding YAC or BAC clones, using the RGP and Gramene databases. Data from references (28, 83, 84) were used to determine whether any mapped genes were located in the duplicated regions. For Arabidopsis, both TAIR and MAtDB databases and the references (8, 71) were used for mapping and determining the presence of a gene in the duplicated regions of the genome. Both the Human Genome Segmental Duplication (Ref. 13; http://projects.tcag.ca/humandup/) and University of California Santa Cruz (UCSC) Genome Bioinformatics (Ref. 40; http://genome.ucsc.edu/) databases were used for mapping and detecting duplicated genes in the human genome. The FlyBase Genome Browser (Ref. 18; http://flybase.bio.indiana.edu/cgi-bin/gbrowse_fb/dmel) and the yeast gene duplications database (Ref. 81; http://acer.gen.tcd.ie/~khwolfe/yeast/) were used for the mapping and detection of duplicated genes in D. melanogaster and S. cerevisiae genomes, respectively.
Sequence alignment and phylogenetic analysis.
Preliminary sequence manipulations were performed using the DNASTAR program. Only domain sequences were used for further investigation. The sequence alignment was generated using ClustalX (version 1.8) (72), with manual adjustment, and shaded with the MacBoxshade 2.15 program (http://www.isrec.isb-sib.ch/ftp-server/boxshade/MacBoxshade/; http://www.ch.embnet.org/index.html). The aligned amino acid sequences formed the basis for the phylogenetic analysis. Maximum-parsimony (MP) and maximum-likelihood (ML) analyses were performed using the program Mac PAUP 4.0b8 (ppc) (http://www.paup.csit.fsu.edu). The heuristic search option, with MULPARS, tree-bisection-reconnection (TBR) branch swapping, ACCTRAN (accelerated transformation) optimization, and random addition with 100 replicates, was used to find the best tree. Bootstrap analysis was used to assess support for the MP and ML trees. Bootstrap supports (BS) of specific nodes (20) were estimated with 500 replicates for ML analysis, and 1,000 for MP, under default options as implemented in the PAUP program. Nodes with BS values >70% were considered to be supported significantly with 95% probability (34).
Bayesian searches were performed using the MrBayes 3.1 program (Ref. 35; http://mrbayes.csit.fsu.edu/). The WAG evolutionary model was developed from Dayhoff and JTT models and was used for the Bayesian analyses, due to its best attributes in assessing phylogenies from large numbers of sequences with many different families (78). Two independent Bayesian analyses were performed with four Markov chains per analysis, each consisting of seven million generations with trees sampled every 100 generations. The first 250 trees were discarded for calculation of the consensus trees, based on the stationary phase. The remaining trees were imported into the PAUP program to construct 50% majority rule trees. The frequencies of clades on the 50% majority consensus tree provided the posterior probability support for the clades.
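As an illustration of how clade frequencies on the majority-rule consensus translate into posterior support, the sketch below counts how often each clade appears among the trees retained after burn-in. It is a toy model under stated assumptions: each sampled tree is represented simply as a set of clades (frozensets of taxon names), and the function names are hypothetical; it is not the MrBayes or PAUP procedure itself. For scale, sampling every 100 generations over seven million generations yields roughly 70,000 trees per run.

```python
from collections import Counter


def clade_support(sampled_trees, burn_in):
    """Fraction of post-burn-in trees that contain each clade.

    Each tree is a set of clades; each clade is a frozenset of taxon names.
    """
    kept = sampled_trees[burn_in:]
    if not kept:
        raise ValueError("burn-in discards every sampled tree")
    counts = Counter(clade for tree in kept for clade in tree)
    return {clade: n / len(kept) for clade, n in counts.items()}


def majority_rule(support, threshold=0.5):
    """Keep clades found in more than `threshold` of the retained trees."""
    return {clade: p for clade, p in support.items() if p > threshold}


# Toy sample of five trees over taxa A-D (invented for illustration).
ab = frozenset({"A", "B"})
cd = frozenset({"C", "D"})
ac = frozenset({"A", "C"})
trees = [{ac}, {ab, cd}, {ab, cd}, {ab, cd}, {ac}]

support = clade_support(trees, burn_in=1)  # first tree discarded as burn-in
consensus = majority_rule(support)
```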
Estimation of Ka/Ks ratios and analysis of functional divergence.
For calculation of Ka/Ks ratios (where Ka = nonsynonymous substitutions per site, and Ks = synonymous substitutions per site), domain amino acid sequences were aligned first and subsequently were transferred to original cDNA sequences. Ka/Ks values were then calculated using the yn00 program of the PAML package as described (82). The Ka/Ks ratios were also used for evaluating the functional divergence by testing the C-value [C = (X − 0.5N)/(0.5X)] as reported earlier (73). Full-length cDNA sequences were obtained to confirm that the majority of the small GTPase and GAP genes were functional by searching the following databases: Knowledge-Based Oryza Molecular Biological Encyclopedia (KOME) (Ref. 42; http://cdna01.dna.affrc.go.jp/cDNA/), RIKEN Arabidopsis and Genome Encyclopedia (RARGE) (Ref. 64; http://rarge.gsc.riken.jp/index.html), the human full-length cDNA (FLJ-DB) (Ref. 54; http://fldb.hri.co.jp/cgi-bin/cDNA3/public/publication/index.cgi), and Drosophila complementary DNA resource (Ref. 62; http://www.fruitfly.org/) databases.
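The step of transferring a protein alignment back onto the cDNA can be sketched as threading codons onto one gapped protein row. This is an illustrative sketch only: the function name and the toy sequences are assumptions, and the text does not state which tool was used for this step. The resulting codon alignment is the kind of input a Ka/Ks program expects.

```python
def protein_to_codon_alignment(aligned_protein, cdna):
    """Thread an ungapped coding sequence onto a gapped protein alignment row.

    Each residue consumes one codon; each gap ('-') becomes three gap characters.
    """
    n_residues = sum(1 for ch in aligned_protein if ch != "-")
    if len(cdna) != 3 * n_residues:
        raise ValueError("cDNA length must be three times the number of residues")
    codons = []
    pos = 0
    for ch in aligned_protein:
        if ch == "-":
            codons.append("---")
        else:
            codons.append(cdna[pos:pos + 3])
            pos += 3
    return "".join(codons)


# Toy example: protein row 'MK-L' with its 9-nt coding sequence.
row = protein_to_codon_alignment("MK-L", "ATGAAACTG")
```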
RESULTS
Genome-wide identification of small GTPase and GAP genes in eukaryotes.
To survey the small GTPase and GAP genes in eukaryotes, five different organisms were selected. They are Oryza sativa (monocot plant), A. thaliana (dicot plant), Homo sapiens (vertebrate), D. melanogaster (invertebrate), and S. cerevisiae (fungus). After multiple cycles of searches and domain analysis (see materials and methods), a total of 111 small GTPase and 85 GAP genes were detected from the rice genome (Fig. 1A and Supplemental Tables S1 and S2; Supplemental Materials are available at the Physiological Genomics web site). These genes were first identified by in silico genome-wide analyses, although a small portion were identified by experimental studies (3, 15, 41, 48, 53, 85). In the Arabidopsis genome, a total of 93 members of small GTPase genes have been identified and reported earlier (76). In the current study, we detected 65 GAP genes in the genome (Fig. 1A, Supplemental Table S3), of which 15 among 17 ArfGAP genes were previously identified (76). However, in the human genome, both small GTPase and GAP genes were predicted earlier (6, 16), and we have used these members for further analyses. In D. melanogaster, we identified 90 members of small GTPase genes (Fig. 1A, Supplemental Table S4) in contrast to the 64 members of GAP genes identified previously (6). In S. cerevisiae, only 30 small GTPase genes were detected earlier (27, 70). However, our analysis revealed at least 35 GAP genes in the yeast genome (Fig. 1A, Supplemental Table S5).
A positive correlation can be inferred between genome size and the number of small GTPase and GAP genes per genome, although the relationship is not strictly proportional (Fig. 1B). Generally, the small GTPase gene superfamily is larger than the GAP gene superfamily. The ratio of GTPases to GAPs is 1.18:1 on average (Fig. 1A). However, yeast contains more GAPs than GTPases; thus the ratio is reduced to 0.86:1 (Fig. 1A). In addition, our analyses showed that the two superfamilies accounted for 0.22–0.87% and 0.17–0.87% of protein-coding genes, respectively, in eukaryotic genomes, indicating the large sizes of these superfamilies. Among them, the small GTPase genes of the rice genome comprised only 0.22–0.35% of all estimated rice protein-coding genes. The percentage was reduced to 0.17–0.27% for the GAP superfamily (Fig. 1A).
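The ratios quoted above can be checked directly from the gene counts reported in this section. The short sketch below assumes that the 1.18:1 average is the pooled ratio of all GTPases to all GAPs across the five genomes, which is the reading that reproduces the stated value; the variable names are illustrative.

```python
# (small GTPase genes, GAP genes) per genome, as reported in the text.
counts = {
    "rice": (111, 85),
    "Arabidopsis": (93, 65),
    "human": (174, 173),
    "Drosophila": (90, 64),
    "yeast": (30, 35),
}

total_gtpase = sum(g for g, _ in counts.values())
total_gap = sum(a for _, a in counts.values())

# Pooled GTPase:GAP ratio across all five genomes (assumed meaning of "average").
overall_ratio = total_gtpase / total_gap

# Per-genome ratios; yeast is the only genome with more GAPs than GTPases.
per_species = {name: g / a for name, (g, a) in counts.items()}
```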
Both rice and Arabidopsis genomes lack Ras small GTPases and their activating proteins.
To classify the small GTPase and GAP genes, we retrieved amino acid sequences of those genes that were identified previously in all five organisms. For the human small GTPase genes, RAC4 was not included as it was a pseudogene (16). Therefore, only 174 members were used for further analysis. In addition, 93 Arabidopsis small GTPase, 173 human GAP, 64 Drosophila GAP, and 30 S. cerevisiae small GTPase sequences were collected (Fig. 1A). These sequences, together with those detected in this study (Supplemental Tables S1–S5), were submitted to Pfam and NCBI databases for domain analysis. In total, 498 small GTPase and 422 GAP domain sequences from five organisms were analyzed in this study.
Phylogenetic trees were constructed on the basis of domain sequences of these genes from each organism. This analysis showed that the rice small GTPase and GAP genes could be classified into only four families each: 17 Rho and 23 RhoGAPs, 47 Rab and 25 RabGAPs, 43 Arf/Sar and 34 Arf/SarGAPs, as well as 4 Ran and 3 RanGAPs (Fig. 2). These genes were named using their species name followed by family name and the number for each gene. For example, the 47 members of the Rab family in O. sativa were named OsRab1 to OsRab47. However, in animals and yeast, the Ras GTPase family was found in addition to those four families (Fig. 2). The Ras family, the first family of the small GTPase superfamily to be discovered, was absent in rice. A previous report also showed that the Ras members were absent in Arabidopsis (76), and our result confirmed this observation (Fig. 2). Similarly, RasGAP was not detected in either plant species (Fig. 2). On the other hand, SarGAPs were separated from ArfGAPs in rice, Arabidopsis, and S. cerevisiae, indicating more divergence in these organisms.
Small GTPases showed common motif structures, whereas GAPs exhibited high divergence.
Although small GTPase genes could be classified into five groups, they shared common motif structures (Fig. 3A). All small GTPases contain five conserved motifs, named “G box” sequences. The G1 box is a purine nucleotide-binding signature with the structure aaaaGxxxxGK, where a = C, V, T, L, I, or M, and x = any amino acid. The G2 box is less conserved, providing major components of the effector-binding surface. However, this motif is highly conserved in the Ras family (Fig. 3, A and B). The G3 box, with the structure blbbDxxGQ (b = hydrophobic, and l = hydrophilic), is involved in binding a nucleotide-associated Mg2+. The G4 box [bbbb(N, T, or L)KxD] makes contact with the guanine ring through a hydrogen bond. The G5 box is the most divergent motif, making indirect associations with the guanine nucleotide. Among the five families of small GTPases, Ran and Rho GTPases were most conserved in five G boxes, and Rab GTPases exhibited more divergence (Fig. 3B), corresponding to their divergence in biological functions (see below). Aside from these domains specific to different families of small GTPases, this superfamily contained at least 16 other domains as listed in Fig. 3C. Some of the domains provide protein- or ion-binding sites or signal cassettes, whereas others act as regulators of their activation, making the superfamily more complicated with multiple functions.
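The G1 box signature described above (aaaaGxxxxGK, with a = C, V, T, L, I, or M) maps naturally onto a regular expression. The sketch below is illustrative only: the function name is hypothetical, and the short test sequence is a toy fragment chosen to contain the motif, not a sequence taken from this study.

```python
import re

# aaaaGxxxxGK: four residues from {C, V, T, L, I, M}, then G, any four residues, then GK.
G1_BOX = re.compile(r"[CVTLIM]{4}G.{4}GK")


def find_g1_box(protein):
    """Return (0-based start, matched motif) for the first G1 box, or None."""
    match = G1_BOX.search(protein)
    return (match.start(), match.group()) if match else None


toy_sequence = "MTEYKLVVVGAGGVGKSALT"  # toy fragment for illustration
hit = find_g1_box(toy_sequence)
```

The same translation (a fixed residue class per symbol, `.` for x) would apply to the G3 and G4 boxes, provided the hydrophobic and hydrophilic residue sets are defined explicitly.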
In contrast, GAPs showed more divergence. When all representative domain sequences were aligned together, they showed no perfectly conserved residue (data not shown). To outline the conserved motifs, the representative sequences within each family were aligned separately (Fig. 4). The alignment from RasGAP representatives showed four highly conserved blocks, blocks 1, 2, 3A, and 3B (Fig. 4A), similar to the previous report (52). All RhoGAPs contained a RhoGAP domain that exhibited three conserved blocks as shown in the alignment of representative domain sequences (Fig. 4B), confirming the previous report (43). Most of the RabGAPs had the TBC domain (58). This domain contained six highly conserved motifs named A–F (Fig. 4C), which were important for catalytic activity and for stabilizing the architecture (58). Unlike other GAP families, most of the ArfGAP domains contained a cysteine motif CxxCxxxxxxxxxxxxxxxxxCxxC (Fig. 4D), as described earlier (46). RanGAP is a leucine-rich repeat (LRR) domain-containing protein (Fig. 4E), and each LRR forms a short β-strand and a longer α-helix that result in a β-α hairpin motif (68). On the other hand, many predicted or characterized GAP genes also encode other domains, similar to the small GTPases. Some of these domains were present in all families of GAP, some of them were family specific, and others were species specific (7).
Families of small GTPase or GAP genes showed differences in family expansion and functional divergence.
Phylogenetic analysis showed that the eukaryotes had evolved with different sizes of families of small GTPase genes (Fig. 2). Among them, Rabs represent the largest family in general. For example, the Arabidopsis genome encodes 57 members of Rab GTPases, constituting more than one-half of all small GTPase genes. On the other hand, eukaryotes encode only one to four members of Ran GTPases. These results suggested that the eukaryotic organisms require different sizes of families of small GTPases.
To analyze the functional divergence within a family, phylogenetic analysis was carried out using the domain sequences from different organisms. The Ras members from only three nonplant organisms were employed for phylogenetic tree construction due to absence of this family in plants. The analysis revealed five subclasses (I–V) of the Ras family (Fig. 5). Among them, both NKIRAS1 and NKIRAS2 were human specific, and only two subclasses (I and II) were detected in S. cerevisiae (Fig. 5). The members that group into the same subclass might share a similar function or show less functional divergence. For example, Rsr1 in S. cerevisiae was related to human Rap1A with a similar function (47) and belonged to the same subclass (I) (Fig. 5). All Rho members from five organisms were also grouped into five subclasses (I–V in Fig. 5). Among them, Arabidopsis Rho GTPases were grouped into only one subclass (II), which was present in all five organisms. The human Rho GTPases have evolved into an extra subclass (V) with no homologs in other organisms. This class had three members: RND1, -2, and -3. In contrast, rice and Drosophila Rho GTPases were clustered into three subclasses (I, II, and III for rice and II, III, and IV for Drosophila), and the yeast had two subclasses (II and III) (Fig. 5).
Studies showed that the Rab family could be grouped into eight subclasses (56). However, nine subclasses were clustered in this study due to the presence of an extra subclass (VI) in Drosophila including CG9807, 32670, 32671, 32673, Rab9D, and RabX2 (Fig. 5). Among these nine subclasses, three were human and Drosophila specific, including subclasses IV, VII, and IX. Rice and Arabidopsis contained five subclasses (I, II, III, V, and VIII), which were present in both human and Drosophila. The class V was the largest one in Arabidopsis, containing three subfamilies as described (76). However, only three subclasses (II, V, and VIII) could be detected in the yeast, which were present in other eukaryotes (Fig. 5). Rice, human, and Drosophila contained five subclasses (I–V) of Arf GTPases (Fig. 5). However, no Arf GTPase of subclass III was found in Arabidopsis, and subclass IV was absent in the yeast (Fig. 5). No species-specific Arf GTPase was detected. On the other hand, Ran GTPases showed less divergence. In humans, only one Ran GTPase gene was detected, and other organisms contained two to four members and all of them were classified into one functional group (Fig. 5).
Compared with the Ras family, the RasGAP family was grouped into only four subclasses (I–IV) (Fig. 6). Among them, the RasGAP in Drosophila was species specific; no homolog was found in the human or the yeast genome (Fig. 6). In contrast, the RhoGAP family showed more divergence. Eukaryotes contained more members of the RhoGAP family than of the Rho family. Except for humans, all the other organisms analyzed here diversified into more subclasses of RhoGAPs than of Rho GTPases (Figs. 5 and 6). For example, only one class (II) of Rho GTPase family was found in Arabidopsis (Fig. 5), whereas three classes (II, III, and V) of RhoGAP genes were detected in the genome (Fig. 6). On the other hand, two classes (I and II) of RabGAP family were detected only in rice and Arabidopsis (Fig. 6), the potential candidates for plant-specific RabGAP subclasses. Both MGC16169 in human and CG4041 in Drosophila were grouped together, and no members in this class (IX) were found in other organisms (Fig. 6). As for the ArfGAP family, only three subclasses (I, III, and IV) were detected in Drosophila, and all other organisms possessed four classes even though five classes were clustered for this family (Fig. 6). The RanGAP family showed more divergence than the Ran GTPase family, and three classes (I, II, and III) were grouped (Fig. 6), two of which contained only one member (RanGAP1 in humans for class II and RanGAP in Drosophila for class III), and other members belonged to class I (Fig. 6).
Genome duplications represent the major mechanism for small GTPase and GAP gene expansions.
To analyze the distribution of small GTPase and GAP genes, we first mapped the chromosome location of the rice gene family members (Fig. 7A). These genes were mapped on all 12 chromosomes, indicating a wide distribution in the rice genome. Similar chromosomal distribution of these genes was observed in Arabidopsis as well (Fig. 7B). However, in animals, no small GTPase or GAP genes were detected on the Y chromosome and no GAP genes on human chromosome 21. On the contrary, differences were observed in the yeast, where chromosomes I, VIII, and X contained no small GTPase genes, and chromosomes I, III, and VII contained no GAP genes (Fig. 7B).
To determine the contribution of the whole genome duplication and reshuffling to the gene expansion of these two superfamilies, we compared the chromosome duplication pattern with the location of the small GTPase and GAP genes and their phylogeny. Figure 7A shows the localization of small GTPase and GAP genes in rice chromosomes. Among these genes, 53.6% of small GTPase and 65.9% of GAP genes were mapped on hypothesized duplication/reshuffling genome regions indicated by red color. In addition to this, 12.7% of small GTPase and 10.6% of GAP genes were tandemly clustered (blue in Fig. 7, A and B). A similar method was used for the analyses in the other four organisms. The result showed that 79.6% of small GTPase and 70.8% of GAP genes were distributed in duplicated regions of the Arabidopsis genome (Fig. 7B). In the human genome, most of them were located within repeated regions including short interspersed nuclear elements (SINEs) and long interspersed nuclear elements (LINEs), and some of them were located on segmental duplication regions. As a result, 93.3% of small GTPase and all GAP genes were mapped on duplicated or repeated genome regions (Fig. 7B). In Drosophila, most of these genes were located on tandemly duplicated chromosome regions, and some of them were on inter/intrachromosomal duplication regions. In total, 83.3% of small GTPase and 90% of GAP genes were found on duplicated genome regions (Fig. 7B). However, in the yeast genome, only 37.9% of small GTPase and 34.3% of GAP genes were located on duplicated regions, indicating some differences in gene expansions between the yeast and higher eukaryotes. In general, genome duplication due to segmental duplication or repeat elements significantly contributed to the gene expansions of small GTPase and GAP families, despite the differences in the percentages.
On the other hand, 3.5–12.9% of small GTPase genes in Arabidopsis, humans, and Drosophila and 4.6% of GAP genes in humans were tandemly clustered on chromosomes (Fig. 7B) and phylogenetic trees (data not shown), indicating a low-level contribution of tandem duplication of small GTPase and GAP genes to the family expansions.
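The percentages reported in this section amount to asking, for each mapped gene, whether its position overlaps an annotated duplicated region. The following is a minimal sketch under stated assumptions: genes and regions are represented as hypothetical (chromosome, start, end) tuples, any overlap counts as "located in" the region, and the coordinates are invented for illustration.

```python
def in_duplicated_region(gene, regions):
    """True if the gene interval overlaps any duplicated region on its chromosome."""
    chrom, start, end = gene
    return any(
        r_chrom == chrom and start <= r_end and end >= r_start
        for r_chrom, r_start, r_end in regions
    )


def percent_in_duplicated(genes, regions):
    """Percentage of genes that overlap at least one duplicated region."""
    if not genes:
        raise ValueError("no genes supplied")
    hits = sum(in_duplicated_region(g, regions) for g in genes)
    return 100.0 * hits / len(genes)


# Toy coordinates (invented for illustration).
regions = [("chr1", 100, 500), ("chr2", 1000, 2000)]
genes = [
    ("chr1", 120, 180),    # inside a duplicated region
    ("chr1", 600, 700),    # outside
    ("chr2", 1900, 2100),  # partial overlap, counted as inside
    ("chr3", 10, 50),      # chromosome without duplicated regions
]
pct = percent_in_duplicated(genes, regions)
```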
Large-scale expansions occurred after the divergence from their ancestors.
To infer the patterns of gene family expansions, we aligned the domain sequences from each of these superfamilies. The alignments were used to generate the phylogenetic trees shown in Fig. 8A. Subsequently, the phylogenetic tree was broken down into ancestral units, which were clades that were present before the divergence of these organisms according to the method described by Shiu et al. (67). The basal nodes of these ancestral units were labeled with solid red circles (Fig. 8A). We found that there were 10 small GTPase and 12 GAP ancestral units in the phylogenetic trees, based on all predicted genes in five organisms, respectively (Fig. 8A). The result indicated that the ancestral organism contained small families of small GTPase and GAP genes and suggested that the large-scale expansions occurred after the divergence. A similar method was used to estimate the gene numbers in the ancestral organism shared by humans, Drosophila, and S. cerevisiae. The analysis showed that the common ancestor of these three organisms contained 17 small GTPase and 16 GAP genes (labeled with solid black circles in Fig. 8A). Subsequently, we further analyzed the gene numbers of the common ancestors among the five organisms (Fig. 8B). The common ancestor of humans and Drosophila had the largest family size, containing 53 small GTPase and 45 GAP genes (Fig. 8B). The common ancestor of monocots and dicots still had small families, with 34 small GTPase and 33 GAP genes (Fig. 8B). These results confirmed that the large-scale gene expansions occurred after the divergence from their common ancestors.
GAP superfamily diverged much more rapidly than the small GTPase superfamily.
To understand the divergence of small GTPase and GAP gene superfamilies, nonsynonymous substitutions per site (Ka) and synonymous substitutions per site (Ks) and their ratios (Ka/Ks) were estimated for these two superfamilies (Fig. 9). The Ka/Ks ratios <1 may result from the elimination of most nonsynonymous substitutions by purifying selection, the ratios >1 indicate diversifying selection, and the ratios equal to 1 represent neutral selection (44). All five small GTPase and GAP families showed average Ka/Ks ratios with values <1 (Fig. 9, A and B), indicating that the domains from these two superfamilies were generally subjected to purifying selection. Furthermore, all small GTPase families showed ratios <0.5 (Fig. 9A), while two of the GAP families (RhoGAP and ArfGAP) showed values >0.5 (underlined in Fig. 9B). The average ratio was 0.346 for the small GTPase superfamily, and this ratio was increased to 0.493 in the GAP superfamily (Fig. 9, A and B). The higher mean Ka/Ks ratio implies that the GAP superfamily has generally diverged much more rapidly than the small GTPase superfamily. The implication was strengthened by another analysis where all average Ka/Ks ratios within organisms in the small GTPase superfamily were lower than their corresponding ratios in the GAP superfamily (Fig. 9, A and B). The percentage of Ka/Ks ratios with values <0.5 was 79.5% for the small GTPase superfamily, while the percentage was reduced to 63.7% for the GAP superfamily.
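The interpretation rule used above (ratio <1 purifying, equal to 1 neutral, >1 diversifying) and the share of ratios below 0.5 can be expressed in a few lines. This is a sketch under stated assumptions: the function names, the numerical tolerance for "equal to 1", and the Ka and Ks values are all illustrative and are not data from this study.

```python
def classify_selection(ka, ks, tol=1e-9):
    """Label a gene pair by its Ka/Ks ratio, following the rule in the text."""
    if ks <= 0:
        raise ValueError("Ks must be positive to form a ratio")
    ratio = ka / ks
    if abs(ratio - 1.0) <= tol:
        return "neutral"
    return "purifying" if ratio < 1.0 else "diversifying"


def fraction_below(ratios, threshold=0.5):
    """Fraction of Ka/Ks ratios strictly below the threshold."""
    if not ratios:
        raise ValueError("no ratios supplied")
    return sum(r < threshold for r in ratios) / len(ratios)


# Toy (Ka, Ks) pairs, invented for illustration.
pairs = [(0.10, 0.50), (0.30, 0.40), (0.60, 0.50), (0.20, 0.20)]
labels = [classify_selection(ka, ks) for ka, ks in pairs]
ratios = [ka / ks for ka, ks in pairs]
share = fraction_below(ratios)
```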
On the other hand, although all average Ka/Ks ratios within organisms were <1, we found some pairs in Rab and RabGAP families that showed ratios >1 when we analyzed the distributions of Ka/Ks ratios (Fig. 9, A and B). These observations suggested that both Rab and RabGAP showed much more rapid divergence than other families, and some members from these two families might be subjected to a positive selection. These might contribute to the expansions of these two families, and, as a result, evolution into the larger sizes of families in the two superfamilies.
Most divergence due to positive selection occurred in common ancestors.
To further understand the divergence of domains from these two superfamilies, we estimated all Ka/Ks ratios based on all genes within a superfamily. Some Ka/Ks ratios from human Rab and Arabidopsis RabGAP families showed values >1, indicating that divergence due to positive selection had occurred. Except for the Rab in humans and RabGAP in Arabidopsis, all ratios within an organism showed values <1 (Fig. 9, A and B), indicating that the divergence within an organism was under purifying selection. However, the Ka/Ks ratios between organisms were different from those within the organisms (Fig. 9, A and B). Because some Ka/Ks ratios from each family showed values >1, sliding window analyses were conducted based on all Ka/Ks ratios within a superfamily (Fig. 9C). The analysis clearly showed that some Ka/Ks ratios for both small GTPase and GAP superfamilies have values >1, indicating functional divergence due to the action of positive selection among organisms. The result suggested that most of the divergence due to positive selection occurred in common ancestors despite some positive selection in both human Rab and Arabidopsis RabGAP family.
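A sliding window analysis of this kind can be sketched as a moving average over an ordered series of Ka/Ks ratios, flagging windows whose mean exceeds 1. The text does not specify the window size, step, or ordering of the ratios, so those parameters, the function name, and the values below are assumptions made purely for illustration.

```python
def sliding_window_means(values, window, step=1):
    """Mean of each consecutive window of `window` values, advancing by `step`."""
    if window <= 0 or step <= 0:
        raise ValueError("window and step must be positive")
    return [
        sum(values[i:i + window]) / window
        for i in range(0, len(values) - window + 1, step)
    ]


# Toy series of Ka/Ks ratios (invented for illustration).
ratios = [0.2, 0.4, 1.5, 1.1, 0.3]
means = sliding_window_means(ratios, window=2)

# Indices of windows whose mean ratio exceeds 1, the signal read as positive selection.
flagged = [i for i, m in enumerate(means) if m > 1]
```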
Small GTPase and GAP genes are ubiquitous and ancient superfamilies.
Although small GTPase and GAP genes have been predicted in humans by genome-wide analysis (6, 16), such a complete set of information has not so far been available for other eukaryotes. For example, in the Drosophila genome, GAP genes were identified (6), but there was no report on small GTPases. In contrast, in the Trypanosoma brucei, S. cerevisiae, and Arabidopsis genomes, only small GTPase genes were detected (1, 22, 27, 30, 76), and no GAP genes were identified. Moreover, despite the availability of complete rice genome sequences (28, 83), there has been no report so far on the genome-wide identification of either small GTPase or GAP genes in rice. In addition, comparative and evolutionary analyses of the small GTPase and GAP superfamilies from different groups of organisms have not previously been reported. We have not only identified the small GTPases in Drosophila, the GAP genes in the budding yeast and Arabidopsis genomes, and both of these superfamilies in the rice genome, but have also compared these genes with human genes and analyzed their evolutionary significance.
Small GTPase and GAP genes are present not only in eukaryotes but also in at least seven prokaryotic genomes (19, 25, 55), indicating that these are ancient superfamilies. However, prokaryotes do not use small GTPases in the same way as eukaryotes do (11). The universality of these genes, their higher divergence between eukaryotes and prokaryotes, and their strong conservation within eukaryotes suggest that the major functional diversification of the small GTPases occurred after the separation from prokaryotic ancestors but before the diversification of present eukaryotic groups, as suggested by Jekely (38).
Expansion/contraction of small GTPase and GAP superfamilies and retention/loss of duplicates.
More than 90% of genes in the genomes of humans, mice, and rats were present as a single copy in the common ancestor of primates and rodents 65 to 110 million years ago (37, 50, 60). However, 10 small GTPase and 12 GAP genes were detected in the common ancestor of animals, plants, and yeast, indicating that expansion events had already occurred in an ancient era. On the other hand, the small ancestral sizes of the two superfamilies suggested that the large-scale expansions might have occurred after the divergence from their common ancestors. Subsequently, we surveyed the chromosomal distribution of members of the two superfamilies. The results showed that the majority of members mapped to duplicated regions, suggesting genome duplication as the major mechanism for the expansion of these two superfamilies. This finding was supported by the fact that larger genomes contained more small GTPase and GAP genes (Fig. 1B). The relatively small sizes of these superfamilies in yeast might be due to the very low gene duplication rate (26) and the low percentage of duplicated regions in the genome (81) compared with other genomes; as a result, fewer members were located on duplicated regions, and expansion was limited.
In general, the eukaryotes have evolved small GTPase families of different sizes (Figs. 2 and 8A). Among them, the Rab family is generally the largest, comprising nearly one-half of the small GTPase genes in some eukaryotic genomes. The Ran family is the smallest: only one member was detected in the human genome, two members in Drosophila and S. cerevisiae, and four members in plants. Similarly, the GAP superfamily was also grouped into five families in humans, Drosophila, and yeast, and no RasGAP could be detected in plants. However, inconsistent expansion was observed between small GTPases and their corresponding GAP families (Fig. 1C). For example, only 23 Rho genes were detected in the human genome, whereas 70 RhoGAP genes were predicted (Figs. 1C and 2). Except in S. cerevisiae, the largest family in the GAP superfamily is Arf/SarGAP rather than RabGAP. Therefore, duplication-based expansions are restricted to certain families rather than all families, and different families showed different expansion rates.
One may argue that various genomes contain small GTPase and GAP gene superfamilies of different sizes because of the presence of different numbers of pseudogenes in the corresponding genomes. During gene expansion, duplicated genes might have become nonfunctional pseudogenes. To test whether duplicated genes had evolved into pseudogenes, the Ka/Ks ratios of the pairs were estimated and tested statistically. A ratio of 0.5 was taken as a conservative criterion to test whether both copies of a gene duplicate are functional (73). The Ka/Ks ratios of the two superfamilies are shown in Fig. 9. All of the average Ka/Ks ratios for each organism were <0.5 (Fig. 9, A and B). We then tested statistically the null hypothesis that Ka/Ks ratios were equally likely to be <0.5 or >0.5. The calculated C-value (see materials and methods) is 18.51 for the small GTPase superfamily and 8.02 for the GAP superfamily, indicating that the probability of the null hypothesis is very low (P ≪ 0.001 and P < 0.01, respectively). Therefore, in most of the pairs both copies of the duplicated genes are functional. The results were supported by the fact that >90% of the yeast small GTPase and GAP genes were experimentally shown to be functional, based on a database search of Gene Ontology (http://www.geneontology.org/). Among the 111 rice small GTPase genes, 85 were found to have full-length cDNA sequences (Supplemental Table S1) by searching 28,000 KOME full-length cDNA sequences (42). Similarly, 61 of the 85 rice GAP genes were found to have cDNA sequences (Supplemental Table S2). In Arabidopsis and Drosophila, at least 76.7% of genes were found to have cDNA sequences (Supplemental Tables S3 and S4). These findings indicate that pseudogenes are not likely to be the major contributing factor in the size differences among the five organisms.
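The logic of testing whether ratios fall below 0.5 more often than chance would predict can be sketched with a one-sided binomial sign test. This is a stand-in chosen for illustration and is not the C-statistic described in materials and methods; the function name and the example counts are hypothetical.

```python
from math import comb


def sign_test_p_value(n_below, n_total):
    """One-sided P(X >= n_below) for X ~ Binomial(n_total, 0.5).

    Null hypothesis: a Ka/Ks ratio is equally likely to fall
    below or above the 0.5 cutoff.
    """
    if not 0 <= n_below <= n_total:
        raise ValueError("n_below must lie between 0 and n_total")
    # Sum the upper tail of the binomial distribution exactly.
    tail = sum(comb(n_total, k) for k in range(n_below, n_total + 1))
    return tail / 2 ** n_total


# Hypothetical counts, not data from the study:
# 80 of 100 gene pairs with Ka/Ks < 0.5.
p_value = sign_test_p_value(80, 100)
```

A small P value under this sketch would point the same way as the conclusion above, namely that ratios below 0.5 predominate among duplicated pairs.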
On the other hand, although gene duplications occurred frequently, most duplicates were lost during evolution (45). Most gene families were small, and large families were exceptions rather than the norm (66). This study showed that large-scale duplications mainly contributed to gene expansion, and some families exhibited less or no expansion. At least four large-scale duplication events occurred 100–200 million years ago (MYA), approximately at the time of the divergence between monocots and dicots (77). Thus the common ancestor of rice and Arabidopsis might have possessed 160 small GTPase and 192 GAP genes after four rounds of whole genome duplication, based on the presence of 10 small GTPase and 12 GAP genes before the divergence of plants, animals, and yeast. However, our analysis showed that the common ancestor of rice and Arabidopsis might have contained only 34 small GTPase and 33 GAP genes (Fig. 8). These facts suggested that, after duplication of these genes, most of them might have been lost and only a small number remained.
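The arithmetic behind these figures can be written out as a short Python sketch. It assumes, as the text implicitly does, that each whole genome duplication doubles the gene count with no loss between rounds; the function names are hypothetical.

```python
def expected_after_wgd(ancestral_genes, rounds):
    """Gene count after `rounds` whole genome duplications with no loss."""
    return ancestral_genes * 2 ** rounds


def retained_fraction(observed, expected):
    """Share of the no-loss expectation that is actually observed."""
    return observed / expected


ROUNDS = 4  # large-scale duplication events cited in the text

small_gtpase_expected = expected_after_wgd(10, ROUNDS)  # 10 * 2**4 = 160
gap_expected = expected_after_wgd(12, ROUNDS)           # 12 * 2**4 = 192

# Estimated counts in the rice/Arabidopsis common ancestor (Fig. 8).
small_gtpase_retained = retained_fraction(34, small_gtpase_expected)  # 0.2125
gap_retained = retained_fraction(33, gap_expected)                    # 0.171875
```

Under these assumptions, roughly one-fifth of the duplicated small GTPase genes and about one-sixth of the duplicated GAP genes would have been retained.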
Because the rates of both duplication and loss are generally high, it is of interest to consider how natural selection appears to have differentially retained duplicated genes in different families of small GTPase and GAP genes. After gene duplication, one copy might be silenced by a deleterious mutation, or it might degenerate and ultimately disappear because of the absence of any selective constraint within the genome (65). Alternatively, both copies may acquire complementary degenerative mutations (23). A third possibility is that the duplicated genes quickly changed to produce novel functions or a subset of the original functions. On the basis of the analysis of Ka/Ks ratios, the small-sized families usually possess lower ratios; thus most duplicated genes may have degenerated, and a few of them may have been retained because of complementary degenerative mutations. As a result, these families remained small. On the other hand, larger-sized families usually have higher Ka/Ks ratios, and the distribution of their Ka/Ks ratios is also wide, containing a higher percentage of ratios >1. Therefore, a higher percentage of duplicated genes was retained because of novel biological functions, making these families larger over their long evolutionary history.
The loss of the Ras family in the plant kingdom has been previously reported (76, 80). However, the loss of RasGAP in rice and Arabidopsis has not previously been reported. In the phylogenetic tree shown in Fig. 8A, the RasGAP family from nonplants was closely related to the RabGAP family. In the RabGAP family, two classes of members (class I and II, as shown in Fig. 6) were plant specific. Why have both rice and Arabidopsis evolved these two classes of proteins? One implication is that the evolution of specific types of RabGAP proteins in higher plants may have been an adaptation to compensate for the loss of function of RasGAP proteins. Another implication is that some families showed no expansion (for example, Ran and RanGAP in humans) or less expansion (Ran and RanGAP in plants as well as the Rho family), and others showed high expansion. As a result, some gene lineages produced many descendants, whereas others produced fewer.
Evolutionary history of small GTPase and GAP superfamilies in eukaryotes.
Our analysis indicated that the small GTPase and GAP genes in eukaryotes belong to large superfamilies. However, these superfamilies were likely small in the common ancestor of animals, plants, and fungi. In this study, we estimated the numbers of small GTPase and GAP genes in the most recent common ancestor (MRCA) of the five organisms to be 10 and 12, respectively (Fig. 8, A and C). After the divergence of plants and fungi/animals around 1,200 MYA (21), the numbers in the MRCA of human, Drosophila, and yeast had increased to 17 for small GTPase genes and 16 for GAP genes, indicating a slow evolutionary expansion. After the divergence of vertebrates and invertebrates about 650 MYA (31), the MRCA of human and Drosophila might have contained 53 small GTPase and 45 GAP genes. The MRCA of rice and Arabidopsis, after the divergence of monocots and dicots around 150 MYA (12), might have had only 34 small GTPase and 33 GAP genes. Because of possible independent gene losses, our estimates of the ancestral gene numbers should be treated as lower bounds. On the other hand, studies showed that the small GTPase and GAP superfamilies in prokaryotic genomes were much smaller, and a genome usually contained only one or two members of each superfamily (19, 25, 55) despite the long evolutionary history. These facts suggested that the common ancestor of prokaryotes and eukaryotes might have contained only one small GTPase gene and one GAP gene, and that the expansions were largely restricted to the eukaryotes.
We are grateful for the assistance of Australian-born education writer Geoff Lemon in proofreading and editing this document.
1 The Supplemental Material for this article (Supplemental Tables S1–S5) is available online at http://physiolgenomics.physiology.org/cgi/content/full/00210.2005/DC1.
Address for reprint requests and other correspondence: S. Ramachandran, Temasek Life Sciences Laboratory, 1 Research Link, National Univ. of Singapore, Singapore 117604 (e-mail:).
Article published online before print. See web site for date of publication (http://physiolgenomics.physiology.org).
- Copyright © 2006 the American Physiological Society