## Abstract

Here we present an amino acid translation program designed to suggest the position of experimental frameshift errors and predict amino acid sequences for full-length cDNA sequences having phred scores. Our program generates artificial insertions into and artificial deletions from low-accuracy positions of the original sequence, thereby generating many candidate sequences. The validity of the most probable sequence (the likelihood that it represents the actual protein) is evaluated by using a score (V_{a}) that is calculated in light of the Kozak consensus, preferred codon usage, and position of the initiation codon. To evaluate the software, we have used a database in which, out of 612 cDNA sequences, 524 (86%) carried 773 frameshift errors in the coding sequence. Our software detected and corrected 48% of the total frameshift errors in 62% of the total cDNA sequences with frameshift errors. The false positive rate of frameshift correction was 9%, and 91% of the suggested frameshifts were true.

- phred score
- Kozak consensus
- codon usage
- initiation codon
- base-call error

A large-scale sequencing effort has been started at the Institute of Physical and Chemical Research (RIKEN) that includes compiling the complete coding sequence (CDS) of all the mouse full-length cDNAs. Base-call error in the CDS, especially frameshift error, is a relevant problem when trying to deduce amino acid sequences from uncorrected cDNA or genomic sequences. Complete determination (finishing) of these sequences is time-consuming, and information about the predicted position of frameshift error is useful when designing primers for the finishing step. In addition, having a first view of the possible amino acid sequence facilitates study and classification of genes before the finishing steps.

Several methods have been developed to identify and correct sequencing-derived frameshift errors. Iseli and colleagues (9) have developed ESTScan, a program for detecting and reconstructing the coding sequence in expressed sequence tags (ESTs) that may carry frameshift errors. For predicting the CDS, this method adopted a novel hidden Markov model, which is robust against frameshift error. However, most computer software programs designed to correct frameshifts in DNA sequences have been developed for analyzing genomic DNA rather than full-length cDNA sequences (11, 13, 15). Efforts to determine all the full-length cDNAs of an organism, such as the RIKEN Full-Length cDNA Encyclopedia Project, require high-quality correction of unfinished cDNA sequences to maximize the accuracy of the putative amino acid translation predictions.

The conventional approach to correct sequence errors tries to identify frameshift errors through exon predictions of the six frames of the experimental sequence. Exon prediction can be achieved through homology search against known genes or genomic sequences of closely related organisms. Alternative exon prediction methods (e.g., the Markov model) are more elegant and incorporate amino acid frequency, preferred codon usage, and other statistical information. Regardless of the method used, the final result is that two, shorter open reading frames (ORFs) are merged by changing frames, thereby creating a single, longer ORF. However, the conventional method is associated with two main disadvantages.

The first problem with the conventional method of exon prediction is that this method may predict two or three exons, reflecting different frames, for the same region. In addition, the regulatory mechanism of translation may require a low preferred codon usage in the CDS to regulate the amino acid translation for some sequences. In this situation, conventional methods have difficulty predicting the exons that comprise the corrected CDS.

The second problem is that traditional software programs for exon prediction are designed to correct frameshifts in light of information from text-base (sequence of bases) analysis of DNA sequences. However, these programs fail to account for the quality of the data (e.g., the clarity of the gel and length of run) that gave rise to the inaccurate sequence. Both the text-base and sequence quality information are very important components of the correction process.

To overcome these problems, we developed a new trial-and-error method of exon prediction that is designed especially for full-length cDNA sequence data. Our method artificially inserts bases into and/or deletes bases from the experimental sequence and predicts a candidate CDS for every trial insertion and deletion. If the experimental frameshift error is removed by the trial insertion/deletion, then a CDS-like ORF will emerge. The most convincing sequence is then chosen from the various candidates.

This method circumvents the principal problems of conventional exon prediction schemes in the following ways. First, it incorporates additional information that is specific to full-length cDNA, such as the length of the ORF, Kozak consensus (10), and length of the 5′ untranslated region (UTR). Second, our method incorporates information about the quality of the uncorrected sequence data. A numerical score, the phred score, represents the sequence quality. This information is useful for suggesting the position of a frameshift error (5, 6). Furthermore, a quantitative index for evaluating the reliability of amino acid translation would be helpful when using the resulting hypothetical proteins. To this end, our method estimates the validity of the predicted CDS (the likelihood that the predicted CDS is the corrected CDS) in light of the phred score of the base at which the putative frameshift occurs.

## METHOD

Removing a frameshift that arose from a sequencing error requires inserting or deleting a variable number of bases. Designed to correct frameshifts in this way, our computer program is composed of three principal steps. The first step generates all possible sequences created after one or two bases are inserted into or deleted from the site of the presumed frameshift. The second step calculates the V_{a} score, which represents the likelihood that the modified sequence is the correct sequence. In the third step, the V_{a} score is used to choose the most probable sequence among the various candidates. The V_{a} score reflects the Kozak consensus (10), preferred codon usage (1, 4, 8), and the position of the initiation codon. The V_{a} score is calculated for all candidate-complete CDSs, which are the sequences between any ATG and any stop codon. The candidate CDS with the lowest V_{a} score is chosen as the corrected CDS.

Let us suppose the existence of a set of an infinite number of theoretically allowed CDSs that start with ATG and end with a stop codon. These sequences are sorted in light of preferred codon usage, Kozak consensus, and the position of the initiation ATG. Both the sequence of a codon that is used frequently and the sequence of a good Kozak consensus have a low index, *t*. The most probable CDS-like sequence has an index number of 0, and the less probable CDS-like sequence receives an index of 1. If the candidate sequence has the index *P*, then the V_{a} score of the candidate sequence is defined as V_{a} = ln *P*. Thus the V_{a} score represents the log of the probability that one could obtain by chance a sequence more CDS-like than the candidate sequence in the sorted sequences. Note that the V_{a} score is not the log of the probability that the sequence carries CDS, since our method is based on the rank order of sequences. The V_{a} includes a frameshift penalty that depends on the phred score at the artificial frameshift position if the artificial insertion(s) and/or deletion(s) is added to the candidate sequence. The V_{a} of the candidate CDS is defined as

$$V_{a} = \ln P_{\text{codon-pref}} + \ln P_{\text{Kozak}} + \ln P_{\text{atg}} + P_{\text{frameshift}} \tag{1}$$

$$\ln P_{\text{codon-pref}} = \sum_{i=w_{1}}^{w_{2}} \ln P_{\text{codon-pref}}(i) \tag{2}$$

[*Eq. 3*: definition of *P*_{frameshift} in terms of *P*_{err}(*j*) and *D*_{penalty}; not recoverable here]

where *P*_{codon-pref}, *P*_{Kozak}, and *P*_{atg} are the probabilities of generating a random sequence that shows a more preferable codon usage, Kozak consensus, and ATG position, respectively, than the candidate sequence that is generated from the unadjusted cDNA. In *Eq. 2*, *i* is the base position of the cDNA sequence, *w*_{1} is the position of one of the ATGs, and *w*_{2} is the position of one of the stop codons in the sequence when we focus on a single ORF. Note that the value of *i* changes every three bases. The higher the preferred codon usage, the better the sequence. All events (preferred codon usage, Kozak consensus, and ATG position) are assumed to be independent of each other. In *Eq. 3*, *j* is the base position at which the artificial insertion/deletion occurs. *P*_{err}(*j*) is the probability of a base-call error at frameshift position *j*.

To reduce the computation time, the number of artificial insertions/deletions is restricted to no more than the number of experimental frameshift errors. The experimental frameshift error is expected to occur around bases for which base-calling was not reliable. The reliability of the base-call is the so-called phred score (Q), which is calculated from the dispersion of the distance and height of the signal peaks of each base and the distance from the nearest nonassigned base (“*N*”). To restrict the number of artificial insertions/deletions, a penalty value is added to V_{a} for each artificial insertion/deletion, and the penalty value (*P*_{frameshift}) depends on the phred score of the position at which the artificial insertion/deletion occurs.

The penalty for frameshifts (*P*_{frameshift}) is not a probability but rather an empirical parameter. The probability of a frameshift can be calculated from *P*_{err}, which is a parameter of an experimental process. In contrast, the other probabilities (*P*_{codon-pref}, *P*_{Kozak}, and *P*_{atg}) are parameters of the artificial process (random sequence generation); note that *P*_{codon-pref}, *P*_{Kozak}, and *P*_{atg} are not the probabilities that the sequence could be the true CDS, because the actual sequence is not mathematically random. Therefore, the penalty for frameshift (*P*_{frameshift}) cannot be treated in the same way as the parameters *P*_{codon-pref}, *P*_{Kozak}, and *P*_{atg}. However, it is natural to introduce the following relation: the lower the probability of base-call error (*P*_{err}), the greater the penalty for frameshift (*P*_{frameshift}). *Equation 3* is an equation that satisfies this relation.

*P*_{err} in *Eq. 3* is calculated from Q by the definition (5, 6)

$$P_{\text{err}} = 10^{-Q/10} \tag{4}$$

In *Eq. 3*, *D*_{penalty} is a parameter that must be optimized for maximal prediction accuracy.
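As a minimal sketch of the relation between the phred score and *P*_{err}, assuming the standard phred definition Q = −10 log10(*P*_{err}), the conversion can be written as follows. The function names are illustrative and are not taken from the program described here.

```python
import math


def phred_to_perr(q):
    """Base-call error probability for phred score q: P_err = 10^(-Q/10)."""
    return 10.0 ** (-q / 10.0)


def perr_to_phred(p_err):
    """Inverse conversion: phred score for an error probability p_err."""
    return -10.0 * math.log10(p_err)


# Q = 20 corresponds to a base-call accuracy of about 99% (P_err = 1%).
print(phred_to_perr(20))
print(perr_to_phred(0.001))
```

A lower Q means a larger *P*_{err}, so low-Q positions are the natural places to try an artificial insertion or deletion.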

To estimate the number of sequences generated by the trial-and-error process, let the maximum number of insertions and/or deletions be two and the number of base pairs in the input sequence be *N*. The program generates one sequence with no artificial frameshift, 2 × *N* sequences with a single artificial insertion or deletion, and 2 × *N* × (*N* − 1) sequences with two frameshifts by combining insertions and deletions.

The program thus generates a total of 1 + 2*N* + 2*N*(*N* − 1) sequences, and V_{a} is calculated for all frames of these sequences. In actual use, the number of insertions and deletions per test sequence is limited to two or four. For three and/or four insertions/deletions, the trial-and-error procedure is divided into two steps. The trial using one and two artificial frameshifts is performed first, and the most probable sequence is chosen. Then the additional artificial frameshifts are applied to the sequence generated in the previous step, and the most probable sequence is chosen from this second set of trial sequences.
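The size of this search space can be illustrated with a short sketch. It assumes that one deletion and one insertion are tried at each of the *N* positions and that the inserted base is written as "N"; both choices, and the function names, are assumptions made for illustration rather than details confirmed by the text.

```python
def single_edits(seq, inserted="N"):
    """All sequences obtained by one deletion or one insertion (2 x N in total)."""
    variants = []
    for i in range(len(seq)):
        variants.append(seq[:i] + seq[i + 1:])         # delete base i
        variants.append(seq[:i] + inserted + seq[i:])  # insert before base i
    return variants


def candidate_count(n):
    """Number of trial sequences for at most two artificial frameshifts."""
    return 1 + 2 * n + 2 * n * (n - 1)


print(len(single_edits("ATGC")))  # 8 single-edit variants for N = 4
print(candidate_count(4))         # 33
print(candidate_count(2370))      # 11233801
```

For a cDNA of 2,370 bp the count is about 1.1 × 10^7, which shows why the number of candidates grows quadratically with sequence length when two artificial frameshifts are allowed.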

*P*_{codon-pref}, *P*_{Kozak}, and *P*_{atg} are derived from the statistics of the actual cDNA database. To determine *D*_{penalty}, we performed the CDS prediction for a data set of known genes while changing *D*_{penalty} to find the optimal value for this variable.

We prepared a test data set from 2,815 complete CDS mouse cDNAs in the National Center for Biotechnology Information (NCBI) nonredundant database. The minimum length of the 5′-UTR is set at 25 bp. All statistical data in the following sections are derived from this data set. The base composition of the CDS is: A, 25.9%; G, 26.4%; C, 25.9%; and T, 21.8%. In the 5′-UTR, the base composition is: A, 20.8%; G, 29.2%; C, 29.8%; and T, 20.3%. The populations of A, G, C, and T are almost equivalent in the CDS, whereas the 5′-UTR is GC rich. We assume that the populations of A, G, C, and T in the CDS region are equivalent.

The data of the complete CDS mouse cDNAs of the NCBI nonredundant database are not necessarily full length. Since Suzuki and colleagues (14) have succeeded in making and analyzing a 5′-rich, full-length-rich cDNA library, we adopted their statistics.

Table 1 shows the various possible Kozak sequences that can occur around the initiation codon and the probability of randomly generating a sequence with a more preferable consensus (*P*_{Kozak}) than that of each string in Table 1. The consensus is denoted as XnnatgY (X, Y = A, G, C, T), where X and Y are the bases whose identities are in question. GnnatgG is the second-most preferred consensus among the 16 options; therefore, *P*_{Kozak} of GnnatgG equals 2/16, which means that a consensus that is more preferable than GnnatgG occurs at the probability of 2/16. The numbers in the second column of Table 1 are generated by using the statistics of Suzuki et al. (14), and the numbers in the third column are calculated by using our test data set. Because *P*_{Kozak} as defined by the statistics of Suzuki et al. (14) differs from that defined by our data set, we examined both parameters in the test calculation.
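The rank-based definition of *P*_{Kozak} can be sketched as follows. The counts below are hypothetical placeholders, not the values of Table 1; they are arranged only so that GnnATGG ranks second and TnnATGC ranks last, as stated in the text. The function name is likewise an assumption.

```python
from itertools import product


def rank_probabilities(counts):
    """Map each context to rank/n, where rank 1 is the most frequent context."""
    ordered = sorted(counts, key=counts.get, reverse=True)
    n = len(ordered)
    return {ctx: (rank + 1) / n for rank, ctx in enumerate(ordered)}


# The 16 contexts XnnATGY with X, Y in {A, G, C, T}.
contexts = [x + "nnATG" + y for x, y in product("AGCT", repeat=2)]

# HYPOTHETICAL ordering and counts, for illustration only.
fixed = ("AnnATGG", "GnnATGG", "TnnATGC")
order = ["AnnATGG", "GnnATGG"] + [c for c in contexts if c not in fixed] + ["TnnATGC"]
counts = {ctx: 1000 - 10 * i for i, ctx in enumerate(order)}

p_kozak = rank_probabilities(counts)
print(p_kozak["GnnATGG"])  # 0.125, i.e. 2/16
print(p_kozak["TnnATGC"])  # 1.0, i.e. 16/16
```

Because only the rank order enters the result, two sets of counts that sort the 16 contexts in the same order give identical *P*_{Kozak} values.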

Table 2 illustrates how our software “chooses” which ATG to use as the initiation codon and the probability of a randomly generated sequence that has a more preferable initiator ATG (*P*_{atg}) than that of the candidate sequence; “*n*” means that the initiation codon is the *n*th ATG from the 5′ end in all frames. In 60.3% of transcripts, the initiation codon is the first ATG from the 5′ end of the unadjusted sequence, and in 18.1% of transcripts, the initiation codon is the second ATG from the 5′ end. The upstream (5′) ATG is preferred to that downstream, and the average length of the 5′-UTR is 151.6 bp. We are unsure whether the specific ATG or the length of the 5′-UTR is more important in determining the site of initiation. We chose to adopt the specific ATG, and the denominator of *P*_{atg} is set arbitrarily at 7 for simplicity. The factor 7 does not affect the prediction result, because this factor is a constant in *Eq. 1*.

*P*_{codon-pref} is summarized in Table 3. The codons are sorted by the preferred codon usage. Information about preferred codon usage is available by using GCG (8) and at the TransTerm web site (**http://uther.otago.ac.nz/Transterm.html**) (1, 4). Note that we order in light of the preferred codon usage instead of the frequency of codon usage and that *P*×64 (Table 3) is 64 times *P*_{codon-pref}. If the codon includes an unclear base (“*N*”), then *P*_{codon-pref} is set at 32/64 for simplicity. *P*_{Kozak} is not independent of *P*_{codon-pref}; therefore, the probability in *Eq. 1* is double counted. In addition, the Kozak consensus includes the first base of the next codon (e.g., the last **G** of AnnATG**G** belongs to the following codon). When we applied our method to proteins of more than 30 amino acids, the double count negligibly affected V_{a} and therefore is ignored in *Eq. 1*.
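The codon-preference term can be sketched as a sum of log rank fractions over the codons of a candidate CDS, with the 32/64 rule for codons that contain "N". The rank table below is a hypothetical excerpt that holds only the two ranks quoted in the text (GAG second, CTA 57th); a full table would need all 64 codons from Table 3. The function name and the choice of which codons to pass in are assumptions.

```python
import math

# HYPOTHETICAL excerpt of the rank table (1 = most preferred, 64 = least).
CODON_RANK = {"GAG": 2, "CTA": 57}


def ln_p_codon_pref(codons, rank=CODON_RANK):
    """Sum of ln P_codon-pref over the given codons."""
    total = 0.0
    for codon in codons:
        if "N" in codon:
            p = 32 / 64            # unclear base: neutral value from the text
        else:
            p = rank[codon] / 64   # rank fraction among the 64 codons
        total += math.log(p)
    return total


print(ln_p_codon_pref(["CTA"]))         # ln(57/64)
print(ln_p_codon_pref(["GAG", "ANG"]))  # ln(2/64) + ln(32/64)
```

Preferred codons have a small rank fraction and so contribute a strongly negative term, which lowers V_{a} for CDS-like reading frames.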

Let us consider an example of the calculation of the V_{a} score. Let us suppose there exists a cDNA sequence 5′-TnnATGCTATGAnnGnnATGGAGTGA-3′ that includes the two candidate CDSs *Seq1* (TnnATGCTA) and *Seq2* (GnnATGGAG).

The region of the initiation codon offers 16 variations of the Kozak sequence in the form of XnnATGY, where X and Y are each one of four bases (A, G, C, and T). TnnATGC is the rarest Kozak consensus among the 16 options (Table 1); all other sequences are more likely. Thus *P*_{Kozak} for *Seq1* is equal to 16/16. GnnATGG is the second-best consensus around the initiation codon among the 16 sequences, so *P*_{Kozak} for *Seq2* is equal to 2/16. Because *Seq1* is upstream of *Seq2*, *P*_{atg} of *Seq1* is 1/7 and that of *Seq2* is 2/7. CTA in *Seq1* is a rare codon whose frequency of occurrence is the 57th among those of the 64 codons. However, GAG of *Seq2* is the second-most frequent codon among the 64. Therefore, *P*_{codon-pref} of CTA is 57/64, and that of GAG is 2/64. To summarize, *P*_{Kozak}, *P*_{atg}, and *P*_{codon-pref} for each frame are *P*_{Kozak} = 16/16, *P*_{atg} = 1/7, and *P*_{codon-pref} = 57/64 for *Seq1*; and *P*_{Kozak} = 2/16, *P*_{atg} = 2/7, and *P*_{codon-pref} = 2/64 for *Seq2*. The V_{a} scores of *Seq1* and *Seq2* are V_{a}(*Seq1*) = ln(16/16) + ln(1/7) + ln(57/64) = −2.06; and V_{a}(*Seq2*) = ln(2/16) + ln(2/7) + ln(2/64) = −6.80 (−0.90 and −2.95, respectively, in base-10 logarithms). Thus *Seq2* is more CDS-like than *Seq1*.
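The worked example can be checked with a few lines of code. This is a sketch that evaluates the sum of logs with natural logarithms and takes the three probabilities directly from the text; the function name and the optional penalty argument are assumptions, and the frameshift penalty is set to zero because no artificial insertion or deletion is involved. The ranking of the two candidates is the same in any logarithm base.

```python
import math


def va_score(p_kozak, p_atg, codon_prefs, frameshift_penalty=0.0):
    """V_a as a sum of logs of the rank probabilities plus a frameshift penalty."""
    return (
        math.log(p_kozak)
        + math.log(p_atg)
        + sum(math.log(p) for p in codon_prefs)
        + frameshift_penalty
    )


va_seq1 = va_score(16 / 16, 1 / 7, [57 / 64])  # TnnATGCTA
va_seq2 = va_score(2 / 16, 2 / 7, [2 / 64])    # GnnATGGAG

print(round(va_seq1, 2))  # -2.06
print(round(va_seq2, 2))  # -6.8

# The candidate with the lowest V_a is taken as the most CDS-like.
best = min([("Seq1", va_seq1), ("Seq2", va_seq2)], key=lambda item: item[1])
print(best[0])  # Seq2
```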

#### Preparation of test sample data.

To determine the parameter *D*_{penalty} in *Eq. 3* and to evaluate the prediction accuracy of our method, we needed two sequence sets. One consisted of test sequences with phred scores and frameshift error, and the other contained reference sequences without base-call error. As mentioned in the previous section, *D*_{penalty} is a parameter that must be optimized to give the maximal prediction accuracy.

For the set of reference sequences, we downloaded 612 known mouse complete CDS sequences from the NCBI nonredundant database. The average length of the cDNAs in the test data set is 2,370 bp. We generated the test sequences from these reference sequences by using a Monte Carlo simulation technique. Because phred scores were not available with the sequence data in the NCBI database, we first prepared a phred score for each base of the sequences by using a random number generator. We set the base-call accuracy of the test data at ∼99% (Q ∼ 20), in other words, average *P*_{err} = 1%.

Next, we generated artificial frameshift errors for each reference sequence, and the probability of insertion and deletion depends on the phred score. Note that not all the base-call errors are frameshift errors. We checked the incidence of insertion, deletion, and substitution error in our experimental electropherograms. We observed 51 deletion errors, 38 insertion errors, and 250 substitution errors, for a total of 339 base-call errors; thus deletion errors accounted for 15.0% of all base-call errors, insertion errors accounted for 11.2%, and substitution errors accounted for 73.8%. A stop codon can appear in the ORF by substitution error, and a stop codon can be lost by substitution error [TAA (stop) → AAA (lysine) by T → A substitution]. The probability of stop-codon occurring by substitution error is 4%, and the probability of loss of the stop codon by substitution error is 1–2%, so we ignored the substitution error. Artificial mutation was only the deletion of a base and the insertion of “*N*.” We adopted the following conditions to generate the mutation: 1/6 *P*_{err} (16.7%) is the probability of deletion, 1/6 *P*_{err} (16.7%) is the probability of insertion, and 4/6 *P*_{err} is the probability of substitution error, which is ignored. For simplicity, the factor 1/6 is adopted instead of the actual rate of deletion (15.0%) and insertion (11.2%).
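A minimal sketch of this mutation step is given below, assuming one random draw per base, a deletion probability of *P*_{err}/6, an insertion probability of *P*_{err}/6 (an "N" placed before the base), and no substitutions. The function name, the placement of the inserted "N", and the way the phred scores are supplied are assumptions; the text does not specify them.

```python
import random


def simulate_frameshifts(seq, quals, rng):
    """Return (mutated_seq, n_deletions, n_insertions) for one Monte Carlo pass."""
    out = []
    n_del = 0
    n_ins = 0
    for base, q in zip(seq, quals):
        p_err = 10.0 ** (-q / 10.0)   # phred score to error probability
        r = rng.random()
        if r < p_err / 6.0:           # deletion: drop this base
            n_del += 1
            continue
        if r < 2.0 * p_err / 6.0:     # insertion: add an unassigned base "N"
            out.append("N")
            n_ins += 1
        out.append(base)              # substitutions are ignored
    return "".join(out), n_del, n_ins


rng = random.Random(0)                # fixed seed for a reproducible run
seq = "ATGGAGCTA" * 20
quals = [20] * len(seq)               # Q = 20, i.e. P_err = 1% per base
mutated, n_del, n_ins = simulate_frameshifts(seq, quals, rng)
print(len(seq), len(mutated), n_del, n_ins)
```

The length of the mutated sequence always equals the original length minus the deletions plus the insertions, which gives a simple consistency check on the simulation.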

Table 4 shows the population of our test sequences with frameshift error in CDS. Of the 612 cDNA sequences in our test set, 524 (86%) cDNAs carry 773 frameshift errors in the CDS; 14% of CDSs are free of frameshift error. These statistics do not include the insertions and deletions in the 5′- and 3′-UTRs.

In addition, we prepared a set comprising the actual experimental data of 17 sequences (Table 5) from known genes with complete CDSs to show the actual prediction accuracy of the program. These genes were sequenced by using the primer-walking method and ABI377 sequencers (Perkin-Elmer Biosystems, Foster City, CA; **http://www.pebio.com/**). All of the sequences in this data set include frameshift error. The phred score was calculated from the gel-image file of the ABI377 sequencers.

## RESULTS

The prediction accuracy of our program is evaluated by counting the number of predicted amino acid sequences for which percent identity to the correct amino acid sequence is higher than the threshold. We set the identity threshold at 100%, 98%, 95%, 90%, and 85%. After testing various values for *D*_{penalty} to identify the optimal value, this parameter was set at 80. Note that we tried both *P*_{Kozak} parameters (as defined by the statistics of Suzuki et al., Ref. 14, and by our statistics) in the following prediction calculation and found that changing how the *P*_{Kozak} parameter is defined does not affect the prediction result at all. The program uses only the rank order of the consensus sequences in Table 1, not their populations, and the difference in rank order between the two sets of parameters is negligible.

Table 6 shows the prediction accuracy of several methods. “Longest frame” refers to the sequence of the longest single ORF of the unadjusted cDNA data. For comparison, we used Genscan, which was developed for analyzing genomic DNA rather than cDNA (2, 3). This program is one of the best exon-prediction programs, and it can be applied to cDNA analysis, offering prediction of the part of the CDS, which is better than the simple choice of the longest frame. Genscan is designed to work with promoter prediction statistics to identify components of the promoter sequence (e.g., the TATA box and the cap site), after which it performs exon prediction. Because cDNA transcripts do not contain the promoter site, Genscan is limited in its applications with cDNA.

We named our program DECODER. As shown in the right column of Table 6, 68–70% of the proteins predicted by DECODER are at least 85% identical to the correct proteins. When the required prediction accuracy is higher than 85%, DECODER has the highest prediction accuracy among the three methods we evaluated. Under the boundary condition of 85% identity, the Genscan result is almost equivalent to that of DECODER. However, when a higher accuracy is required, DECODER achieves a much better prediction score than Genscan. Of the amino acid sequences predicted by DECODER, 35–36% are at least 98% identical to the sequence of the actual protein product, and 43% show ≥95% identity (Table 6). In contrast, only 15% of the predicted proteins generated by using Genscan show more than 98% identity to the actual proteins, and only 69% show more than 85% identity.

Genscan and DECODER can be used together. Genscan is well-suited to finding exons and portions of the CDS that lack ATG and/or a stop codon. However, frameshift error drastically reduces the prediction accuracy of Genscan. If DECODER is used to remove the frameshift error in the test sequence, then Genscan can be applied to the modified test sequence. We found that, when Genscan is applied to DECODER-corrected sequences, 73% of the predicted amino acid sequences show more than 85% identity to the actual proteins.

Of the 612 cDNA sequences that compose the reference sequence set, 524 (86%) carry 773 frameshift errors in the CDS. DECODER corrected 48% of the total frameshift errors in 62% of the 524 cDNA sequences that had frameshift error. The false positive rate of frameshift correction is 9%, and 91% of the suggested frameshifts were true.
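The two reported quantities can be illustrated with a small sketch. It assumes that a suggested frameshift counts as true only when its position matches a known error position exactly, and that the false positive rate is the fraction of suggested frameshifts that are not true (the complement of the 91% figure). Both the matching rule and the example positions are hypothetical.

```python
def frameshift_metrics(true_positions, suggested_positions):
    """Return (fraction of true errors corrected, fraction of suggestions that are false)."""
    true_set = set(true_positions)
    suggested = set(suggested_positions)
    hits = len(true_set & suggested)
    corrected = hits / len(true_set) if true_set else 0.0
    false_fraction = (len(suggested) - hits) / len(suggested) if suggested else 0.0
    return corrected, false_fraction


# HYPOTHETICAL positions for one cDNA: four true errors, three suggestions.
corrected, false_fraction = frameshift_metrics([120, 455, 910, 1300], [120, 455, 700])
print(corrected)       # 0.5
print(false_fraction)  # one of the three suggestions is false
```

Under this reading, a program can have a low false positive rate while still missing a large share of the true frameshifts, which matches the combination of 48% corrected errors and 9% false positives reported here.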

Table 5 shows the prediction accuracy of DECODER for the experimental sequence data. The identity of the proteins predicted by DECODER exceeds the threshold (85%) for 8 of the 17 sequences tested. For the first 14 sequences, DECODER predicts a protein that is more closely identical to the actual product than the longest frame. In the three remaining cases, the amino acid sequence predicted by DECODER is longer than the true amino acid sequence. Hence, DECODER is likely to overestimate the length of the CDS.

One drawback of our procedure is that it is very time-consuming. In the presented test of 612 sequences, the processing time of DECODER was 45 s per sequence on a DEC Alpha personal workstation (600 MHz). If the size of the sequence is *M* bp, then the computational time of the conventional method of exon prediction is proportional to *M*. In contrast, the computational time of our method is proportional to *M*^{2} when two artificial frameshifts are used in the trial-and-error procedure. Thus the conventional method can be used for genomic as well as cDNA sequences. Because the computational time of our method grows with the square of the length of the sequence evaluated, our method is not well-suited to analysis of genomic sequences. However, cDNAs are typically shorter than genomic sequences. Therefore, our method realistically can be applied to studying cDNA sequences, for which purpose we created the DECODER program.

We also compared the results of DECODER and ESTScan, a program that can detect coding regions in DNA sequences even if they carry frameshifts and that can also detect and correct frameshift errors. Because we used ESTScan on a web site of the European Molecular Biology network for this comparison (**http://www.ch.embnet.org/software/ESTScan.html**), a test set smaller than the 612 sequences of the previous test data set was desirable. The new test subdata set was composed of 50 sequences randomly selected from the 612 sequences of the test data set. The average length of the cDNAs in the test subdata set was 2,169 bp. Of the 50 cDNA sequences that composed the reference sequence set, 47 (86%) carry 56 frameshift errors in the CDS. The size and error rate of the test subdata set reflect those of the original data set of 612 sequences.

ESTScan could not detect a CDS for 3 (6%) of the 50 sequences. DECODER predicted a CDS for all of the sequences, because this program is not an exon prediction program but rather reports the most probable CDS in any sequence. ESTScan and DECODER corrected 86% and 54% of the total frameshift errors, and the false positive rates of frameshift corrections were 44% and 7%, respectively. On average, the amino acid sequences predicted by ESTScan and DECODER were 80% and 88% identical to the correct amino acid sequences, respectively. ESTScan predicted only 4% of the start codons correctly (the author of ESTScan mentions this problem at the web home page), whereas DECODER predicted 80% of the start codons correctly.

## CONCLUSIONS

The amino acid translation program DECODER is useful for evaluating full-length cDNA sequences with experimental frameshift errors early, before the completion of a given project. This program can suggest the position of a frameshift, predict an amino acid sequence, and evaluate the likelihood that the deduced amino acid sequence is that of the actual protein product.

To remove frameshift error, the program tries to make insertions and/or deletions in the experimental sequence, and the candidate CDS is predicted for every trial. If the trial insertion/deletion removes the frameshift error, then a CDS-like ORF emerges. The software chooses the sequence most likely to represent that of the actual protein product. This likelihood is reflected as the V_{a} score, which is calculated on the basis of the Kozak consensus, the preferred codon usage, and the position of the initiation codon. The more likely the predicted sequence is the actual sequence, the lower the V_{a} score.

Our data demonstrate that DECODER shows high accuracy for predicting CDS and frameshift error of full-length cDNA. With respect to the predicted amino acid sequence, Genscan yields almost equivalent results to those of DECODER at low thresholds of identity (≤85%). However, when a higher accuracy of the predicted amino acid sequence is required, DECODER achieves a much better prediction score than does Genscan.

Using DECODER and Genscan in concert capitalizes on the relative strengths of these methods. DECODER can remove the frameshift error of unadjusted cDNA sequences; the corrected cDNA sequences are then submitted to the amino acid prediction algorithm of Genscan, which does not require as much computation time as DECODER does. When Genscan predicts proteins for DECODER-corrected cDNAs, 73% of the predicted proteins show more than 85% identity to the actual proteins. DECODER can detect and remove 48% of the total frameshift errors. Thus 62% of cDNA with frameshift error were detected, and at least one of the frameshift errors was corrected. The false positive rate of frameshift correction is 9%, and 91% of the suggested frameshifts were true.

Farabaugh et al. (7, 12) suggested that some RNA sequences program the ribosome to alter the reading frame efficiently to allow for the expression of alternative translational products. Sites that cause the ribosome to shift frames, termed programmed frameshift sites, occur in organisms from bacteria to higher eukaryotes. Medigue et al. (11) reported that their frameshift detection program detected these natural frameshifts. Distinguishing between natural and artificial frameshifts is difficult, and neither DECODER nor any other exon-prediction method overcomes this shortcoming.

ESTScan overestimates the frameshift errors, whereas DECODER underestimates them. DECODER therefore shows a lower false positive rate than ESTScan, and ESTScan shows a lower false negative rate of frameshift detection than DECODER. Note that there are parameters for modifying the selectivity and sensitivity in both ESTScan and DECODER, and the result depends on these parameters. The significant difference between these two programs is that the purpose of ESTScan is detection of CDS in DNA sequences and that of DECODER is prediction of amino acid sequence and frameshift errors in full-length cDNA. If it is unknown whether the DNA sequence is full-length cDNA, then ESTScan can be applied. In this case, DECODER cannot be applied, since DECODER is not an exon prediction program. If the sequence is known to be a full-length cDNA, then DECODER can show better amino acid prediction results than Genscan and ESTScan.

DECODER can be applied to the forthcoming uncorrected full-length cDNA sequences provided by the RIKEN Encyclopedia Project and the Mammalian Gene Collection of the National Institutes of Health. In the transcriptome analysis, the highest priority is given to the comprehensive collection of unadjusted cDNAs, especially the collection of de novo transcriptional sequences. The CDS and the initiation codon in the cDNA sequence can be predicted by comparing the unadjusted full-length cDNA sequence with the genomic DNA sequence. DECODER can be modified by using this information. The current trial-and-error algorithm of DECODER is sufficiently time-consuming that using it to analyze genomic sequences is impractical. To address this disadvantage, an alternative choice of algorithm for the future is the hidden Markov model, which can be robust against the frameshift error.

## Acknowledgments

We thank Dr. C. Iseli for suggestions. We thank Shiro Fukuda and Hiroshi Minami for support, and we thank the members of the RIKEN Genome Science Center for the data preparation.

This study has been supported by Special Coordination Funds and a Research Grant (to Y. Hayashizaki) for the RIKEN Genome Exploration Research Project, Core Research for Evolutional Science and Technology (CREST), and Research and Development for Applying Advanced Computational Science and Technology (ACT-JST) of Japan Science and Technology Corporation (JST) from the Science Technology Agency of the Japanese Government. This work also was supported by a Grant-in-Aid for Scientific Research on Priority Areas and the Human Genome Program from the Ministry of Education, Science, and Culture of Japan and by a Grant-in-Aid for a Second-Term Comprehensive 10-Year Strategy for Cancer Control from the Ministry of Health and Welfare of Japan (to Y. Hayashizaki).

## Footnotes

Article published online before print. See web site for date of publication (http://physiolgenomics.physiology.org).

Address for reprint requests and other correspondence: Y. Fukunishi, 1-1, Higashi/Tsukuba, Ibaraki 305-0074, Japan (E-mail: rgscerg{at}gsc.riken.go.jp).

- Copyright © 2001 the American Physiological Society