Here we present a novel web tool for the statistical analysis of gene expression data in multiple tag sampling experiments. Differentially expressed genes are detected by using six different test statistics. Result tables, linked to the GenBank, UniGene, or LocusLink database, can be browsed or searched in different ways. The software is freely available at http://telethon.bio.unipd.it/bioinfo/IDEG6_form/, together with additional information on the statistical methodologies.
- expressed sequence tags
- serial analysis of gene expression
- statistical test
Identification of differentially expressed genes in human tissues is relevant not only for its intrinsic biological significance but also for discovering potential pharmaceutical targets and diagnostic or prognostic markers.
Tissue-specific gene expression levels can be estimated by using frequencies of gene transcripts in unbiased cDNA libraries or cDNA samples (5). According to this view, the study of genomic expression was attempted by serial analysis of gene expression (SAGE) (10) and by cDNA array technology (6). Public domain data on cDNA libraries can be accessed at the UniGene database (8), whereas SAGE data are deposited in the Gene Expression Omnibus Database (3, 11).
When large amounts of expression data are analyzed, the main problem is the sensitivity and specificity of the statistical tests used to detect which genes are differentially expressed in different conditions.
In recent years, several test statistics have been developed for detecting differentially expressed genes in multiple tag experiments (2) [quantitative data based on expressed sequence tag (EST) sequencing or on SAGE libraries]: the Audic and Claverie test statistic (1), Fisher’s exact test [used in digital differential display (DDD) at the Cancer Genome Anatomy Project (CGAP)] (2), the Greller and Tobin test (4) for comparing expression levels in more than two libraries, and the likelihood ratio test statistic R (9).
In a recent work we applied a simulation approach to verify which tests are more sensitive and able to detect a larger number of true positives in different conditions (7). The general χ2 test was shown to be fairly efficient in multiple tag sampling experiments, especially when dealing with variations affecting weakly expressed genes, whereas the method of Audic and Claverie was shown to be the most adequate for detecting differences in gene expression in pairwise comparisons.
The use of statistical tests for the detection of differentially expressed genes is of crucial importance both for identifying a set of deregulated genes in two or more conditions and for managing and organizing large amounts of data.
Here we present a public domain web tool for detecting differentially expressed genes in multiple tag sampling experiments. The user can set the parameters of the statistical analysis through a simple web form. Result tables are displayed with cells reporting significant P-values shaded in yellow and cells reporting normalized tag counts shaded in different intensities of blue.
Furthermore, each gene is linked to different genomic databases. Results can be browsed and searched in different ways to extract relevant information.
Detailed information on form setting, statistical theory, and result interpretation is available on dedicated help pages, along with a specimen analysis.
The algorithms of the statistical tests for the detection of differentially expressed genes were implemented in a C program, which allows these analyses to be run in parallel on a numerical matrix of gene expression data. The output is a matrix-like table of P-values giving, for each gene, the probability of differential expression according to the different test statistics.
Three tests, listed first, are suitable for pairwise comparisons of gene expression levels in different tissues or conditions, whereas the other three, listed next, perform multiple comparisons.
1) The Audic and Claverie test gives the conditional probability of observing y tags in library B, given that x tags have been observed in library A, under the hypothesis that the gene considered is equally expressed in conditions A and B; NA and NB are the total numbers of tags in libraries A and B, respectively. The smaller the conditional probability, the stronger the evidence that the gene is differentially expressed between the two libraries.
2) The Fisher exact test considers a typical two-way contingency table for tag sampling experiments, in which the number of tags of a given gene in two libraries or conditions is tested against the total number of tags for all the remaining genes in the data set.
3) The chi-squared test (χ2) for pairwise comparisons, adopting the χ2 asymptotic distribution for the calculation of the P value, is applied to two-way contingency tables.
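The three pairwise tests can be sketched in a few lines of Python. This is an illustrative sketch only, not the C implementation used by IDEG6: the function names are hypothetical, the Audic and Claverie expression is the commonly quoted form p(y|x) = (NB/NA)^y (x+y)! / [x! y! (1 + NB/NA)^(x+y+1)], and the two-sided rule used for the Fisher exact test (summing all tables no more probable than the observed one) is an assumption, since the text does not state which tail convention is applied.

```python
import math


def audic_claverie_prob(x, y, n_a, n_b):
    """Conditional probability p(y | x) under equal expression in A and B."""
    ratio = n_b / n_a
    log_p = (y * math.log(ratio)
             + math.lgamma(x + y + 1) - math.lgamma(x + 1) - math.lgamma(y + 1)
             - (x + y + 1) * math.log1p(ratio))
    return math.exp(log_p)


def _log_comb(n, k):
    """Logarithm of the binomial coefficient C(n, k)."""
    return math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)


def fisher_exact_two_sided(x, y, n_a, n_b):
    """Two-sided Fisher exact P value for the table [[x, y], [n_a-x, n_b-y]]."""
    k = x + y  # tags of this gene over both libraries

    def log_prob(a):
        # Hypergeometric probability of a gene tags falling in library A.
        return _log_comb(n_a, a) + _log_comb(n_b, k - a) - _log_comb(n_a + n_b, k)

    observed = log_prob(x)
    lo, hi = max(0, k - n_b), min(k, n_a)
    p = sum(math.exp(log_prob(a)) for a in range(lo, hi + 1)
            if log_prob(a) <= observed + 1e-9)
    return min(1.0, p)


def chi_squared_pairwise(x, y, n_a, n_b):
    """Chi-squared statistic and asymptotic P value (1 degree of freedom)."""
    if x + y == 0:
        return 0.0, 1.0
    observed = [[x, y], [n_a - x, n_b - y]]
    row_sums = [x + y, n_a + n_b - x - y]
    col_sums = [n_a, n_b]
    total = n_a + n_b
    stat = 0.0
    for i in range(2):
        for j in range(2):
            expected = row_sums[i] * col_sums[j] / total
            if expected > 0:
                stat += (observed[i][j] - expected) ** 2 / expected
    # Survival function of the chi-squared distribution with 1 d.f.
    return stat, math.erfc(math.sqrt(stat / 2.0))


if __name__ == "__main__":
    # Hypothetical counts: 30 tags in library A, 10 in library B.
    print(audic_claverie_prob(30, 10, 1000, 1000))
    print(fisher_exact_two_sided(30, 10, 1000, 1000))
    print(chi_squared_pairwise(30, 10, 1000, 1000))
```

Working with logarithms of factorials (through `math.lgamma`) keeps the calculation stable for libraries containing thousands of tags, where the factorials themselves would overflow floating-point arithmetic.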
1) The R statistic is a likelihood ratio test associated with the hypothesis of equal expression level through the libraries. It tends to a χ2 distribution as the number of total tags tends to infinity.
2) The Greller and Tobin test uses a decision function, whose values may be used as ranks for the selection of genes with strong, moderate, or weak evidence of differential expression. It is particularly suited to outlier detection.
3) The chi-squared test (χ2) for multiple comparisons (general chi-squared test) uses contingency tables with more than two libraries; tag numbers per gene in all the considered libraries are compared with the total number of tags of all the remaining genes over all libraries. It uses a χ2 asymptotic distribution for the calculation of the P value.
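Two of the multiple-comparison statistics lend themselves to a short sketch. The code below is an assumption-laden illustration, not the IDEG6 source: the function names are hypothetical, and R is written in the commonly cited form R = Σ x_i ln[x_i / (N_i f)], where x_i is the tag count of the gene in library i, N_i the library size, and f the pooled frequency of the gene over all libraries. The Greller and Tobin decision function is not sketched here.

```python
import math


def r_statistic(counts, totals):
    """Likelihood ratio statistic R for one gene across several libraries."""
    pooled = sum(counts) / sum(totals)  # frequency under equal expression
    r = 0.0
    for x, n in zip(counts, totals):
        if x > 0:  # a zero count contributes nothing to the sum
            r += x * math.log(x / (n * pooled))
    return r


def general_chi_squared(counts, totals):
    """Chi-squared statistic and degrees of freedom for a 2 x n table.

    Row 1 holds the tags of the gene in each library; row 2 holds the
    tags of all remaining genes in the same library.
    """
    gene_total = sum(counts)
    grand_total = sum(totals)
    stat = 0.0
    for x, n in zip(counts, totals):
        expected_gene = n * gene_total / grand_total
        expected_rest = n - expected_gene
        if expected_gene > 0:
            stat += (x - expected_gene) ** 2 / expected_gene
        if expected_rest > 0:
            stat += ((n - x) - expected_rest) ** 2 / expected_rest
    return stat, len(counts) - 1


if __name__ == "__main__":
    # Hypothetical gene observed in three libraries of different size.
    counts = [25, 6, 0]
    totals = [12500, 6000, 5800]
    print(r_statistic(counts, totals))
    print(general_chi_squared(counts, totals))
```

Both functions return only the statistic; converting it to a P value requires the χ2 survival function, which could be taken from a statistics library (for example `scipy.stats.chi2.sf`, if SciPy is available).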
A Perl common gateway interface (CGI) script is associated with the form: it parses and checks the input file or data; it then invokes the C program implementing the test statistics and builds the results both as text files to be downloaded later by the user and as HTML pages for immediate visualization through the web. Result pages contain a summary of the statistical analysis performed and of the chosen conditions, as well as a summary of the results obtained and links to genomic databases.
Moreover, each result page contains links to a dedicated CGI tool that allows the user to browse and search result tables in several different ways.
RESULTS AND DISCUSSION
The IDEG6 web interface allows six different statistical analyses to be run for the detection of differentially expressed genes in multiple tag experiments. Tutorial pages that help the user fill in the form are available at the same web site, providing detailed information about the test statistics and a sample analysis. In addition, bibliographic references on the software, the statistical theory, and biologically relevant information are included.
The input file is a matrix of gene expression data, with m rows (genes) and n columns (different tissues, libraries, or conditions). The first line specifies the total tag count per column. Each gene in the matrix, identified by accession number and description, can be hyperlinked to the corresponding entry in the current (latest) version of the selected database. In addition, a significance threshold may be selected for each test.
The output is a list of genes detected as differentially expressed by at least one selected test. In addition, the complete list of the input genes and of the calculated P-values can be downloaded in HTML or in plain text format.
In case of incorrect input or inconsistent settings (e.g., missing data or nonnumeric values), alert messages help the user to fill in the form correctly.
The form has several fields to be filled in (Fig. 1A), some of which are mutually exclusive. In particular the user should 1) paste data into a text box or upload the input data matrix; 2) select at least one of the six available test statistics; for each test, a significance threshold must be chosen among values ranging from 0.05 to 0.00001; 3) opt for a database, among GenBank, UniGene, or LocusLink, to which the genes considered will be linked, according to the accession number supplied by the user as the identifier of genes in the original matrix; 4) choose whether to apply the Bonferroni correction to the threshold levels.
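The interplay between per-test thresholds, the optional Bonferroni correction, and the rule that a gene is reported when at least one selected test is significant can be illustrated as follows. This is a hedged sketch: the function name, the test labels, and the gene identifiers are hypothetical, and dividing each threshold by the number of genes is an assumption about how the correction is applied, since the text does not specify the divisor.

```python
def select_differential(p_table, thresholds, bonferroni=False):
    """Return the genes significant for at least one selected test.

    p_table    : {gene_id: {test_name: p_value}}
    thresholds : {test_name: significance threshold chosen for that test}
    bonferroni : if True, divide each threshold by the number of genes
                 (assumed divisor; not confirmed by the source text)
    """
    n_genes = len(p_table)
    selected = {}
    for gene, p_by_test in p_table.items():
        for test, p in p_by_test.items():
            cutoff = thresholds[test]
            if bonferroni:
                cutoff /= n_genes
            if p <= cutoff:
                selected[gene] = p_by_test
                break  # one significant test is enough
    return selected


if __name__ == "__main__":
    # Hypothetical P values for three genes and two selected tests.
    p_table = {
        "gene_1": {"AC": 0.0001, "chi2": 0.2},
        "gene_2": {"AC": 0.3, "chi2": 0.04},
        "gene_3": {"AC": 0.5, "chi2": 0.6},
    }
    thresholds = {"AC": 0.001, "chi2": 0.05}
    print(sorted(select_differential(p_table, thresholds)))
    print(sorted(select_differential(p_table, thresholds, bonferroni=True)))
```

With the correction enabled, fewer genes pass because every threshold becomes stricter as the number of genes tested grows.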
Lines of the input file must be space delimited and must contain the following data types, in this order: gene accession number, gene description, and the number of tags for each library. A mandatory first line must be “UNIQID Description Tot_TAG_lib_1 … Tot_TAG_lib_n”, where “Tot_TAG_lib_i”, with i=1,…,n, is the total number of tags in the i-th library. The total number of tags per library will be used for normalizing the gene expression values.
In an example of a small input file, the file has 3 genes with UniGene accession number and description; the total number of tags is 12,500 for the first library, 6,000 for the second, and 5,800 for the third.
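To make the format concrete, the sketch below parses a small matrix laid out as described and normalizes the tag counts by library size. Only the three library totals come from the text; the gene identifiers, descriptions, and counts are invented placeholders, the function names are hypothetical, and the scaling factor of 10,000 tags is an assumption, because the normalization constant used by IDEG6 is not stated here. Descriptions are written without spaces on the assumption that a space-delimited file requires it.

```python
# Placeholder data: only the totals 12500, 6000, and 5800 come from the text.
SAMPLE = """\
UNIQID Description 12500 6000 5800
Hs.0000001 hypothetical_gene_a 25 6 0
Hs.0000002 hypothetical_gene_b 5 12 29
Hs.0000003 hypothetical_gene_c 100 48 46
"""


def parse_matrix(text):
    """Parse a space-delimited expression matrix into totals and gene rows."""
    lines = [ln.split() for ln in text.strip().splitlines() if ln.strip()]
    header, rows = lines[0], lines[1:]
    if header[:2] != ["UNIQID", "Description"]:
        raise ValueError("first line must start with 'UNIQID Description'")
    totals = [int(v) for v in header[2:]]  # int() rejects nonnumeric values
    genes = []
    for fields in rows:
        if len(fields) != len(totals) + 2:
            raise ValueError("wrong number of columns for gene %s" % fields[0])
        genes.append((fields[0], fields[1], [int(v) for v in fields[2:]]))
    return totals, genes


def normalize(counts, totals, scale=10000):
    """Express each count as tags per `scale` tags of its library."""
    return [scale * x / n for x, n in zip(counts, totals)]


if __name__ == "__main__":
    totals, genes = parse_matrix(SAMPLE)
    for gene_id, description, counts in genes:
        print(gene_id, description, normalize(counts, totals))
```

Raising an error on a malformed header, a short row, or a nonnumeric value mirrors the kind of input checking that the alert messages of the web form report to the user.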
The first part of the results obtained by IDEG6 (Fig. 1B) consists of a summary of the user's choices: the selected test statistics, each with its associated (possibly corrected) significance threshold, and the number of genes and of libraries in the input file. Then, a table reporting only the differentially expressed genes is shown (Fig. 1B). For each gene, a row reports the identification number, linked to the corresponding entry in the selected database, the gene description, the raw and normalized tag counts in each tissue/condition, and the P-values calculated for the selected test statistics.
The cells reporting normalized tag values are shaded in different intensities of blue representing the extent of gene expression. Cells reporting significant P-values are shaded in yellow.
A similar data table showing results for all the input genes, regardless of whether they are significantly differentially expressed, can be downloaded in HTML or in tab-delimited plain text format.
The system was designed to help the user extract relevant information even if the data set is considerably large. Results pertaining to differentially expressed genes and results for the complete set of input genes are stored in separate pages. A search tool allows the user to find specific genes or to browse through result tables, generating subtables. Identification numbers can be used to search for a specific gene, whereas keywords can be used to retrieve groups of genes with similar function or belonging to a gene family. For example, one could search for all the genes coding for kinases that are differentially expressed among at least two input tissues or conditions.
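The kind of subtable generation described above can be pictured as a simple filter over result rows. The sketch below is purely illustrative and does not reproduce the CGI search tool: the function name, the row layout, and the example entries are hypothetical.

```python
def search_rows(rows, gene_id=None, keyword=None):
    """Return the subtable of rows matching an identifier and/or a keyword.

    rows    : list of dicts with at least the keys "id" and "description"
    gene_id : exact identifier to match, or None to skip this filter
    keyword : case-insensitive substring of the description, or None
    """
    hits = []
    for row in rows:
        if gene_id is not None and row["id"] != gene_id:
            continue
        if keyword is not None and keyword.lower() not in row["description"].lower():
            continue
        hits.append(row)
    return hits


if __name__ == "__main__":
    # Hypothetical result rows.
    results = [
        {"id": "Hs.0000001", "description": "hypothetical_protein_kinase_a"},
        {"id": "Hs.0000002", "description": "hypothetical_transporter_b"},
        {"id": "Hs.0000003", "description": "hypothetical_tyrosine_Kinase_c"},
    ]
    print([r["id"] for r in search_rows(results, keyword="kinase")])
    print([r["id"] for r in search_rows(results, gene_id="Hs.0000002")])
```

Applying the same filter to the table of differentially expressed genes, rather than to the complete table, would give the kinase example mentioned in the text.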
The considerable number of large-scale genomic studies is enormously increasing the amount of data available on gene expression. Data from tag sampling experiments, such as cDNA library construction, systematic sequencing, or SAGE analysis, are available for different organisms. Human tissues under normal and diseased conditions are heavily represented, and an increasing amount of data is also being produced on model organisms important for genetics and genomics. Furthermore, the study of gene expression in animal or plant species with economic potential is contributing to the field.
This strong effort in data production is not matched by the availability of free and versatile software for data analysis, which would allow the selection of genes with significant expression patterns and the management of genetic information on thousands of genes.
IDEG6 is a free web tool for the identification of differentially expressed genes in multiple tag sampling experiments and for the organization and management of such large amounts of data. The possibility to select different test statistics, characterized by variable conservativeness, and the versatile setting of several search parameters allow one to handle even very large data sets and, in particular, to reduce the dimensionality of gene expression data for subsequent analysis. These analyses, based on genetic/functional properties of genes and of their products, are facilitated by the linking of the results to different databases and by an advanced search tool.
We gratefully acknowledge the financial support of the Italian Ministry of Education, University and Scientific Research (to G. A. Danieli) and of the Italian Association for Cancer Research (to S. Bortoluzzi).
Article published online before print. See web site for date of publication (http://physiolgenomics.physiology.org).
Address for reprint requests and other correspondence: C. Romualdi, Univ. of Padua, CRIBI Biotechnology Centre via G. Colombo 3, 35131 Padua, Italy (E-mail:).
- Copyright © 2003 the American Physiological Society