Physiological Genomics

Increased measurement accuracy for sequence-verified microarray probes

Brigham H. Mecham, Daniel Z. Wetmore, Zoltan Szallasi, Yoel Sadovsky, Isaac Kohane, Thomas J. Mariani


Microarrays have been extensively used to investigate genome-wide expression patterns. Although this technology has been tremendously successful, it has suffered from suboptimal individual measurement precision. Significant improvements in this respect have been recently made. In an effort to further explore the underlying variability, we have attempted to globally assess the accuracy of individual probe sequences used to query gene expression. For mammalian Affymetrix microarrays, we identify an unexpectedly large number of probes (greater than 19% of the probes on each platform) that do not correspond to their appropriate mRNA reference sequence (RefSeq). Compared with data derived from inaccurate probes, we find that data derived from sequence-verified probes show 1) increased precision in technical replicates, 2) increased accuracy translating data from one generation microarray to another, 3) increased accuracy translating data from oligonucleotide to cDNA microarrays, and 4) improved capture of biological information in human clinical specimens. The logical conclusion of this work is that probes containing the most reliable sequence information provide the most accurate results. Our data reveal that the identification and removal of inaccurate probes can significantly improve this technology.

  • Affymetrix
  • Agilent
  • reference sequence
  • RefSeq

Comprehensive gene expression profiling using microarray technology has facilitated a revolution in the characterization of cellular regulation and shows great potential for human disease diagnostics. Microarrays have been successfully used to identify the targets of transcription factors (7, 41) and secreted regulatory molecules (25, 37). Transcriptional profiling has also led to the identification of molecular mechanisms involved in animal models of disease (13, 42). The technology has been successfully used for the identification and/or classification of disease (6, 8, 18, 36) and has also provided insight into regulatory networks contributing to developmental processes (3, 14, 21, 23, 29, 32, 35, 39).

Commercial microarrays are widely available, and Affymetrix oligonucleotide microarrays (GeneChip technology; Ref. 19) account for a large proportion of previously published studies. GeneChip technology utilizes multiple independent probe hybridization events to measure the expression level of each gene investigated. For each hybridization event, the technology pairs a 25-nt oligomer (25-mer) with a corresponding single nucleotide mismatch 25-mer to measure specificity. The individual 25-mers are derived from publicly available nucleotide sequence information.

There has been tremendous success applying microarray technology to disease diagnostic applications. For instance, multiple groups have shown that microarray data can identify previously unappreciated molecular subtypes of lung cancer that differ in their prognoses (1, 2, 5). Unfortunately, poor reproducibility of results exists across these studies. This indicates either an underlying distinction in the nature of the disease investigated or, more likely, a limitation of the technology in reliably capturing the underlying biology. Microarray technology has been criticized for a lack of individual measurement accuracy. However, the technology is rapidly advancing, and improvements in reagents (11) and data analysis (12, 16) have increased measurement precision. Some sources of noise, such as those due to hybridization intensity differences, are systematic and have been successfully defined (4, 9, 12, 15, 20, 33, 40).

As a tremendous volume of data has been generated (particularly from human clinical specimens, which cannot be duplicated), strategies to improve the analysis of (“clean up”) existing data sets are of great value. One limitation of the application of this technology could be the failure of similar studies to measure identical biological parameters. For instance, Sorlie et al. (31) recently showed that limiting analysis to probes verified to query identical UniGene clusters improves the concordance of results from one microarray data set to another.

Probe sequence inaccuracies are known to exist for both oligonucleotide and cDNA microarrays. However, there is a general lack of information regarding the scope of probe sequence inaccuracies on currently available Affymetrix platforms. In this study we report a global analysis of the probes used by Affymetrix technology, where we have systematically attempted to confirm the accuracy of individual probe sequences. We find that for a significant number of probe sets, on both old and current platforms, the probe sequences do not perfectly correspond with the appropriate mRNA as defined by the reference sequence (RefSeq). Given the approach Affymetrix uses to determine true mRNA hybridization from background (e.g., single nucleotide mismatch), any sequence discrepancy likely renders the probe uninformative. Furthermore, we report that data derived from sequence-verified probes show vastly improved precision. Therefore, removing information from inaccurate probes should significantly improve the validity of results from this technology.


MATERIALS AND METHODS

Database Information and Probe Verification

All mRNA sequences were retrieved from the NCBI UniGene molecular database. UniGene builds 162 (Sept. 16, 2003), 131 (Dec. 4, 2003), and 125 (Dec. 5, 2003) were used for the human, mouse, and rat genomes, respectively. Affymetrix probe set annotation information was obtained from the Oct. 15, 2003, release of NetAffx (17). Annotation information included individual probe sequences as well as the RefSeq identifier used to relate probe sets to mRNAs. A RefSeq represents a curated, nonredundant transcript sequence (26, 27). The locations of all Affymetrix probe sequences within their corresponding mRNAs were identified with ProbeMapper, a Perl script developed specifically for this procedure. “Perfect match” (PM) and “mismatch” (MM) probe sequences that exactly matched a subsequence of the corresponding RefSeq were considered “verified.” Probe sets were considered verified if at least one probe sequence was an exact match with the corresponding RefSeq. We chose this relaxed classification scheme for probe sets to account for the incompleteness of 5′ and 3′ untranslated regions (UTR) in the RefSeq database. Agilent probe information, including the GenBank sequence ID, was provided by the manufacturer. All outdated Agilent clones were removed prior to analysis.
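As an illustration of this verification logic (a minimal sketch, not the authors' ProbeMapper implementation), the following Python function marks a 25-mer probe as verified only if it occurs exactly in the RefSeq transcript assigned to its probe set, records reverse-complement hits separately, and calls a probe set verified when at least one probe matches; the function names and data structures are hypothetical.

```python
# Illustrative verification logic (not the authors' ProbeMapper script):
# a probe is "verified" if its 25-mer occurs exactly in the RefSeq mRNA
# assigned to its probe set; a probe set is verified if >= 1 probe matches.

def reverse_complement(seq):
    comp = {"A": "T", "T": "A", "G": "C", "C": "G", "N": "N"}
    return "".join(comp[base] for base in reversed(seq.upper()))

def verify_probe_set(probe_seqs, refseq_mrna):
    """probe_seqs  -- list of 25-nt PM probe sequences for one probe set
    refseq_mrna -- RefSeq transcript sequence assigned to that probe set"""
    refseq_mrna = refseq_mrna.upper()
    flags = []
    for probe in probe_seqs:
        probe = probe.upper()
        # Exact sense-strand match is required for verification; reverse-
        # complement hits are recorded so antisense probes can be flagged.
        flags.append({
            "sense": probe in refseq_mrna,
            "antisense": reverse_complement(probe) in refseq_mrna,
        })
    verified = any(f["sense"] for f in flags)
    return flags, ("verified" if verified else "unverified")
```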

Data Processing and Probe Matching

For each Affymetrix microarray experiment, array image files were analyzed with Affymetrix Microarray Suite ver. 5 (MAS5) software, and signal intensities were calculated for MAS5, robust multi-array averaging (RMA) (12), and DNA-Chip Analyzer (dChip) (16) with the aid of Bioconductor software. Expression values for Agilent cDNA data were calculated using the manufacturer’s standard software and normalization procedures. Correlations were calculated using MATLAB, and clustering diagrams were generated by the TIGR Multiexperiment Viewer (MeV) (28) software package.

When comparing measurements across generations of Affymetrix platforms, probe sets were matched if they contained at least one probe that corresponds to the same UniGene. When comparing measurements between Agilent and Affymetrix technologies, probe sets were matched if the Affymetrix probe set was for a UniGene containing an Agilent cDNA sequence.
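The matching step described above can be sketched as follows, assuming the annotation tables have already been reduced to dictionaries mapping probe set IDs to UniGene cluster IDs; the helper is illustrative and not part of the original analysis pipeline.

```python
# Illustrative cross-platform matching: two probe sets are comparable when
# they query the same UniGene cluster (annotations reduced to dictionaries).

def match_probe_sets(platform_a, platform_b):
    """platform_a, platform_b -- dicts of {probe_set_id: unigene_id}.
    Returns (probe_set_a, probe_set_b) pairs sharing a UniGene cluster."""
    by_unigene = {}
    for probe_set, unigene in platform_b.items():
        by_unigene.setdefault(unigene, []).append(probe_set)
    return [
        (probe_set_a, probe_set_b)
        for probe_set_a, unigene in platform_a.items()
        for probe_set_b in by_unigene.get(unigene, [])
    ]
```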

Microarray Data Sets and Analysis Details

Affymetrix technical replicate experiment.

The microarray data set used for this analysis has been previously described (20) and is available at the NCBI Gene Expression Omnibus (GEO) (GSE1302). Three independent mRNA samples were studied on the Hu95v2 platform (subarrays A-E). For each sample, a single mRNA target was generated and analyzed on five replicate sets of microarrays (U95Av2, U95Bv2, …, U95Ev2). Therefore, for each subarray tested, the three conditions produced three independent sets of five technical replicates. Each set of five replicates generated ten comparisons (e.g., replicate 1 vs. 2, 1 vs. 3, …, 4 vs. 5). The probe sets on each platform were classified as verified or unverified as described above. Pearson and Spearman correlations were calculated for each set of ten replicate comparisons, separately for the verified and unverified probe sets, using MAS5 signal intensities.
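A minimal sketch of this replicate analysis is given below, assuming the MAS5 signal intensities have been exported into a NumPy array with one column per replicate and a boolean mask marking RefSeq-verified probe sets; SciPy provides the correlation functions, and the variable names are illustrative.

```python
# Illustrative replicate analysis: 5 technical replicates give C(5,2) = 10
# pairwise comparisons, each scored with Pearson and Spearman correlations,
# separately for verified and unverified probe sets.
from itertools import combinations

from scipy.stats import pearsonr, spearmanr

def replicate_correlations(signal, verified_mask):
    """signal        -- array (n_probe_sets, n_replicates) of MAS5 intensities
    verified_mask -- boolean array marking RefSeq-verified probe sets"""
    results = {"verified": [], "unverified": []}
    for label, mask in (("verified", verified_mask),
                        ("unverified", ~verified_mask)):
        sub = signal[mask]
        for i, j in combinations(range(signal.shape[1]), 2):  # 10 pairs
            r, _ = pearsonr(sub[:, i], sub[:, j])
            rho, _ = spearmanr(sub[:, i], sub[:, j])
            results[label].append((r, rho))
    return results
```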

Breast cancer cell line experiments.

This microarray data set has been previously described (39) and is available at GEO (GSE1299). For this analysis, mRNA from two cancer cell lines and one normal cell line was hybridized to replicate Affymetrix Hu95Av2, Hu133A, and Hu133B arrays and to Agilent Human cDNA arrays. Each replicate experiment was compared with the corresponding experiment on the other platforms (e.g., cancer cell line 1, replicate 1 on the Hu95Av2 platform was compared with cancer cell line 1, replicate 1 on the Hu133A, Hu133B, and Agilent platforms), resulting in six independent comparisons for each platform. Probe sets were matched across Affymetrix platforms and between Affymetrix and Agilent technologies as described above.

When comparing data across Affymetrix platforms, Pearson and Spearman correlation coefficients were calculated from signal intensity data for the matched verified and unverified probe sets for each of the six comparisons. Because Agilent technology reports expression levels as a ratio between two samples, the Affymetrix data had to be transformed before comparing across technologies. Here, the expression level for each Affymetrix probe set was transformed into the log base 2 of the ratio of its signal intensity in a cancer sample to its signal intensity in the normal sample. Pearson and Spearman correlation coefficients were then calculated from these ratio data for both the verified and unverified probe sets.
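The cross-technology transformation amounts to converting matched Affymetrix intensities to log2(cancer/normal) ratios and correlating them with the Agilent log ratios; the sketch below assumes the matched values are already aligned in NumPy arrays and is illustrative rather than the original code.

```python
# Illustrative cross-technology comparison: convert matched Affymetrix
# intensities to log2(cancer/normal) ratios, then correlate them with the
# Agilent two-color log ratios for the same UniGene-matched genes.
import numpy as np
from scipy.stats import pearsonr, spearmanr

def cross_technology_correlation(affy_cancer, affy_normal, agilent_log_ratio):
    """All arguments are 1-D arrays aligned by matched probe set / clone."""
    affy_log_ratio = np.log2(affy_cancer / affy_normal)
    r, _ = pearsonr(affy_log_ratio, agilent_log_ratio)
    rho, _ = spearmanr(affy_log_ratio, agilent_log_ratio)
    return r, rho
```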

Breast cancer tissue experiments.

Both breast cancer studies were performed on the U95Av2 platform and have been previously described. The Dana Farber Cancer Institute (DFCI) study included 101 breast cancer tumors and 8 normal breast tissue samples (30), and the Duke University study included 89 breast cancer tumors (38); both data sets are publicly available. A 198 × 12,626 matrix was generated that contained RMA signal intensity values for each probe set in all samples from both the Duke and DFCI studies. Hierarchical average linkage clustering (2), with the Pearson correlation as the distance metric, was performed on the following three sets of data: 1) the 1,000 probe sets that exhibited the largest standard deviation relative to their mean intensity, 2) the subset of these 1,000 probe sets that are RefSeq-verified, and 3) the subset of these 1,000 probe sets that are unverified. Note that the Affymetrix control probe sets (e.g., AFFX-PheX-M_at) were excluded from the clustering so that no bias was introduced by different processing methods that either center might have used.
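A sketch of this clustering procedure follows, assuming the 198 × 12,626 RMA matrix is available as a NumPy array with probe sets in rows and samples in columns; SciPy's "correlation" distance equals 1 minus the Pearson correlation, so average-linkage clustering on that distance reproduces the scheme described above. Function and variable names are illustrative.

```python
# Illustrative clustering step: keep the 1,000 probe sets with the largest
# SD relative to their mean (coefficient of variation), then cluster the
# samples by average-linkage hierarchical clustering with a Pearson
# correlation distance (1 - r).
import numpy as np
from scipy.cluster.hierarchy import linkage
from scipy.spatial.distance import pdist

def cluster_samples(expr, n_top=1000):
    """expr -- array (n_probe_sets, n_samples) of RMA signal intensities."""
    cv = expr.std(axis=1) / expr.mean(axis=1)        # variability filter
    top = np.argsort(cv)[-n_top:]                    # 1,000 most variable sets
    dist = pdist(expr[top].T, metric="correlation")  # 1 - Pearson correlation
    return linkage(dist, method="average")           # average linkage tree
```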

Significance Testing

Because of hybridization quality variation, the significance of the difference in correlations between sequence-verified and -unverified probe sets was determined using a paired t-test.
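Because each replicate or cross-platform comparison yields one correlation for the verified probe sets and one for the unverified probe sets on the same pair of arrays, the two series are naturally paired; a minimal sketch of the test, using SciPy's paired t-test, is shown below.

```python
# Illustrative significance test: paired t-test on the verified vs. unverified
# correlation coefficients, one pair per replicate or platform comparison.
from scipy.stats import ttest_rel

def paired_significance(verified_corrs, unverified_corrs):
    """Equal-length sequences of correlations, paired by comparison."""
    t_stat, p_value = ttest_rel(verified_corrs, unverified_corrs)
    return t_stat, p_value
```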


RESULTS

Probe Sequence Verification

Inaccuracy of probe sequences has been previously recognized in both custom and commercial microarray platforms. However, an investigation of the magnitude and scope of this problem for Affymetrix technology has not been previously reported. In this study, we performed a systematic evaluation of the accuracy of the sequences of the individual probes on 20 of the most commonly used Affymetrix mammalian microarray platforms available as of June 2003. For each platform, we identified all exact probe sequences within each appropriate UniGene database. For every mRNA sequence in a UniGene database, our algorithm identified all 25-nt subsequences that are identical to a PM, MM, reverse-complement PM, or reverse-complement MM probe sequence. Because of the nature of the technology (e.g., use of single nucleotide mismatch as a control), we only identified probes that exactly match a 25-nt subsequence of a transcript. For every identified probe sequence, we defined its inclusion, location, and orientation in the RefSeq database (26, 27). We used the RefSeq, as it is the most highly curated, publicly available definition of mRNA transcript sequences for entire genomes.

We found that a very high proportion of the individual probes could not be verified as measuring their associated RefSeq (Supplemental Table S1, available at the Physiological Genomics web site).1 Of particular note, the percentages of verified probes on the more recent platforms were not appreciably greater than those for the older platforms: U133A, 72% vs. HuFL, 80%; 430A, 75% vs. Mu11kA, 74%; and 230A, 80% vs. U34A, 81%. Moreover, for each platform that contains a set of arrays (e.g., Hu95A–E), there was always a decrease in the percentage of verified probes on the secondary array(s) (e.g., Hu95B–E, max 47%) compared with that of the primary array in the set (e.g., Hu95A, 72%). Interestingly, platforms with higher verification rates generally contained fewer EST sequence-derived probes.

As Affymetrix technology utilizes “sets” of probes to interrogate a single gene, we investigated the distribution of verified probes within probe sets (Fig. 1). A probe set was classified as 1) entirely verified if it contained only verified probes, 2) entirely unverified if none of its probes were verified, or 3) partially verified if some, but not all, of its probes were verified. For human platforms, the proportion of entirely verified probe sets ranged from 62.5% at best (U133A) to 13.8% at worst (U95D). Again, the older and newer platforms showed similar percentages of entirely verified probe sets. Probe sets were generally either entirely verified or entirely unverified, with only 10% (U133B) to 24.7% (HuFL) being partially verified. Additionally, a large number of the partially verified probe sets contained only one or two unverified probes. For U133A, more than 50% of the partially verified probe sets have 9 or 10 of their 11 probes verified (Supplemental Fig. S1). This distribution of probes within the partially verified probe sets indicates that sets of probes tend to identify their intended target.

Fig. 1.

Proportion of sequence-verified probe sets. The distribution of probe sets on all platforms, classified as 1) entirely verified when all probes corresponded to the RefSeq mRNA sequence (black bars), 2) entirely unverified when none of the probes corresponded to the mRNA sequence (open bars), or 3) partially verified when a subset of probes corresponded to the mRNA (gray bars).

There were numerous reasons for the lack of probe sequence verification. An obvious source is continuously evolving/improving sequence database quality. For example, probe set 220154_at on the U133A platform was designed to query UniGene Hs.443518 and RefSeq NM_020388. As of UniGene build 133, NM_020388 was a 1,022-nt mRNA sequence, and all 11 probes for 220154_at were accurate. This RefSeq has subsequently been updated to a 9,348-nt mRNA sequence with no significant similarity to the older sequence (as defined by BLAST). As a result, the probes for 220154_at no longer measure their intended target. Presently, 4 of the 11 probes measure NM_015548, an alternative transcript for Hs.443518. An additional source of inaccuracy is improper annotation of where transcription starts and ends. For example, probe set 202172_at on the U133A platform is intended to monitor NM_007146. However, its probes map to a region ∼1,800–2,200 bp 3′ of the transcribed region for NM_007146. Other inaccuracies arise from probes derived from the noncoding strand. Probe set 219533_at on the U133A platform was designed to monitor NM_000076; however, all 11 probes measure the reverse complement of this sequence. As stated above, not all probe sets are composed of entirely verified or unverified probes. One example is probe set 208256_at on the 133A platform, which was designed to measure NM_001405. Currently, four of its probes measure the sense strand, two measure the antisense strand, three are found downstream of the 3′ end of the coding sequence, and the remaining two cannot be identified in the transcript sequence or in the region 10 kb on either side of the transcribed region. This probe set exemplifies the various problems that can occur as a result of database evolution, improperly defined sequence boundaries, and confusion over sense and antisense transcripts. Complete probe sequence mapping information, and verification results for individual probe sets on all Affymetrix platforms described here, is available for academic use (Supplemental Fig. S2).

Replicate Precision of Verified Probe Sets

Although we may have decreased confidence in the reliability of unverified probes, these probes would not necessarily show a difference in measurement precision. We sought to test whether precision was related to probe sequence accuracy using a previously published replicate microarray data set (20), in which three mRNA targets were each hybridized to five individual microarrays on each of the U95 (A, B, C, D, and E) platforms. The three conditions produced three sets of five technical replicate data sets, with each set of replicates generating ten comparisons (e.g., replicate 1 vs. replicate 2, 1 vs. 3, … 4 vs. 5). Verified probe sets were defined as those containing at least one verified probe, and unverified probe sets as those containing no verified probes. We focused on comparing the behavior of these two groups. There are two major benefits of this approach: 1) it provides a measure of “quality” for both the verified and unverified probe sets, and 2) the independence of the verified and unverified sets allows a statistical comparison to be made.

Pearson and Spearman correlation coefficients were calculated for replicate measurements of both verified and unverified probe sets on each subarray using MAS5 data (Table 1 and Supplemental Table S2). For data generated on U95A, Pearson correlation coefficients ranged from 0.949 to 0.991 for verified probe sets and from 0.883 to 0.987 for unverified probe sets, with P values of 0.014 to 0.000076. Spearman correlation coefficients showed an even greater improvement for verified probe sets. The consistently significant increase in the correlations derived from verified probe sets indicated that probe accuracy affects data reproducibility.

Table 1.

Increased measurement precision for sequence verified probes in technical replicates

The dramatic difference in Spearman correlation coefficients led us to investigate intensity differences between the two groups of data. Mean signal intensities for unverified probe sets were less than those of verified probe sets (although not significantly), suggesting that removal of unverified probes might be equivalent to filtering low-intensity signals. We removed all unverified probe set signals with a maximum intensity value less than 300 and recalculated Pearson correlation coefficients. After filtering low-intensity probe set signals, verified probe sets showed a similar significant increase in correlation (Supplemental Table S3) compared with unverified probe sets. The increased correlations of verified probe sets were not dependent upon signal intensity, as shown by the data from the U95B platform, where removal of low signal intensity data from unverified probes resulted in a greater mean signal intensity for this group. Additionally, for U95A replicates (performed on treatment 1 data), we limited the unverified probe set data to the 500 probe sets with the largest mean intensities and recalculated the replicate correlations. This unverified probe set subgroup showed a greater mean signal intensity (1,919 vs. 1,409) but significantly lower correlations (verified, 0.991–0.947 vs. unverified, 0.986–0.916; P < 0.00001) than those for the verified probe sets.
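The intensity-filtering control can be sketched as follows, assuming the unverified probe set intensities are held in a NumPy array with one column per replicate; the threshold of 300 and the top-500 subset mirror the filters described above, and the function name is illustrative.

```python
# Illustrative intensity-filtering control: drop unverified probe sets whose
# maximum replicate signal is below 300, or keep only the 500 with the highest
# mean intensity, before recomputing the replicate correlations.
import numpy as np

def filter_unverified(signal, threshold=300.0, top_n=None):
    """signal -- array (n_unverified_probe_sets, n_replicates) of intensities."""
    if top_n is not None:
        keep = np.argsort(signal.mean(axis=1))[-top_n:]   # brightest probe sets
    else:
        keep = signal.max(axis=1) >= threshold            # max intensity filter
    return signal[keep]
```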

These differences were somewhat surprising, given the measurements were from technical (and not experimental) replicates. However, it is likely that unverified probes capture background hybridization to one or more transcripts. Whether individual unverified probes measure background, an alternate transcript, or multiple transcript variants, it is intuitive that unverified probe sets would have greater variability than probe sets that are verified to measure a single transcript. The difference in correlations clearly indicated a difference in measurement reproducibility and led us to further test the effects of probe sequence accuracy on data reliability.

Accuracy of Verified Probe Sets Across Platforms and Technologies

Having shown that verified probe sets capture information with greater precision in replicate experiments performed on one platform, we tested the effects of sequence verification on measurement accuracy across complementary microarray technologies. We used a data set consisting of experiments on each of two breast cancer cell lines and one normal mammary epithelial cell line (22). For each cell line, replicates were performed on the U95A, U133A, U133B, and Agilent cDNA microarray platforms.

We tested the effect of sequence verification upon measurement accuracy across generations of Affymetrix platforms. Probe sets from the U95 and U133 platforms were matched using UniGene identifiers. Probe sets that were shared across platforms were classified as verified if at least one probe matched its RefSeq. For example, UniGene Hs.20952 is measured by two probe sets on both the Hu95Av2 (37761_at, 33386_at) and Hu133A (209821_s_at, 208886_at) platforms. However, the RefSeq for this UniGene, NM_001682, is only measured by 37761_at and 209821_s_at. Therefore, we classified these probe sets as verified, whereas 33386_at and 208886_at were classified as unverified. Spearman and Pearson correlation coefficients were calculated for both the verified and unverified (matched) probe set signal intensities. To confirm that the observed effects were not due to specific data processing algorithms, we repeated the analyses using MAS5, RMA, and dChip.

For each comparison between two platforms, six correlations were calculated. Consistently significant (P < 0.00001) increases in correlations were observed for verified probe sets compared with unverified probe sets (Table 2 and Supplemental Fig. S3A) using all analysis methods. For example, P values for the Pearson correlation coefficients in the U95A to U133A comparison were all less than 0.000001 for MAS5, RMA, and dChip. The differences were unrelated to sample size, as RefSeq-verified probe sets comprised a majority of the matched probe sets in the U95A to U133A comparison and a minority in the U95A to U133B comparison. An interesting aspect of this analysis is the extremely low correlations for unverified probe sets in the U95A to U133B comparison, likely due to the high proportion of EST-derived probe sequences on the U133B platform.

Table 2.

Increased measurement accuracy for sequence verified probes across multiple versions of Affymetrix platforms and across Affymetrix and Agilent cDNA technologies

As in the technical replicate experiment, verified probe sets had higher mean signal intensities than unverified probe sets. Again, filtering out low signal intensity data did not remove the improvement in correlations for RefSeq-verified probe sets (Supplemental Table S4). For example, P values for the Pearson correlation coefficients in the U95A to U133A comparison were less than 0.000001 for MAS5, RMA, and dChip. These data clearly show that removing unverified probes is not equivalent to removing low-intensity signals.

Another technique used by some researchers to remove low-quality information is to utilize the Affymetrix “detection call” algorithm. We investigated the rate of “absent” and “present” calls for both the verified and unverified probe sets. For each of the U95A, U133A, and U133B platforms, the percentages of verified and unverified probe sets called present, absent, or “marginal” were calculated (Supplemental Table S5). Verified probe sets were scored present at a higher rate than unverified probe sets: 56% vs. 45% for U95A, 58% vs. 46% for U133A, and 53% vs. 34% for U133B. However, the greater than 40% average present rate for unverified probe sets clearly indicates that using the detection call as a basis for removing low-quality information from microarray experiments does not compensate for probe sequence inaccuracies. Additionally, we recalculated the correlations for the verified and unverified probe sets between the Hu95A and Hu133A platforms after removing all probe sets not consistently scored present. Verified probe sets retained significantly higher correlations than unverified probe sets (P < 0.00001, data not shown). These data further support the conclusion that unverified probes often measure a nonspecific transcript, albeit with lower reproducibility, which likely explains the improved measurement precision of verified probe sets in the technical replicate experiment.
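A sketch of the detection-call tally is shown below, assuming MAS5 present/absent/marginal calls have been exported to a dictionary keyed by probe set ID; it simply computes the fraction of each call per probe set class and is not the original analysis code.

```python
# Illustrative detection-call tally: fraction of probe sets called present
# ("P"), absent ("A"), or marginal ("M") for each probe set class.
from collections import Counter

def call_rates(calls, verified_ids):
    """calls        -- dict of {probe_set_id: "P" | "A" | "M"}
    verified_ids -- set of RefSeq-verified probe set IDs"""
    rates = {}
    groups = {"verified": set(verified_ids),
              "unverified": set(calls) - set(verified_ids)}
    for label, ids in groups.items():
        counts = Counter(calls[ps] for ps in ids if ps in calls)
        total = sum(counts.values()) or 1
        rates[label] = {c: counts[c] / total for c in ("P", "A", "M")}
    return rates
```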

Finally, we investigated the effects of sequence verification on measurement accuracy across Affymetrix and Agilent cDNA microarray technologies. There is greater concordance between cDNA and Affymetrix data when the probes used to measure expression contain similar information (22). Therefore, we only included probe sets from the U95 or U133 platforms if they overlapped a sequence used on the Agilent array. Probe sets that matched Agilent cDNA probe sequences were defined as verified when at least one Affymetrix probe was RefSeq verified. For example, we identified a match between probe set 207038_at on the U133A platform and Agilent clone U79745. All 11 Affymetrix probes from probe set 207038_at were verified in RefSeq NM_004694 and the GenBank mRNA for U79745. Probe sets that matched Agilent cDNA probe sequences were defined as unverified when no Affymetrix probe was RefSeq verified. For example, we identified a match between probe set 215264_at on the U133A platform and Agilent clone X68879. None of the probes from probe set 215264_at was contained in a RefSeq. As described in the materials and methods, log base 2-transformed Affymetrix expression ratios were correlated with sequence-matched cDNA expression ratios. We used MAS5, RMA, and dChip expression values to control for any bias introduced by different data processing algorithms.

Again, significantly increased correlations were consistently observed for sequence-verified probe sets (Table 2 and Supplemental Fig. S3B). For example, P values for the Pearson correlation coefficients in the U95A to Agilent comparison were 0.003 for MAS5, 0.022 for RMA, and 0.019 for dChip. Spearman correlation coefficients for U133A to Agilent were equivalent to Pearson correlation coefficients, with similar P values. However, when comparing data from U133B to Agilent, Spearman correlation coefficients were dramatically lower and not significantly increased for verified probe sets. This suggests poorer overall quality measurements on the U133B subarray, with improvements due to sequence verification being insufficient to reach significance.

Diagnostic Accuracy of Verified Probe Sets

In addition to the value of expression profiling in understanding biological mechanisms, this technology has been used as a predictive measure for defining disease states. One limitation of this application has been the inability to translate predictive value between complementary experiments (31). The goal of any cancer classification study is to uncover shared biology that can be used to identify additional cases of cancer, tumorigenesis, or metastasis. Moreover, the most basic test of any discrimination method should involve the detection of a difference between diseased and normal samples. We used these assumptions to test the effects of probe sequence accuracy in data from two independent breast cancer expression profiling studies performed on the U95A platform (10, 30, 38). Note that we did not address the results from these individual studies but instead investigated the effects of probe sequence verification upon the ability to compare multiple patient-related data sets.

To minimize artifacts arising from the experiments being performed at separate facilities (handling, hybridization conditions, scanner settings, etc.), we generated a single expression matrix using RMA. We identified the 1,000 genes that exhibited the highest mean value relative to their standard deviation for unsupervised hierarchical clustering. This filtering strategy was used to minimize the effects of noise or artifacts. Cluster analysis was then performed (Supplemental Fig. S4) for 1) all 1,000 probe sets, 2) the subset of these 1,000 probe sets that were unverified, and 3) the subset of these 1,000 probe sets that were sequence verified. When all 1,000 probe sets were used for clustering, the samples separated into two major nodes, each predominantly composed of samples from a single study. Moreover, the normal samples were split into three separate groups. This suggested that the noise in the system was much greater than the captured underlying biology. Clustering of the samples using only the unverified probe sets produced similar results. However, clustering of the samples using only the verified probe sets produced a striking improvement. First, all normal samples clustered tightly, separated from a majority of tumors. Additionally, although the samples still separated into two major nodes, each node now contained a substantial mixture of samples from both studies. The observed increase in diseased sample overlap (shared biology) and grouping of normal samples as highly similar indicates that restricting data to sequence-verified probes can improve the diagnostic power of microarray technology. This result does not address a particular classification scheme but indicates that removing unverified probe sets allows the major component of change to be related to the underlying biology of breast cancer as opposed to the source of the experiments.

DISCUSSION
Although inaccuracies in microarray probe sequences are an appreciated source of experimental noise, little information is available regarding their magnitude and contribution to microarray data variability. In the studies presented here, we have systematically assessed probe sequences on the most commonly used microarray technology. We find that numerous probes fail to correspond with current high-quality definitions of transcribed sequences as defined by the RefSeq database (Fig. 1 and Supplemental Table S1). We chose the RefSeq database, as it is the most highly curated, publicly available genome-wide definition of transcript sequence information. Although it is imperfect, it is likely to be the genomics “gold standard” for sequences of those regions of the genome that are transcribed. As described in results, there are many causes for these probe sequence inaccuracies, but most notably there has been constant improvement in sequence information databases over time. At the time of probe design, the sequences provided in publicly available databases might have been inconclusive (and further complicated due to alternate transcripts). Subsequently, the target sequence may have become outdated as a result of the consolidation of sequence information into RefSeq mRNAs. Regardless of the nature of probe sequence inaccuracies, we clearly show that sequence-verified probes perform more consistently, and with higher accuracy, within replicates and across different versions of the technology. Apparently, Affymetrix has come to the same conclusion and has recently released a platform containing RefSeq-verified probes. We conclude that verification of probe sequences against the best available transcript information is warranted.

As probe sequence inaccuracies would seem to be a likely source for measurement error, we directly tested this possibility. Within large technical replicate experiments (Table 1, and Supplemental Tables S2 and S3), across multiple generations of Affymetrix platforms (Table 2 and Supplemental Table S4) and across Affymetrix and (Agilent) cDNA technologies (Table 2), we consistently found significantly increased measurement accuracy for sequence-verified probe sets. We classified probe sets as completely verified, partially verified, or completely unverified using the individual probe verification information (Fig. 1, Supplemental Fig. S1, and Supplemental Table S1). For analysis of measurement accuracy, we considered probe sets verified if they included a single verified probe, as it is likely that the RefSeq database is not exhaustive and some transcript sequences are truncated. Further limiting verified probe sets to completely verified probes would likely increase the benefit of sequence verification but at the cost of excluding another ∼10–20% of the data set. We tested measurement accuracy for verified probe sets and compared them to the accuracy of unverified probe sets; this allowed for a quantification of “quality” and statistical comparison of these independent sets. Compared with the entire data set (all probe sets), sequence-verified probe sets consistently showed higher correlations. Therefore, the improvement in data accuracy, when filtering for verified probe sets, is directly related to the number of unverified probe sets. Although the benefit of sequence verification can be either modest or large, dependent upon the platform used, the rationale for including data derived from inaccurate probes is unclear. The logical conclusion of these studies is that interrogation of the most accurate sequences generates the most accurate data.

The utility of using multiple probes (probe sets) for monitoring individual gene expression has been refined by the implementation of probe-level normalization methods such as those utilized by dChip and RMA (12, 16). In theory, inaccuracy of individual probe measurements could be compensated for by these algorithms. However, our data clearly show that these methods do not completely compensate for probe sequence inaccuracies and that sequence verification adds additional benefit to microarray data analysis. Another simple explanation for the benefit of sequence verification would be that unverified probe sets showed lower signal intensities, due to failure to accurately measure any transcript, and that removal of unverified probe sets was equivalent to removing low signal intensity measurements. Others have shown a dependency of measurement accuracy upon signal intensity (24). Indeed, signal intensities for unverified probe sets were less than those for verified probe sets. However, verified probe sets showed greater measurement precision than unverified probe sets, even after removing low signal intensity probe sets (Supplemental Tables S3 and S4). Furthermore, although verified probe sets were more often scored “present” by the MAS5 detection call, nearly half of all unverified probe sets were also present (Supplemental Table S5). Although there is a relationship between verified probe sequences and signal intensity, these data reveal one limitation of simply removing probe sets with low signal intensities from microarray data and further show the benefit of sequence verification independent of thresholding for signal intensity or detection call.

Importantly, we have shown that the benefit of probe sequence verification extends beyond controlled in vitro experimental samples to potential diagnostic and predictive applications of microarrays (Supplemental Fig. S4). Unsupervised clustering of two independent human breast cancer data sets resulted in nearly complete separation of all patients from each study. A possible explanation could be that the breast cancer entities are distinct and related to different, regional biological mechanisms. A more plausible explanation for this result is that the signal-to-noise ratio of the data is insufficient to reliably capture the consistent underlying biology. Our data clearly show that sequence-verified probe sets improve the capture of the underlying biology as evidenced by the improved grouping of normal samples and improved mixing of samples from the two data sets. Unfortunately, since no clear subclassification for breast cancer exists, it is difficult to prove the clustering is “better.” However, the increased clustering of normal samples, which represent a distinct group, using verified probes supports this conclusion.

We are not the first to use probe sequence-based information to assess microarray data accuracy. For instance, Tan et al. (34) reported increased consistency of replicate measurements across Affymetrix, Agilent, and Amersham technologies. Sorlie et al. (31) used UniGene-matched probes to combine information from Affymetrix and Stanford cDNA microarrays in an effort to improve breast cancer classification. In all cases, postexperiment filtering using sequence information improves data quality. As combining data from multiple microarray platforms/technologies is certain to become a common approach, our results showing increased accuracy of sequence-verified probes across platforms (oligo vs. oligo and oligo vs. cDNA) substantiate the importance of using the most reliable information to verify equivalence of measurement across technologies. This can be facilitated by using the probe mapping files described above, which include lists of verified and unverified probe sets for each Affymetrix platform described in this study, as well as additional information (Supplemental Fig. S2) regarding the location of individual probes within RefSeqs. Alternatively, we encourage end-user verification with the most recent, publicly available sequence information.


This work was supported by the Harvard Lung Biology Center, National Institutes of Health Grants HL-71885 (to T. J. Mariani) and ES-11597-01 (to Y. Sadovsky), and by the Francis Families Foundation.


We thank Drs. S. Shapiro, L. Kunkel, M. Ramoni and A. Butte for helpful suggestions.

Present address of D. Z. Wetmore: Neuroscience Graduate Program, Stanford University School of Medicine, Stanford, CA 94305.


  • 1 The Supplementary Material for this article (Supplemental Tables S1–S5 and Supplemental Figs. S1–S4) is available online at the Physiological Genomics web site.

  • Article published online before print. See web site for date of publication.

    Address for reprint requests and other correspondence: T. J. Mariani, Pulmonary and Critical Care Medicine, Brigham and Women’s Hospital, Harvard Medical School, 75 Francis St., Boston, MA 02115 (E-mail: tmariani{at}


