Purpose
To detect contamination of genomic DNA samples used in next-generation sequencing applications. To take advantage of the abundance of third-party sequencing solutions, it is important to ensure that any variations detected originate from the correct patient sample. Known genotype fingerprints can help validate sample identity, but additional quantitative measures are required to ensure sample integrity.
Methods
The Genome Analysis Toolkit (GATK) from the Broad Institute was used to call variations. For each variation, the relative number of supporting reads (supporting reads / coverage) was calculated. The distribution of these relative values in each sample was compared to a distribution derived from a cohort of control samples. Contamination was detected as an increase in the number of variations with a relative number of supporting reads below 35%.
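The calculation described above can be sketched in a few lines of Python. This is a minimal illustration and not the implementation used in this work: the function names are hypothetical, each variation is assumed to be available as a (supporting reads, coverage) pair (for example, derived from the per-sample AD and DP fields of a GATK VCF), and the z-score against the control cohort is an assumed way of quantifying the "increase", which the abstract does not specify.

```python
from statistics import mean, stdev


def relative_support(variants):
    """Return supporting / coverage for each (supporting, coverage) pair.

    Variations with zero coverage are skipped to avoid division by zero.
    """
    return [s / c for s, c in variants if c > 0]


def fraction_below(ratios, threshold=0.35):
    """Fraction of variations whose relative support is below the threshold."""
    if not ratios:
        raise ValueError("no variations with non-zero coverage")
    return sum(r < threshold for r in ratios) / len(ratios)


def low_support_excess(sample, controls, threshold=0.35):
    """Compare a sample's low-support fraction with a control cohort.

    Returns (sample_fraction, z_score). The z-score is an assumption made
    for this sketch; it needs at least two control samples.
    """
    sample_frac = fraction_below(relative_support(sample), threshold)
    control_fracs = [
        fraction_below(relative_support(c), threshold) for c in controls
    ]
    mu = mean(control_fracs)
    sigma = stdev(control_fracs)
    if sigma == 0:
        raise ValueError("control cohort shows no variation in this metric")
    return sample_frac, (sample_frac - mu) / sigma
```

A sample would then be flagged when its z-score exceeds a cutoff chosen from the control cohort; that cutoff is not given here and would need to be calibrated on validated samples.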
Results
We have developed and implemented a systematic approach for identifying contamination in samples used in next-generation sequencing experiments. The distribution of relative supporting reads for a few dozen exomes is shown in Figure 1 below. Non-contaminated samples are shown with solid black lines. Contamination presents as a substantial and distinctive increase in the fraction of variations found below 50%. Two samples (large dashed lines) are clearly contaminated, and two other samples (small dashed lines) show signs of potential contamination. We are actively evaluating exomes from several large whole-exome sequencing projects. Together with our collaborators, we will validate samples that appear contaminated in order to evaluate our algorithm’s specificity and sensitivity.
Conclusions
We have developed a simple method for identifying contaminated samples in exome sequencing experiments. Further research in this area is needed to determine the power of this method in identifying and quantifying the extent of contamination, and the amount of contamination that can be tolerated without compromising accuracy.
Keywords: 604 mutations •
467 clinical laboratory testing •
473 computational modeling