A comparison of three variant calling pipelines using simulated data

Nguyen Van Tung; Nguyen Thi Kim Lien; Nguyen Huy Hoang

doi:10.15625/2615-9023/16006

Author affiliations

Nguyen Van Tung, Institute of Genome Research, VAST, Vietnam (https://orcid.org/0000-0003-4624-5567)
Nguyen Thi Kim Lien, Institute of Genome Research, VAST, Vietnam
Nguyen Huy Hoang, Graduate University of Science and Technology, VAST, Vietnam

Keywords:

Bcftools, GATK, Simulated data, Variant calling, VarScan, Dwgsim

Abstract

Advances in next-generation sequencing allow DNA to be sequenced rapidly and at relatively low cost. Multiple bioinformatics methods have been developed to identify genomic variants from whole-genome or whole-exome sequencing data. The development of better variant calling methodologies is limited by the difficulty of assessing the accuracy and completeness of a new method. Computational methods are normally benchmarked on simulated data, which can be generated in any desired quantity and under controlled scenarios. In this study, we compared three variant calling pipelines, Samtools/VarScan, Samtools/Bcftools, and Picard/GATK, using two simulated datasets. The results showed a significant difference among the three pipelines in both cases. In the chromosome 6 dataset, the GATK and Bcftools pipelines detected more than 90% of variants, whereas the VarScan pipeline detected only 82.19%. In the NA12878 dataset, the GATK pipeline was more sensitive than the Bcftools and VarScan pipelines. All pipelines showed a high positive predictive value. Measured by run time, VarScan ranked highest among the pipelines, although GATK offers a multithreading option that can make it run faster. Therefore, GATK is more effective than Bcftools and VarScan for variant calling on lower-coverage datasets.
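The two accuracy measures named in the abstract, sensitivity and positive predictive value (PPV), can be illustrated with a minimal sketch. It assumes the simulated (truth) variants and a pipeline's calls have already been parsed into sets keyed by chromosome, position, reference allele and alternate allele. The function name `evaluate_calls`, the `Metrics` type and the toy variants are hypothetical and are not taken from the study; only the standard definitions, sensitivity = TP / (TP + FN) and PPV = TP / (TP + FP), are used.

```python
from typing import NamedTuple, Set, Tuple

# A variant is keyed by (chromosome, 1-based position, ref allele, alt allele).
# This exact-match key is an assumption made for illustration only.
Variant = Tuple[str, int, str, str]


class Metrics(NamedTuple):
    true_positives: int
    false_positives: int
    false_negatives: int
    sensitivity: float
    ppv: float


def evaluate_calls(truth: Set[Variant], called: Set[Variant]) -> Metrics:
    """Compare called variants against the simulated truth set."""
    tp = len(truth & called)   # simulated variants that were called
    fp = len(called - truth)   # calls that were never simulated
    fn = len(truth - called)   # simulated variants that were missed
    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0
    ppv = tp / (tp + fp) if (tp + fp) else 0.0
    return Metrics(tp, fp, fn, sensitivity, ppv)


# Toy data (hypothetical, not results from the study).
truth = {
    ("chr6", 100, "A", "G"),
    ("chr6", 250, "C", "T"),
    ("chr6", 400, "G", "A"),
    ("chr6", 900, "T", "C"),
}
called = {
    ("chr6", 100, "A", "G"),
    ("chr6", 250, "C", "T"),
    ("chr6", 400, "G", "A"),
    ("chr6", 777, "A", "T"),
}

metrics = evaluate_calls(truth, called)
print(f"sensitivity={metrics.sensitivity:.2f} ppv={metrics.ppv:.2f}")
```

Exact matching on position and alleles is the simplest possible comparison. A real benchmark would typically normalize indel representations before comparing, which this sketch does not attempt.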
