Accurate indel prediction using paired-end short reads


BMC Genomics, 14:132

Plant genomics


Background: One of the major open challenges in next generation sequencing (NGS) is the accurate identification of structural variants such as insertions and deletions (indels). Current methods for indel calling assign scores to different types of evidence or counter-evidence for the presence of an indel, such as the number of split-read alignments spanning the boundaries of a deletion candidate or the number of reads that map within a putative deletion. Candidates with a score above a manually defined threshold are then predicted to be true indels. As a consequence, structural variants detected in this manner contain many false positives.
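The threshold-based calling described in the Background can be sketched roughly as follows. This is only an illustration of the general idea, not the scoring scheme of any particular tool: the `DeletionCandidate` type, the linear score, the weights, and the example counts are all assumptions made up for this sketch.

```python
from dataclasses import dataclass


@dataclass
class DeletionCandidate:
    # Hypothetical evidence counts for one deletion candidate.
    split_reads: int     # split-read alignments spanning the candidate's boundaries (evidence)
    internal_reads: int  # reads mapping inside the putative deletion (counter-evidence)


def score(candidate, w_split=1.0, w_internal=1.0):
    """Assumed linear score: evidence counts minus counter-evidence counts."""
    return w_split * candidate.split_reads - w_internal * candidate.internal_reads


def call_indels(candidates, threshold):
    """Keep every candidate whose score exceeds a manually chosen threshold."""
    return [c for c in candidates if score(c) > threshold]


# Made-up example counts, for illustration only.
candidates = [
    DeletionCandidate(split_reads=8, internal_reads=0),
    DeletionCandidate(split_reads=5, internal_reads=4),
    DeletionCandidate(split_reads=3, internal_reads=0),
]
called = call_indels(candidates, threshold=2.0)
```

The set of calls depends entirely on the hand-picked threshold and weights: raising the threshold drops weakly supported candidates, lowering it admits more of them, and nothing in the procedure is calibrated against validated indels.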

Results: Here, we present a machine-learning-based method that discovers indel candidates and distinguishes true from false ones in order to reduce the false positive rate. Our method identifies indel candidates using a discriminative classifier based on features of split-read alignment profiles and trained on true and false indel candidates that were validated by Sanger sequencing. We demonstrate the usefulness of our method with paired-end Illumina reads from 80 genomes of the first phase of the 1001 Genomes Project in Arabidopsis thaliana.
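To make the contrast with a hand-set threshold concrete, here is a minimal sketch of learning a decision rule from labelled candidates. It is not the authors' implementation: the abbreviation list mentions a support vector machine, whereas this sketch swaps in a plain perceptron so that it runs with the standard library only. The two features, the toy values, and the labels are hypothetical stand-ins for split-read profile features and Sanger-validated outcomes.

```python
def train_perceptron(samples, labels, max_epochs=100):
    """Learn weights w and bias b so that sign(w.x + b) matches the labels (+1 / -1)."""
    w = [0.0] * len(samples[0])
    b = 0.0
    for _ in range(max_epochs):
        mistakes = 0
        for x, y in zip(samples, labels):
            activation = sum(wi * xi for wi, xi in zip(w, x)) + b
            if y * activation <= 0:  # misclassified (or on the boundary): update
                w = [wi + y * xi for wi, xi in zip(w, x)]
                b += y
                mistakes += 1
        if mistakes == 0:  # every training candidate is classified correctly
            break
    return w, b


def predict(w, b, x):
    """Return +1 (predicted true indel) or -1 (predicted false candidate)."""
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) + b > 0 else -1


# Hypothetical features per candidate: (split reads at boundaries, reads inside the deletion).
features = [(8, 0), (6, 1), (7, 0), (2, 5), (1, 3), (3, 6)]
# Hypothetical validation labels: +1 = confirmed indel, -1 = rejected candidate.
labels = [1, 1, 1, -1, -1, -1]

w, b = train_perceptron(features, labels)
new_call = predict(w, b, (9, 1))
```

The point of the sketch is the design choice rather than the specific learner: the decision boundary is fitted to validated examples instead of being chosen by hand, so the weighting of evidence against counter-evidence comes from the training data.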

Conclusion: In this work we show that indel classification is a necessary step to reduce the number of false positive candidates. We demonstrate that omitting classification may lead to spurious biological interpretations. The software is available at:

Keywords: Next generation sequencing; Indel detection; Discriminative machine learning; Paired-end short reads; Split-read mapping

Abbreviations:
NGS: next generation sequencing
SVM: support vector machine
MUR: mapped-unmapped read pair
SV: structural variant
SNP: single nucleotide polymorphism
GWA: genome wide association
PEM: paired-end mapping
SRM: split-read mapping
UMR: uniquely mapped read
N-UMR: non-uniquely mapped read
SPV: single position variant
AUC: area under the curve
PCA: principal component analysis
PC: principal component
ROC: receiver operating characteristic
TN: true negative
TP: true positive
FN: false negative
FP: false positive
TPR: true positive rate
TNR: true negative rate

Electronic supplementary material: The online version of this article (doi:10.1186/1471-2164-14-132) contains supplementary material, which is available to authorized users.

Dominik Grimm and Jörg Hagmann contributed equally to this work.


Authors: Dominik Grimm, Jörg Hagmann, Daniel Koenig, Detlef Weigel, Karsten Borgwardt

