A new filtering method for improving the quality of variant discovery
Zhang, Chuanyi
Permalink
https://hdl.handle.net/2142/106361
Description
- Title
- A new filtering method for improving the quality of variant discovery
- Author(s)
- Zhang, Chuanyi
- Issue Date
- 2019-12-02
- Director of Research (if dissertation) or Advisor (if thesis)
- Ochoa, Idoia
- Department of Study
- Electrical & Computer Eng
- Discipline
- Electrical & Computer Engr
- Degree Granting Institution
- University of Illinois at Urbana-Champaign
- Degree Name
- M.S.
- Degree Level
- Thesis
- Date of Ingest
- 2020-03-02T22:15:03Z
- Keyword(s)
- Filtering
- VCF file
- Ensemble learning
- Abstract
- Variant sets produced by current genomic analysis pipelines contain many incorrectly called variants. State-of-the-art filtering tools, such as Variant Quality Score Recalibration (VQSR) or Hard Filtering (HF), can potentially eliminate these, but they are very user-dependent and fail to run in some cases. We propose Variant Ensemble Filter (VEF), a variant filtering tool based on decision tree ensemble methods that overcomes the main drawbacks of VQSR and HF. Unlike these methods, VEF treats filtering as a supervised learning problem, using variant call data with known “true” variants (i.e., a gold standard) for training. Once trained, VEF can be applied directly to filter the variants in a given VCF file (we consider training and testing VCF files generated with the same tools, as we assume they share feature characteristics). For the analysis, we used Whole Genome Sequencing (WGS) human datasets for which gold standards are available. On these data, we show that VEF consistently outperforms VQSR and HF. In addition, we show that VEF generalizes well even when some features have missing values, when the training and testing datasets differ in coverage, and when sequencing pipelines other than GATK are used. Finally, since training needs to be performed only once, VEF saves significant running time compared to VQSR (approximately 4 versus 50 minutes for filtering the SNPs of a WGS human sample).
- Graduation Semester
- 2019-12
- Type of Resource
- text
- Permalink
- http://hdl.handle.net/2142/106361
- Copyright and License Information
- Copyright 2019 Chuanyi Zhang
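The supervised setup described in the abstract (labeling called variants against a gold standard and using their annotations as features) might be prepared roughly as in the sketch below. This is only an illustration under stated assumptions, not the thesis's implementation: the feature names `QD`, `MQ`, and `FS`, the helper names, and the toy records are all hypothetical, and multi-allelic sites are not handled. The resulting feature matrix and labels could then be passed to any decision tree ensemble that tolerates missing values.

```python
import math

# Hypothetical feature names; the INFO keys actually available depend on the caller.
FEATURES = ["QD", "MQ", "FS"]

def parse_vcf_line(line):
    """Return ((chrom, pos, ref, alt), info_dict) for one tab-separated VCF data line."""
    fields = line.rstrip("\n").split("\t")
    chrom, pos, _id, ref, alt = fields[:5]
    info = {}
    for item in fields[7].split(";"):
        if "=" in item:  # flag-style INFO entries without a value are skipped
            key, value = item.split("=", 1)
            info[key] = value
    return (chrom, int(pos), ref, alt), info

def to_features(info, names=FEATURES):
    """Build a numeric feature vector; missing or non-numeric values become NaN."""
    row = []
    for name in names:
        try:
            row.append(float(info[name]))
        except (KeyError, ValueError):
            row.append(math.nan)
    return row

def build_training_set(vcf_lines, gold_keys):
    """Label each called variant 1 if it appears in the gold standard, else 0."""
    X, y = [], []
    for line in vcf_lines:
        if line.startswith("#"):  # skip header lines
            continue
        key, info = parse_vcf_line(line)
        X.append(to_features(info))
        y.append(1 if key in gold_keys else 0)
    return X, y

# Toy example (made-up records, for illustration only).
calls = [
    "##fileformat=VCFv4.2",
    "chr1\t100\t.\tA\tG\t50\t.\tQD=12.5;MQ=60.0;FS=1.2",
    "chr1\t200\t.\tC\tT\t10\t.\tQD=2.1;MQ=30.0",
]
gold = {("chr1", 100, "A", "G")}
X, y = build_training_set(calls, gold)
```

In this toy input the second record lacks `FS`, so its third feature is left as NaN instead of being dropped, mirroring the abstract's point that some features may have missing values.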
Owning Collections
Graduate Dissertations and Theses at Illinois (PRIMARY)
Dissertations and Theses - Electrical and Computer Engineering