Files in this item

Files: WICKLAND-DISSERTATION-2019.pdf (4MB), Restricted to U of Illinois
Description: (no description provided)
Format: PDF (application/pdf)

Description

Title: Computational methods for genomic variant calling and analysis
Author(s): Wickland, Daniel Paul
Director of Research: Hudson, Matthew E
Doctoral Committee Chair(s): Hudson, Matthew E
Doctoral Committee Member(s): Asmann, Yan W; Mainzer, Liudmila S; Moose, Stephen P; Vodkin, Lila O
Department / Program: Crop Sciences
Discipline: Informatics
Degree Granting Institution: University of Illinois at Urbana-Champaign
Degree: Ph.D.
Genre: Dissertation
Subject(s): variant calling, Alzheimer's disease, batch effect, genotyping-by-sequencing
Abstract: The development of short-read, next-generation sequencing (NGS) has revolutionized biological research, agriculture and medicine, enabling innovations such as genomic selection to raise crop yields and precision medicine to diagnose and treat disease. The genetic polymorphisms identified by this high-throughput sequencing can serve as markers for association with phenotypic traits. Variant calling refers to the process of detecting genetic polymorphisms based on analysis of genome sequence data output by NGS technology. The projects described here investigate these analysis methods. Chapter One reviews variant calling and its application to human and plant genomic data. It opens by detailing the generation of sequence reads from biological samples and the conversion of those reads to meaningful data, emphasizing the importance of tool selection for analysis. Next, the use of sequencing to identify genetic risk factors in the context of Alzheimer’s disease is reviewed. The chapter concludes by describing the application of sequencing to analysis of plant genomes. Chapter Two presents a study of the impact of batch effect and study design on identification of genetic risk factors in human sequencing data. Sequencing-based searches for disease-associated variants require large sample sizes to achieve sufficient statistical power, but they often entail batch effects and biases from study design, both of which hinder the ability to detect true genotype-trait associations. We studied batch effects and confounding variables in whole-exome data from the Alzheimer’s Disease Sequencing Project and demonstrated that both significantly impacted the association analysis. In particular, we identified variants with novel disease associations that may have been influenced by population stratification and a confounding effect of age. Chapter Three reports a comparison of genotyping-by-sequencing (GBS) analysis methods on plant data. As a reduced-representation sequencing method to identify genetic variants and quickly genotype samples, GBS produces extensive missing data and requires complex bioinformatics analysis, particularly in the context of plants, which have highly variable ploidy and repeat content. To address issues identified with existing methods, we developed GB-eaSy, a GBS bioinformatics pipeline that incorporates widely used genomics tools, parallelization and automation to increase the accuracy and accessibility of GBS data analysis. A comparison of five GBS pipelines using low-coverage sequence data from soybean demonstrated that GB-eaSy rapidly and accurately identified the greatest number of variants. In addition, the unexpectedly low convergence between the five analysis methods but generally high accuracy indicated that the workflows arrived at largely complementary sets of valid variant calls.
Issue Date: 2019-05-24
Type: Text
URI: http://hdl.handle.net/2142/105742
Rights Information: Copyright 2019 Daniel P. Wickland
Date Available in IDEALS: 2019-11-26
Date Deposited: 2019-08

