Files in this item



application/pdf NUTE-DISSERTATION-2019.pdf (5MB)
(no description provided) PDF


application/ (49MB)
(no description provided) ZIP


application/ (28MB)
(no description provided) ZIP


Title:Statistical estimation problems in phylogenomics and applications in microbial ecology
Author(s):Nute, Michael Gordon
Director of Research:Warnow, Tandy J
Doctoral Committee Chair(s):Warnow, Tandy J
Doctoral Committee Member(s):Gropp, William; Zhao, Dave; Stumpf, Rebecca; Chen, Yuguo; Pop, Mihai
Department / Program:Statistics
Degree Granting Institution:University of Illinois at Urbana-Champaign
Multiple Sequence Alignment
Abstract:With the growing awareness of the potential for microbial communities to play a role in human health, environmental remediation and other important processes, the challenge of understanding such a complex population through the lens of high-throughput sequencing output has risen to the fore. For a de novo sequenced community, the first step to understanding the population involves comparing the sequences to a reference database in some form. In this dissertation, we consider some challenges and benefits of organizing the reference data according to evolution, with orthologous genes grouped together and stored as a multiple sequence alignment and phylogenetic tree. First, we consider the related problem of estimating the population-level phylogeny of a group of species based on the alignments and phylogenies of several individual genes. Under one common model, species tree estimation is provably statistically consistent by several different methods, but those proofs rely on two separate and potentially shaky assumptions: that every species appears in the data for every gene (i.e., there is no missing data), and that since gene tree estimation is itself consistent, the gene trees used to compute the population-level tree are correct. Second, we explore some novel ways to use a Bayesian MCMC algorithm for jointly estimating alignment and phylogeny. The result is increased accuracy for large alignments, where the MCMC method alone would not be tractable. In the process, we identify a peculiar property of this Bayesian algorithm: it performs much differently on simulated sequences than on sequences from biological alignment benchmarks. No other alignment method tested showed the same divergence. Finally, we present two different practical applications of a reference database containing an alignment and tree for a group of gene families in the context of microbial ecology.
The first is an algorithm that uses the tree and alignment to construct an ensemble of profile hidden Markov models that improves remote homology detection. The second is a data visualization technique that generates an image of the community with a high density of data, but one that makes it naturally easy to compare many different samples at a time, potentially uncovering otherwise elusive patterns in the data.
Issue Date:2019-04-19
Rights Information:Copyright 2019 Michael Nute
Date Available in IDEALS:2019-11-26
Date Deposited:2019-08
