Title: Supertree-like methods for genome-scale species tree estimation
Author(s): Molloy, Erin Katherine
Director of Research: Warnow, Tandy; Gropp, William
Doctoral Committee Chair(s): Warnow, Tandy
Doctoral Committee Member(s): Snir, Marc; Nakhleh, Luay
Department / Program: Computer Science
Discipline: Computer Science
Degree Granting Institution: University of Illinois at Urbana-Champaign
Subject(s): species tree estimation; Multi-Species Coalescent model; gene duplication and loss
Abstract: A critical step in many biological studies is the estimation of evolutionary trees (phylogenies) from genomic data. Of particular interest is the species tree, which illustrates how a set of species evolved from a common ancestor. While species trees were previously estimated from a few regions of the genome (genes), it is now widely recognized that biological processes can cause the evolutionary histories of individual genes to differ from each other and from the species tree. This heterogeneity across the genome is phylogenetic signal that can be leveraged to estimate species evolution with greater accuracy. Hence, species tree estimation is expected to be greatly aided by current large-scale sequencing efforts, including the 5000 Insect Genomes Project, the 10000 Plant Genomes Project, the (~60000) Vertebrate Genomes Project, and the Earth BioGenome Project, which aims to assemble genomes (or at least genome-scale data) for 1.5 million eukaryotic species in the next ten years. To analyze these forthcoming datasets, species tree estimation methods must scale to thousands of species and tens of thousands of genes; however, many of the current leading methods, which are heuristics for NP-hard optimization problems, can be prohibitively expensive on datasets of this size. In this dissertation, we argue that new methods are needed to enable scalable and statistically rigorous species tree estimation pipelines; we then seek to address this challenge through the introduction of three supertree-like methods: NJMerge, TreeMerge, and FastMulRFS. For these methods, we present theoretical results (worst-case running time analyses and proofs of statistical consistency) as well as empirical results on simulated datasets (and a fungal dataset for FastMulRFS). Overall, these methods enable statistically consistent species tree estimation pipelines that achieve comparable accuracy to the dominant optimization-based approaches while dramatically reducing running time.
Issue Date: 2020-06-29
Rights Information: Copyright 2020 Erin Molloy
Date Available in IDEALS: 2020-10-07
Date Deposited: 2020-08
