Files in this item



GUPTA-THESIS-2016.pdf (application/pdf, 1MB)


Title: Improving gene trees without more data
Author(s): Gupta, Ashu
Advisor(s): Warnow, Tandy
Department / Program: Computer Science
Discipline: Computer Science
Degree Granting Institution: University of Illinois at Urbana-Champaign
Subject(s): Gene trees
Species trees
Multi-locus bootstrapping (MLBS)
Gene tree estimation
Species tree estimation
Low phylogenetic signal
Abstract: Species tree and gene tree estimation from sequence data are two steps in many biological analyses. Computational challenges and limited amounts of data often make estimating highly accurate phylogenetic trees a difficult task. Moreover, gene alignments used to estimate trees on individual loci often have low phylogenetic signal (e.g., short alignment length), resulting in poorly estimated gene trees. Species tree estimation, on the other hand, is challenged by individual loci having different evolutionary histories caused by a biological phenomenon known as incomplete lineage sorting (ILS). In the presence of ILS, summary methods like MP-EST, ASTRAL2, and ASTRID are often used to estimate the species tree from gene trees. Summary methods operate by combining estimated gene trees and thus suffer in the presence of low phylogenetic signal. To tackle this problem, the Statistical Binning and Weighted Statistical Binning pipelines were designed to improve gene tree estimation, which in turn can improve species tree estimation. Experimental studies of these pipelines revealed that they helped in improving gene tree and species tree estimation. However, these studies only tested the weighted statistical binning and statistical binning pipelines using multi-locus bootstrapping (MLBS) and not using BestML, where MLBS and BestML are different ways to run a phylogenetic pipeline. In this thesis, a novel phylogenetic pipeline named WSB+WQMC is proposed. This pipeline shares several design features with the weighted statistical binning pipeline (referred to as WSB+CAML in this thesis) but has some other desirable properties. The WSB+WQMC pipeline is also shown to be statistically consistent under the GTR+MSC model when a slightly different version of WQMC is used. In this study, WSB+WQMC was evaluated and compared with the WSB+CAML pipeline on various simulated datasets using BestML analysis.
Most of the trends seen in MLBS analyses were also observed for WSB+WQMC and WSB+CAML in BestML analyses, with some important differences. It is shown that WSB+WQMC substantially improved the accuracy of gene tree and species tree estimation using ASTRAL2 and ASTRID on most datasets having low, medium, and moderately high levels of ILS. Compared to WSB+CAML, WSB+WQMC was found to compute less accurate gene trees and species trees in certain model conditions having low and medium levels of ILS. However, WSB+WQMC was found to be better than or at least as accurate as WSB+CAML in computing gene trees and species trees on all datasets having moderately high and high ILS levels. WSB+WQMC is also shown to be better at estimating gene trees on certain medium and low ILS datasets. Thus, WSB+WQMC is a potential alternative to WSB+CAML for gene tree and species tree estimation in the presence of low phylogenetic signal.
Issue Date: 2016-04-28
Rights Information: Copyright 2016 Ashu Gupta
Date Available in IDEALS: 2016-07-07
Date Deposited: 2016-05
