Files in this item

Files: MOON-DISSERTATION-2019.pdf (4MB)
Description: (no description provided)
Format: PDF (application/pdf)

Description

Title: Entropy-based machine learning algorithms applied to genomics and pattern recognition
Author(s): Moon, Wooyoung
Director of Research: Song, Jun S.
Doctoral Committee Chair(s): Dahmen, Karin
Doctoral Committee Member(s): Kuehn, Seppe; Draper, Patrick
Department / Program: Physics
Discipline: Physics
Degree Granting Institution: University of Illinois at Urbana-Champaign
Degree: Ph.D.
Genre: Dissertation
Subject(s): Machine Learning, Decision Trees, Convolutional Filters, Genomics, Cancer, Entropy
Abstract: Transcription factors (TFs) are proteins that interact with DNA to regulate the transcription of DNA to RNA and play key roles in both healthy and cancerous cells. Thus, gaining a deeper understanding of the biological factors underlying TF binding specificity is important for understanding the mechanism of oncogenesis. As large biological datasets become more readily available, machine learning (ML) algorithms have proven to be an important and useful set of tools for cancer researchers. However, there remains room for improvement in these ML models, including greater model interpretability and overall accuracy. In this thesis, we present decision tree (DT) methods applied to DNA sequence analysis that result in highly interpretable and accurate predictions. We propose a boosted decision tree (BDT) model that uses the binary counts of important DNA motifs to predict the binding specificity of TFs that belong to the same protein family and bind similar DNA sequences. We then introduce a novel application of Convolutional Decision Trees (CDT) and demonstrate that this approach has distinct advantages over the BDT model while still accurately predicting the binding specificity of TFs. The CDT models are trained using the Cross Entropy (CE) optimization method, a Monte Carlo optimization method based on concepts from information theory related to statistical mechanics. We then further study the CDT model as a general pattern recognition and transfer learning technique and demonstrate that this approach can learn translationally invariant patterns that lead to high classification accuracy, while remaining more interpretable and learning higher-quality convolutional filters than convolutional neural networks (CNNs).
Issue Date: 2019-04-16
Type: Text
URI: http://hdl.handle.net/2142/104838
Rights Information: Copyright 2019 Wooyoung Moon
Date Available in IDEALS: 2019-08-23
Date Deposited: 2019-05

