Withdraw
Loading…
Multiple sequence recovery: theory and applications
Levick, Kel
This item's files can only be accessed by the System Administrators group.
Permalink
https://hdl.handle.net/2142/132807
Description
- Title
- Multiple sequence recovery: theory and applications
- Author(s)
- Levick, Kel
- Issue Date
- 2025-12-09
- Director of Research (if dissertation) or Advisor (if thesis)
- Shomorony, Ilan
- Doctoral Committee Chair(s)
- Shomorony, Ilan
- Committee Member(s)
- Hajek, Bruce
- Milenkovic, Olgica
- Slizovskiy, Ilya
- Veeravalli, Venugopal
- Department of Study
- Electrical & Computer Eng
- Discipline
- Electrical & Computer Engr
- Degree Granting Institution
- University of Illinois Urbana-Champaign
- Degree Name
- Ph.D.
- Degree Level
- Dissertation
- Keyword(s)
- information theory
- coding
- bioinformatics
- computational biology
- DNA storage
- Abstract
- In this dissertation, we seek to understand the feasibility of recovering multiple sequences from a set of observations containing information about the source sequences, referred to here as the multiple sequence recovery problem. As we will demonstrate, this problem plays an imperative role in areas such as DNA coding for data storage and computational biology. Since many algorithms for DNA assembly and error correction use k-mers, or substrings of length k, of input sequences, we first consider the task of reconstructing multiple source strings from the complete collection of their k-mers; specifically, under what conditions on k and string length n can m strings be uniquely reconstructed in the asymptotic regime. Our results nearly fully characterize the feasibility of this problem, and an intriguing discovery we make is that the feasible region overlaps with the “repeat-abundant region,” i.e., where k-mers occur multiple times within and between source strings with probability approaching 1. Next, we examine a coded scenario, where a set of M strings of length L are treated as a single codeword from a known codebook, and each string is corrupted by erasures and observed (i.e., drawn with replacement) a random number of times. We analyze both the single-draw and multi-draw versions of this channel model. In the single-draw version, the problem reduces to attempting to assemble the strings in the same order as the source codeword. However, the multi-draw version involves the challenge of determining the set of noisy reads corresponding to each of the input strings to attempt to eliminate erasures. We find a generalized expression for the capacity of both cases and prove that linear coding schemes achieve that capacity. The last part of this work tackles the problem of haplotype recovery, attempting to characterize the antibiotic resistome of a metagenomic sample. We begin by proving the statistical consistency of two different estimators for the set of haplotype sequences: the maximum likelihood estimator and a clustering method that minimizes the worst-case bit error across all haplotype clusters. Finally, we construct and demonstrate two practical methods for determining the resistome of real samples.
- Graduation Semester
- 2025-12
- Type of Resource
- Thesis
- Handle URL
- https://hdl.handle.net/2142/132807
- Copyright and License Information
- Copyright 2025 Kel Levick
Owning Collections
Graduate Dissertations and Theses at Illinois PRIMARY
Graduate Theses and Dissertations at IllinoisManage Files
Loading…
Edit Collection Membership
Loading…
Edit Metadata
Loading…
Edit Properties
Loading…
Embargoes
Loading…