Multiple sequence recovery: theory and applications

Levick, Kel

Multiple sequence recovery: theory and applications

Levick, Kel

This item's files can only be accessed by the System Administrators group.

Permalink

https://hdl.handle.net/2142/132807

Description

Title

Multiple sequence recovery: theory and applications

Author(s)

Levick, Kel

Issue Date

2025-12-09

Director of Research (if dissertation) or Advisor (if thesis)

Shomorony, Ilan

Doctoral Committee Chair(s)

Shomorony, Ilan

Committee Member(s)

Hajek, Bruce
Milenkovic, Olgica
Slizovskiy, Ilya
Veeravalli, Venugopal

Department of Study

Electrical & Computer Eng

Discipline

Electrical & Computer Engr

Degree Granting Institution

University of Illinois Urbana-Champaign

Degree Name

Ph.D.

Degree Level

Dissertation

Keyword(s)

information theory
coding
bioinformatics
computational biology
DNA storage

Language

eng

Abstract

In this dissertation, we seek to understand the feasibility of recovering multiple sequences from a set of observations containing information about the source sequences, referred to here as the multiple sequence recovery problem. As we will demonstrate, this problem plays an imperative role in areas such as DNA coding for data storage and computational biology. Since many algorithms for DNA assembly and error correction use k-mers, or substrings of length k, of input sequences, we first consider the task of reconstructing multiple source strings from the complete collection of their k-mers; specifically, under what conditions on k and string length n can m strings be uniquely reconstructed in the asymptotic regime. Our results nearly fully characterize the feasibility of this problem, and an intriguing discovery we make is that the feasible region overlaps with the “repeat-abundant region,” i.e., where k-mers occur multiple times within and between source strings with probability approaching 1. Next, we examine a coded scenario, where a set of M strings of length L are treated as a single codeword from a known codebook, and each string is corrupted by erasures and observed (i.e., drawn with replacement) a random number of times. We analyze both the single-draw and multi-draw versions of this channel model. In the single-draw version, the problem reduces to attempting to assemble the strings in the same order as the source codeword. However, the multi-draw version involves the challenge of determining the set of noisy reads corresponding to each of the input strings to attempt to eliminate erasures. We find a generalized expression for the capacity of both cases and prove that linear coding schemes achieve that capacity. The last part of this work tackles the problem of haplotype recovery, attempting to characterize the antibiotic resistome of a metagenomic sample. We begin by proving the statistical consistency of two different estimators for the set of haplotype sequences: the maximum likelihood estimator and a clustering method that minimizes the worst-case bit error across all haplotype clusters. Finally, we construct and demonstrate two practical methods for determining the resistome of real samples.

Graduation Semester

2025-12

Type of Resource

Thesis

Handle URL

https://hdl.handle.net/2142/132807

Copyright and License Information

Owning Collections

Graduate Dissertations and Theses at Illinois PRIMARY

Graduate Theses and Dissertations at Illinois

Dissertations and Theses - Electrical and Computer Engineering

Dissertations and Theses in Electrical and Computer Engineering

Multiple sequence recovery: theory and applications

Levick, Kel

Permalink

Description

Owning Collections

Graduate Dissertations and Theses at Illinois PRIMARY

Dissertations and Theses - Electrical and Computer Engineering

Log In