Computational methods for genomic and biomedical data mining
Rana, Vishal
Permalink
https://hdl.handle.net/2142/129209
Description
- Title
- Computational methods for genomic and biomedical data mining
- Author(s)
- Rana, Vishal
- Issue Date
- 2025-04-18
- Director of Research (if dissertation) or Advisor (if thesis)
- Milenkovic, Olgica
- Doctoral Committee Chair(s)
- Milenkovic, Olgica
- Committee Member(s)
- Maslov, Sergei
- El-Kebir, Mohammed
- Shomorony, Ilan
- Department of Study
- Electrical & Computer Eng
- Discipline
- Electrical & Computer Engr
- Degree Granting Institution
- University of Illinois Urbana-Champaign
- Degree Name
- Ph.D.
- Degree Level
- Dissertation
- Keyword(s)
- Genomics
- Biomedical Data Mining
- Machine Learning
- Graph Neural Networks
- Large Language Models
- Methylation
- Epigenetics
- Group Testing
- Language
- eng
- Abstract
- The rapid growth of high-throughput sequencing technologies and biomedical data generation has necessitated the development of advanced computational methods to extract meaningful insights from complex biological systems. From understanding gene regulation through multi-omics integration to improving disease detection and biomarker discovery, computational approaches spanning machine learning, graph representation learning, and algorithmic optimization are transforming biomedical research. This thesis explores three computational methodologies addressing distinct yet interconnected challenges in biomedical data analysis. First, it introduces novel graph neural network (GNN) models that integrate biological knowledge graphs with large language model (LLM)-derived features to enhance link prediction tasks. Second, it presents a framework for analyzing local DNA methylation patterns at CpG sites, demonstrating their critical role in gene expression regulation and cancer heterogeneity. Finally, it proposes a semi-quantitative group testing (SQGT) method for optimizing pathogen detection efficiency using qPCR data. Collectively, these projects highlight the power of computational strategies in unraveling complex biological patterns, improving disease modeling, and optimizing healthcare interventions.
Knowledge graphs (KGs) effectively represent complex relationships between biological entities and, in conjunction with graph neural networks (GNNs), have been successfully used for various link prediction tasks. However, biological knowledge graphs are inherently heterogeneous and usually either do not include highly informative node features or use some type of node features to inform the graph topology. In the latter case, adding the node features back to the graph model does not improve the performance of the model, due to strong attribute-topology correlations. Emerging biomedical large language models (LLMs) can be used to extract numerical node features that complement the graph topology and the original features that informed them, but it remains an open problem to determine how to find LLM features that lead to large additional link prediction performance improvements. We propose several new GNN methods for combining LLM features with existing graph topologies that use global and local graph denoising and rewiring protocols, specialized for multipartite heterogeneous graphs. Our methods evaluate how the graph topology aligns with LLM-generated features, and then only add features that best complement the existing topology. We test the proposed approaches on unattributed BioSNAP, PubChem and UniProt graphs and show that they lead to an average 3.5% improvement in link prediction accuracy.
Advancements in sequencing technologies have facilitated the generation of large-scale multi-omics datasets, enabling a deeper understanding of cancer onset and progression. Among these, DNA methylation plays a crucial role in gene regulation and tumorigenesis. While conventional analyses primarily focus on global promoter methylation states, emerging evidence suggests that the methylation status of individual CpG sites significantly influences gene expression. In this study, we introduce a framework that defines local methylation patterns (binary representations of CpG methylation states within gene promoters) to investigate their impact on transcriptional regulation across various cancer types. By analyzing patient samples from The Cancer Genome Atlas (TCGA), we identify distinct methylation patterns that share the same global methylation status yet are associated with differentially expressed genes. Notably, in lung adenocarcinoma (LUAD), distinct ATM promoter methylation patterns correlate with differential gene expression profiles linked to DNA damage response pathways. Our findings highlight the necessity of moving beyond binary methylation classification to a more granular assessment of CpG-specific modifications. This approach provides new insights into tumor heterogeneity and has the potential to improve biomarker discovery, prognosis, and personalized treatment strategies.
Pathogenic infections pose a significant threat to global health, affecting millions of people every year and presenting substantial challenges to healthcare systems worldwide. Efficient and timely testing plays a critical role in disease control and transmission prevention. Group testing is a well-established method for reducing the number of tests needed to screen large populations when the disease prevalence is low. However, it does not fully utilize the quantitative information provided by qPCR methods, nor is it able to accommodate a wide range of pathogen loads. To address these issues, we introduce a novel adaptive semi-quantitative group testing (SQGT) scheme to efficiently screen populations via two-stage qPCR testing. The SQGT method quantizes cycle threshold (Ct) values into multiple bins, leveraging the information from the first stage of screening to improve the detection sensitivity. Dynamic Ct threshold adjustments mitigate dilution effects and enhance test accuracy. Comparisons with traditional binary outcome GT methods show that SQGT reduces the number of tests by 24% while maintaining a negligible false negative rate.
- Graduation Semester
- 2025-05
- Type of Resource
- Thesis
- Handle URL
- https://hdl.handle.net/2142/129209
- Copyright and License Information
- Copyright 2025 Vishal Rana
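The local methylation patterns mentioned in the abstract, binary representations of CpG methylation states within a gene promoter, can be illustrated with a small Python sketch. This is not the dissertation's implementation: the 0.5 beta-value cutoff, the majority-vote definition of global status, and the names `local_pattern`, `global_status`, and `group_by_pattern` are all assumptions made for illustration.

```python
from collections import defaultdict

BETA_CUTOFF = 0.5  # assumed threshold for calling a CpG site methylated


def local_pattern(betas):
    """Binarize per-CpG beta values into a pattern string such as '101'."""
    return "".join("1" if b >= BETA_CUTOFF else "0" for b in betas)


def global_status(pattern):
    """Assumed global call: methylated if more than half the sites are methylated."""
    return "methylated" if pattern.count("1") * 2 > len(pattern) else "unmethylated"


def group_by_pattern(samples):
    """Group sample IDs by (global status, local pattern).

    `samples` maps a sample ID to the beta values of one promoter's CpG sites.
    """
    groups = defaultdict(list)
    for sample_id, betas in samples.items():
        pattern = local_pattern(betas)
        groups[(global_status(pattern), pattern)].append(sample_id)
    return dict(groups)


# Toy data (made up): three samples, three CpG sites in one promoter.
samples = {
    "s1": [0.9, 0.8, 0.1],
    "s2": [0.9, 0.2, 0.7],
    "s3": [0.8, 0.9, 0.2],
}
groups = group_by_pattern(samples)
```

In this toy data all three samples receive the same global call but fall into two different local patterns, which is the kind of distinction the abstract argues a global promoter-level classification would miss.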
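The abstract's description of SQGT, quantizing cycle threshold (Ct) values into multiple bins and using the first-stage result to guide the second stage, can be sketched roughly as follows. This is a hedged illustration only: the bin edges, the labels, the function names `quantize_ct` and `second_stage_plan`, and the retesting rule are hypothetical and are not taken from the dissertation.

```python
from bisect import bisect_right

# Hypothetical Ct bin edges (in cycles); a lower Ct means a higher pathogen load.
# These values are illustrative assumptions, not the dissertation's thresholds.
CT_BIN_EDGES = [25.0, 32.0, 38.0]
BIN_LABELS = ["high", "medium", "low", "negative"]


def quantize_ct(ct):
    """Map a pooled-sample Ct value to a semi-quantitative bin label.

    A Ct of None means no amplification was observed.
    """
    if ct is None:
        return "negative"
    return BIN_LABELS[bisect_right(CT_BIN_EDGES, ct)]


def second_stage_plan(pool, ct):
    """Return the list of tests to run next, given the pool's first-stage Ct.

    Assumed rule (made up for this sketch): negative pools are cleared,
    low-load pools are split into two sub-pools, and medium- or high-load
    pools are tested sample by sample.
    """
    label = quantize_ct(ct)
    if label == "negative":
        return []
    if label == "low" and len(pool) > 1:
        mid = len(pool) // 2
        return [pool[:mid], pool[mid:]]
    return [[sample] for sample in pool]


plan = second_stage_plan(["a", "b", "c", "d"], 35.0)
```

A binary-outcome scheme would treat every positive pool the same way; here the bin label lets the follow-up differ by estimated load, which is the general idea the abstract attributes to using quantitative qPCR information.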
Owning Collections
Graduate Dissertations and Theses at Illinois PRIMARY
Graduate Theses and Dissertations at Illinois
Dissertations and Theses - Electrical and Computer Engineering
Dissertations and Theses in Electrical and Computer Engineering