Harnessing AI/ML for advancing synthetic biology
Yu, Tianhao
This item's files can only be accessed by the System Administrators group.
Permalink
https://hdl.handle.net/2142/129651
Description
- Title
- Harnessing AI/ML for advancing synthetic biology
- Author(s)
- Yu, Tianhao
- Issue Date
- 2025-04-30
- Director of Research (if dissertation) or Advisor (if thesis)
- Zhao, Huimin
- Doctoral Committee Chair(s)
- Zhao, Huimin
- Committee Member(s)
- Rao, Christopher V.
- Shukla, Diwakar
- Liu, Ge
- Department of Study
- Chemical & Biomolecular Engr
- Discipline
- Chemical Engineering
- Degree Granting Institution
- University of Illinois Urbana-Champaign
- Degree Name
- Ph.D.
- Degree Level
- Dissertation
- Keyword(s)
- Machine learning
- synthetic biology
- protein engineering
- language model
- Abstract
- The intersection of synthetic biology and artificial intelligence has created unique opportunities for understanding and engineering biological systems. This dissertation presents a suite of machine learning (ML) approaches tailored for protein and small molecule discovery, with an emphasis on integrating biological data and experimental validation. Spanning enzyme annotation, protein engineering, antimicrobial compound discovery, and multimodal representation learning, the research showcases how modern ML/AI methods can overcome key bottlenecks in synthetic biology and accelerate the design–build–test–learn cycle.
Chapter 1 introduces the foundational concepts of synthetic biology, protein engineering, and ML. It highlights the recent proliferation of protein language models and their ability to extract biologically meaningful patterns from sequence data. The chapter also outlines the four major research threads explored in the thesis: enzyme function prediction using contrastive learning, ML-guided protein engineering, large language model-driven antimicrobial molecule discovery, and multimodal representation learning for enzyme annotation.
Chapter 2 describes the development of CLEAN (Contrastive Learning-enabled Enzyme Annotation), a deep learning framework for annotating enzyme functions based solely on amino acid sequences. CLEAN uses contrastive learning to create EC number-aware protein representations by training on functionally similar and dissimilar enzyme pairs. The model significantly outperforms traditional similarity-based annotation tools and other ML-based tools on benchmark datasets, especially in retrieving understudied or misannotated enzymes. CLEAN also demonstrated practical utility through the experimental validation of novel halogenase candidates, uncovering a promiscuous enzyme with three catalytic activities.
Chapter 3 presents a generalized ML framework for protein engineering that integrates zero-shot language model predictions with supervised learning in a closed-loop optimization pipeline. The chapter also introduces ECNet, an evolutionary context-integrated neural network trained to predict variant fitness using homologous sequence data. It further describes a fully autonomous protein engineering platform that combines ML as the decision maker with robotics as the experimentalist. This system was used to engineer two industrially relevant enzymes: AtHMT, to enhance ethyltransferase activity for S-adenosylmethionine analog synthesis, and YmPhytase, to broaden pH tolerance for animal feed applications. Both enzymes underwent four rounds of prediction, testing, and re-training, exemplifying a robust ML-guided directed evolution loop. Finally, the chapter introduces a GPT-based user interface for protein variant design, lowering the barrier for non-expert users.
Chapter 4 shifts focus to small molecules, detailing an LLM-driven framework for predicting the antimicrobial activity of compounds. Using a SMILES-based transformer model pretrained on PubChem, the study built a regression pipeline to predict the minimum inhibitory concentration (MIC) of molecules against 12 Gram-negative bacterial species. Applied to compound libraries such as ZINC, the pipeline led to the identification of Diamiquincin (DAQ), a novel compound with potent activity against Acinetobacter baumannii (MIC = 2 µg/mL). This chapter illustrates the effectiveness of predictive LLMs in streamlining antibiotic discovery.
Chapter 5 explores multimodal contrastive learning to improve enzyme function annotation beyond primary sequence data. Recognizing that sequence-based function annotation can be augmented with additional modalities (e.g., literature annotations), the chapter develops a dual-encoder architecture inspired by CLIP. Protein sequences encoded by ESM-2 and textual descriptions encoded by BioGPT are aligned in a shared latent space using InfoNCE-style loss functions. The model outperformed unimodal baselines in EC number prediction on a challenging independent dataset. Additionally, the "sequence-first" loss configuration, which balances alignment against model simplicity, proved more effective than a fully mixed loss. These findings highlight the power of weak supervision from unstructured text to enhance protein representations, especially in applications where experimental labels are limited.
Across all chapters, the thesis emphasizes co-design between computational models and experimental workflows. Each ML method is tightly coupled with wet-lab validation, establishing trust in predictions and enabling iterative improvement. Thematically, the work advances three core capabilities in synthetic biology: (1) understanding protein function at scale, (2) engineering protein function efficiently, and (3) discovering therapeutic compounds in a data-driven manner. Moreover, the models developed in this dissertation are built for generalizability and extensibility, setting the stage for future integration into broader synthetic biology toolkits.
In summary, this dissertation demonstrates the potential of AI and ML to accelerate discovery and design in synthetic biology. By combining contrastive learning, large language models, and evolutionary insights, the research addresses long-standing challenges in enzyme annotation, protein engineering, and antibiotic discovery. The approaches outlined here exemplify a new generation of data-driven biology, where machine learning not only interprets biological data but also shapes future experimentation.
- Graduation Semester
- 2025-05
- Type of Resource
- Thesis
- Handle URL
- https://hdl.handle.net/2142/129651
- Copyright and License Information
- Copyright 2025 Tianhao Yu
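As an illustration of the InfoNCE-style alignment mentioned in the abstract's Chapter 5 summary, the following is a minimal sketch of a generic, symmetric CLIP-style contrastive loss. It is not taken from the dissertation: the function names, the temperature value, and the equal weighting of the two directions are assumptions, and the inputs are assumed to be sequence and text embeddings that have already been projected to a shared dimension. The "sequence-first" configuration named in the abstract is not reproduced here.

```python
import numpy as np


def l2_normalize(x, eps=1e-12):
    """Scale each row to unit length so dot products are cosine similarities."""
    norms = np.linalg.norm(x, axis=1, keepdims=True)
    return x / np.maximum(norms, eps)


def log_softmax(logits):
    """Row-wise log-softmax, shifted by the row maximum for numerical stability."""
    shifted = logits - logits.max(axis=1, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))


def symmetric_info_nce(seq_emb, text_emb, temperature=0.07):
    """Generic symmetric InfoNCE loss over a batch of paired embeddings.

    Row i of seq_emb and row i of text_emb are treated as a positive pair;
    every other row in the batch serves as a negative.
    The temperature of 0.07 is an illustrative default, not a reported value.
    """
    s = l2_normalize(np.asarray(seq_emb, dtype=float))
    t = l2_normalize(np.asarray(text_emb, dtype=float))
    logits = s @ t.T / temperature          # (batch, batch) similarity matrix
    idx = np.arange(logits.shape[0])        # positives lie on the diagonal
    loss_seq_to_text = -log_softmax(logits)[idx, idx].mean()
    loss_text_to_seq = -log_softmax(logits.T)[idx, idx].mean()
    return 0.5 * (loss_seq_to_text + loss_text_to_seq)


# Toy usage with random stand-ins for projected embeddings (hypothetical data).
rng = np.random.default_rng(0)
seq = rng.normal(size=(8, 16))
text = seq + 0.01 * rng.normal(size=(8, 16))   # nearly aligned pairs
loss_aligned = symmetric_info_nce(seq, text)
loss_mismatched = symmetric_info_nce(seq, text[::-1])  # every pair broken
```

In this toy setup the correctly paired batch yields a lower loss than the batch with reversed pairing, which is the behavior a contrastive alignment objective is meant to reward.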
Owning Collections
Graduate Dissertations and Theses at Illinois PRIMARY
Graduate Theses and Dissertations at Illinois