On the study of various methods for domain generalization and inverse problems
Shen, Hongyu
Permalink
https://hdl.handle.net/2142/129169
Description
- Title
- On the study of various methods for domain generalization and inverse problems
- Author(s)
- Shen, Hongyu
- Issue Date
- 2025-03-11
- Director of Research (if dissertation) or Advisor (if thesis)
- Zhao, Zhizhen
- Doctoral Committee Chair(s)
- Zhao, Zhizhen
- Committee Member(s)
- Huerta, Eliu A
- Hasegawa-Johnson, Mark
- Do, Minh N
- Department of Study
- Electrical and Computer Engineering
- Discipline
- Electrical and Computer Engineering
- Degree Granting Institution
- University of Illinois Urbana-Champaign
- Degree Name
- Ph.D.
- Degree Level
- Dissertation
- Keyword(s)
- Feature Selection
- FDR Controllability
- Domain Generalization
- Subpopulation Shift
- Gravitational Wave Analysis
- Time Series
- Contrastive Learning
- Protein Sequence Modeling
- Abstract
- In this dissertation, several topics are discussed, organized into two primary fields of study: domain generalization (Parts II, III, and IV) and inverse problems (Parts IV and V). Part I (Chapter 1) provides a general introduction to the topics covered in this dissertation. Part II (Chapters 2 through 4) focuses on feature selection algorithms. Part III (Chapter 5) introduces a dataset bias analysis framework aimed at improving test performance in the presence of subpopulations in the training set. Part IV (Chapters 6 and 7) presents two applications based on contrastive learning: one for gravitational wave parameter estimation and the other for protein sequence modeling. Part V (Chapter 8) discusses a method for gravitational wave denoising.

Chapter 2 introduces a feature selection algorithm designed to enhance differential metabolite identification (DMI) in metabolomics. Traditional machine learning and deep learning models struggle with challenges such as data nonlinearity, high dimensionality, and varying abundance levels. To address these, we propose the Deep Feature Selector (DFS), a deep-learning-based feature selection model that improves accuracy in identifying differential metabolites. We validate the model on both simulated and real-world datasets, benchmarking it against state-of-the-art models and evaluating performance using true positive rate (TPR) and false discovery rate (FDR) metrics. Case studies on inflammatory bowel disease and Arabidopsis thaliana’s systemic acquired resistance demonstrate the model’s capability in realistic settings, showing significant improvements over benchmark models.

Chapter 3 discusses an FDR-controlled feature selection algorithm based on the Model-X knockoff framework. While Model-X knockoffs can control the FDR, existing deep-learning-based extensions face limitations, particularly in maintaining the swap property, which reduces selection power on small, non-Gaussian datasets. To address this, we develop the Deep Dependency Regularized Knockoff (DeepDRK), a distribution-free deep learning model that balances FDR control and selection power. Formulated as a learning problem under multi-source adversarial attacks, DeepDRK introduces a novel perturbation technique, achieving lower FDR and higher power than existing benchmarks. It performs especially well on synthetic, semi-synthetic, and real-world datasets with small sample sizes and complex data distributions.

Chapter 4 considers alternative FDR control frameworks: the Gaussian mirror and data splitting. It addresses a limitation of mirror statistics, a core component of data splitting methods, which enforce a unit variance constraint that reduces selection power. To overcome this, we introduce the Generalized Gaussian Mirror (G2M), which incorporates variance information into the test statistics. We demonstrate, both theoretically and empirically, that G2M achieves higher selection power than the Gaussian mirror and data splitting while maintaining FDR control across synthetic, semi-synthetic, and real datasets. These findings highlight its potential for practical applications.

Chapter 5 presents a dataset bias analysis framework to address the subpopulation problem in machine learning datasets. Standard empirical risk minimization (ERM) struggles when data contain spurious correlations or hidden attributes, leading to biased models. Existing methods focus on group-balanced or worst-group accuracy, often sacrificing average accuracy. In this chapter, we introduce importance sampling as a simple yet powerful solution, providing a new theoretical formulation of subpopulation problems. We clarify unstated assumptions in prior works, discuss the root causes of accuracy trade-offs, and propose a single estimator that works in both attribute-known and attribute-unknown scenarios. Our method achieves state-of-the-art performance on benchmark datasets.

Chapter 6 introduces deep learning models for estimating the parameters of binary black hole mergers: (m1, m2, af, ωR, ωI). Our neural networks integrate a modified WaveNet architecture, contrastive learning, and normalizing flows. We validate the models against a Gaussian conjugate prior family, confirming their statistical consistency. Applying the models to five binary black hole events (GW150914, GW170104, GW170814, GW190521, and GW190630), we compare the deep-learning-derived posteriors with Bayesian results obtained using PyCBC Inference. The models produce physically consistent results while remaining robust to varying noise statistics, demonstrating broad applicability.

Chapter 7 proposes a contrastive-learning-based fine-tuning technique for domain generalization in protein sequence models. Protein data present unique challenges, as small modifications can drastically alter function, and standard data augmentations effective in other domains do not carry over to proteins. We empirically explore string manipulations for augmenting protein sequences and fine-tune semi-supervised Transformer-based models. Through 276 comparisons on the Tasks Assessing Protein Embeddings (TAPE) benchmark, we find that contrastive learning consistently outperforms masked-token prediction, particularly with domain-specific transformations such as amino acid replacement. Surprisingly, even random sequence shuffling enhances performance on some tasks.

Chapter 8 presents a time series denoising algorithm for gravitational wave data. Traditional recurrent neural networks (RNNs) and denoising auto-encoders (DAEs) perform well in high signal-to-noise ratio (SNR) scenarios but struggle with non-Gaussian, non-stationary noise. To address this, we develop the Enhanced Deep Recurrent Denoising Auto-Encoder (EDRDAE), which integrates a signal amplifier layer and employs curriculum learning, starting with high-SNR signals and gradually introducing noisier data. EDRDAE achieves superior denoising performance, generalizing well to complex signal topologies and outperforming existing gravitational wave denoising algorithms.
- Graduation Semester
- 2025-05
- Type of Resource
- Thesis
- Handle URL
- https://hdl.handle.net/2142/129169
- Copyright and License Information
- Copyright 2025 Hongyu Shen
Owning Collections
Graduate Dissertations and Theses at Illinois PRIMARY