On the study of various methods for domain generalization and inverse problems
Shen, Hongyu
Permalink
https://hdl.handle.net/2142/129169
Description
- Title
- On the study of various methods for domain generalization and inverse problems
- Author(s)
- Shen, Hongyu
- Issue Date
- 2025-03-11
- Director of Research (if dissertation) or Advisor (if thesis)
- Zhao, Zhizhen
- Doctoral Committee Chair(s)
- Zhao, Zhizhen
- Committee Member(s)
- Huerta, Eliu A
- Hasegawa-Johnson, Mark
- Do, Minh N
- Department of Study
- Electrical and Computer Engineering
- Discipline
- Electrical and Computer Engineering
- Degree Granting Institution
- University of Illinois Urbana-Champaign
- Degree Name
- Ph.D.
- Degree Level
- Dissertation
- Keyword(s)
- Feature Selection
- FDR Controllability
- Domain Generalization
- Subpopulation Shift
- Gravitational Wave Analysis
- Time Series
- Contrastive Learning
- Protein Sequence Modeling
- Abstract
- In this dissertation, several topics are discussed, organized into two primary fields of study: domain generalization (Parts II, III, and IV) and inverse problems (Parts IV and V). Part I (Chapter 1) provides a general introduction to the topics covered in this dissertation. Part II (Chapters 2 through 4) focuses on feature selection algorithms. Part III (Chapter 5) introduces a dataset bias analysis framework aimed at improving test performance in the presence of subpopulations in the training set. Part IV (Chapters 6 and 7) presents two applications based on contrastive learning: one for gravitational wave parameter estimation and the other for protein sequence modeling. Part V (Chapter 8) discusses a method for gravitational wave denoising.

Chapter 2 introduces a feature selection algorithm designed to enhance differential metabolite identification (DMI) in metabolomics. Traditional machine learning and deep learning models struggle with challenges such as data nonlinearity, high dimensionality, and varying abundance levels. To address these, we propose the Deep Feature Selector (DFS), a deep-learning-based feature selection model that improves accuracy in identifying differential metabolites. We validate the model on both simulated and real-world datasets, benchmarking it against state-of-the-art models and evaluating performance using true positive rate (TPR) and false discovery rate (FDR) metrics. Case studies on inflammatory bowel disease and Arabidopsis thaliana’s systemic acquired resistance demonstrate the model’s capability in realistic settings, showing significant improvements over benchmark models.

Chapter 3 discusses an FDR-controlled feature selection algorithm based on the Model-X knockoff framework. While Model-X knockoffs can control the FDR, existing deep-learning-based extensions face limitations, particularly in maintaining the swap property, which reduces selection power on small, non-Gaussian datasets. To address this, we develop the Deep Dependency Regularized Knockoff (DeepDRK), a distribution-free deep learning model that balances FDR control and selection power. Formulated as a learning problem under multi-source adversarial attacks, DeepDRK introduces a novel perturbation technique, achieving lower FDR and higher power than existing benchmarks. It performs especially well on synthetic, semi-synthetic, and real-world datasets with small sample sizes and complex data distributions.

Chapter 4 considers alternative FDR control frameworks: the Gaussian mirror and data splitting. It addresses a limitation of mirror statistics, a core component of data splitting methods, which enforce a unit variance constraint that reduces selection power. To overcome this, we introduce the Generalized Gaussian Mirror (G2M), which incorporates variance information into the test statistics. We demonstrate, both theoretically and empirically, that G2M achieves higher selection power than the Gaussian mirror and data splitting while maintaining FDR control across synthetic, semi-synthetic, and real datasets. These findings highlight its potential for practical applications.

Chapter 5 presents a dataset bias analysis framework to address the subpopulation problem in machine learning datasets. Standard empirical risk minimization (ERM) struggles when data contain spurious correlations or hidden attributes, leading to biased models. Existing methods focus on group-balanced or worst-group accuracy, often sacrificing average accuracy. In this chapter, we introduce importance sampling as a simple yet powerful solution, providing a new theoretical formulation of subpopulation problems. We clarify unstated assumptions in prior works, discuss the root causes of accuracy trade-offs, and propose a single estimator that works in both attribute-known and attribute-unknown scenarios. Our method achieves state-of-the-art performance on benchmark datasets.

Chapter 6 introduces deep learning models for estimating the parameters of binary black hole mergers: (m1, m2, af, ωR, ωI). Our neural networks integrate a modified WaveNet architecture, contrastive learning, and normalizing flows. We validate the models against a Gaussian conjugate prior family, confirming their statistical consistency. Applying the models to five binary black hole events (GW150914, GW170104, GW170814, GW190521, and GW190630), we compare the deep-learning-derived posteriors with Bayesian results obtained using PyCBC Inference. The models produce physically consistent results while remaining robust to varying noise statistics, demonstrating broad applicability.

Chapter 7 proposes a contrastive-learning-based fine-tuning technique for domain generalization in protein sequence models. Protein data present unique challenges, as small modifications can drastically alter function, and standard data augmentations effective in other domains do not carry over to proteins. We empirically explore string manipulations for augmenting protein sequences and fine-tune semi-supervised Transformer-based models. Through 276 comparisons on the Tasks Assessing Protein Embeddings (TAPE) benchmark, we find that contrastive learning consistently outperforms masked-token prediction, particularly with domain-specific transformations such as amino acid replacement. Surprisingly, even random sequence shuffling enhances performance on some tasks.

Chapter 8 presents a time series denoising algorithm for gravitational wave data. Traditional recurrent neural networks (RNNs) and denoising auto-encoders (DAEs) perform well in high signal-to-noise ratio (SNR) scenarios but struggle with non-Gaussian, non-stationary noise. To address this, we develop the Enhanced Deep Recurrent Denoising Auto-Encoder (EDRDAE), which integrates a signal amplifier layer and employs curriculum learning, starting with high-SNR signals and gradually introducing noisier data. EDRDAE achieves superior denoising performance, generalizing well to complex signal topologies and outperforming existing gravitational wave denoising algorithms.
- Graduation Semester
- 2025-05
- Type of Resource
- Thesis
- Handle URL
- https://hdl.handle.net/2142/129169
- Copyright and License Information
- Copyright 2025 Hongyu Shen
Owning Collections
Graduate Dissertations and Theses at Illinois PRIMARY