Permalink
https://hdl.handle.net/2142/127404
Description
Title
Multi-modal learning for image and beyond
Author(s)
Jiang, Qian
Issue Date
2024-12-06
Director of Research (if dissertation) or Advisor (if thesis)
Do, Minh N.
Doctoral Committee Chair(s)
Do, Minh N.
Committee Member(s)
Chen, Deming
Schwing, Alexander
Zhao, Han
Yeh, Raymond A.
Department of Study
Electrical & Computer Engineering
Discipline
Electrical & Computer Engineering
Degree Granting Institution
University of Illinois at Urbana-Champaign
Degree Name
Ph.D.
Degree Level
Dissertation
Keyword(s)
Machine Learning
Multimodal
Abstract
The rapid advancement of technology has propelled the field of machine learning into new frontiers, with a particular emphasis on multi-modal learning approaches. This thesis explores the integration of multiple modalities with visual data, investigating three distinct yet complementary directions: text integration for enhanced understanding, hardware optimization for efficient deployment, and medical image representation learning for robust diagnostics.
The study begins with the integration of the image and text modalities. We first prove that exact modality alignment is, in general, sub-optimal for downstream prediction tasks. We therefore propose three general approaches for constructing latent modality structures. We test our model on a variety of tasks, including zero/few-shot image classification, image-text retrieval, visual question answering, visual reasoning, and visual entailment. Our method achieves consistent improvements over existing methods, demonstrating its effectiveness and generalizability.
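A minimal sketch of the general idea, not the thesis's exact method: a standard CLIP-style contrastive alignment loss augmented with a hypothetical intra-modal regularizer that preserves the geometric structure within each modality rather than forcing image and text embeddings to coincide exactly. The regularizer form and the weight reg_weight are illustrative assumptions.

import torch
import torch.nn.functional as F

def contrastive_with_structure_reg(img_emb, txt_emb, temperature=0.07, reg_weight=0.1):
    # img_emb, txt_emb: (batch, dim) L2-normalized embeddings of paired images and captions.
    logits = img_emb @ txt_emb.t() / temperature               # cross-modal similarity matrix
    targets = torch.arange(img_emb.size(0), device=img_emb.device)
    # Symmetric InfoNCE loss pulling matched image-text pairs together.
    align_loss = 0.5 * (F.cross_entropy(logits, targets) +
                        F.cross_entropy(logits.t(), targets))
    # Hypothetical structure term: keep pairwise geometry consistent across modalities
    # instead of demanding pointwise identity of the two embedding spaces.
    img_sim = img_emb @ img_emb.t()
    txt_sim = txt_emb @ txt_emb.t()
    structure_loss = F.mse_loss(img_sim, txt_sim)
    return align_loss + reg_weight * structure_loss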
We then extend the focus to the image and hardware modalities. We propose End-to-end Hardware-aware Differentiable Neural Architecture Search (EH-DNAS), a seamless integration of end-to-end hardware benchmarking and fully automated DNAS that delivers hardware-efficient deep neural networks on platforms ranging from mobile devices to dedicated AI accelerators. Experiments on CIFAR-10 and ImageNet show that EH-DNAS improves hardware performance by an average of $1.5\times$ on customized accelerators and existing hardware processors over state-of-the-art hardware-efficient networks while maintaining classification accuracy.
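A minimal sketch, not EH-DNAS itself, of how a differentiable hardware-cost term can be folded into the task loss so that architecture parameters are trained end to end for both accuracy and latency. The per-operation latency table and the weight lambda_hw are illustrative placeholders, not values from the thesis.

import torch

def hardware_aware_loss(task_loss, arch_logits, op_latency, lambda_hw=0.1):
    # arch_logits: (num_edges, num_ops) architecture parameters of the super-network.
    # op_latency:  (num_ops,) benchmarked latency of each candidate operation on the target platform.
    op_probs = torch.softmax(arch_logits, dim=-1)         # relaxed (differentiable) operation selection
    expected_latency = (op_probs * op_latency).sum()      # expected latency of the searched network
    return task_loss + lambda_hw * expected_latency       # trade off accuracy against hardware cost

# Example: two edges, three candidate ops with different benchmarked latencies.
arch_logits = torch.zeros(2, 3, requires_grad=True)
op_latency = torch.tensor([1.0, 2.5, 0.5])
loss = hardware_aware_loss(torch.tensor(0.9), arch_logits, op_latency)
loss.backward()                                           # gradients flow into the architecture parameters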
Finally, we address the challenges of multimodal learning in medical image representation learning, focusing on B-mode and M-mode Optical Coherence Tomography (OCT) images. We propose a novel triplet-based learning framework specifically designed for medical image applications with label noise. Our approach demonstrates superior performance on OCT disease classification, reaching up to 98.44\% accuracy on M-mode and 94.12\% on B-mode images, and enables effective multimodal learning under imperfect alignment.
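For reference, a minimal sketch of the standard triplet margin loss that triplet-based representation learning builds on; the noise-handling and OCT-specific components of the proposed framework are not reproduced here, and the margin value is an assumption.

import torch
import torch.nn.functional as F

def triplet_margin_loss(anchor, positive, negative, margin=0.2):
    # anchor/positive/negative: (batch, dim) embeddings; the positive shares the
    # anchor's class label, the negative does not.
    d_pos = F.pairwise_distance(anchor, positive)     # distance to a same-class sample
    d_neg = F.pairwise_distance(anchor, negative)     # distance to a different-class sample
    return F.relu(d_pos - d_neg + margin).mean()      # hinge: positives must be closer by at least `margin`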
Collectively, this work advances multimodal learning across three dimensions: theoretical insights into modality alignment and methods for constructing modality structures for image-text tasks, practical solutions for hardware-aware deployment, and robust frameworks for medical applications. Our comprehensive study, spanning theoretical foundations to real-world implementations, provides both fundamental understanding and practical solutions for the evolving challenges in multimodal learning.