Audiovisual processing for generation and enhancement
Fan, Xulin
Permalink
https://hdl.handle.net/2142/127369
Description
Title
Audiovisual processing for generation and enhancement
Author(s)
Fan, Xulin
Issue Date
2024-11-27
Director of Research (if dissertation) or Advisor (if thesis)
Hasegawa-Johnson, Mark Allan
Department of Study
Electrical and Computer Engineering
Discipline
Electrical and Computer Engineering
Degree Granting Institution
University of Illinois at Urbana-Champaign
Degree Name
M.S.
Degree Level
Thesis
Keyword(s)
Multimodal Signal Processing
Speech Processing
Speech Enhancement
Signal Processing
Abstract
With advances in deep neural networks and the increasing availability of computational power, machine learning researchers working on unimodal tasks, such as text, audio, or vision, have begun to explore methods that address multimodal problems, which involve inputs from multiple modalities. Each modality carries distinct types of information represented in different formats. For example, audiovisual tasks typically involve a video with temporally aligned audio and visual streams, where the visual channel is essentially a sequence of images represented as a 4D tensor (number of frames, RGB channels, height, width), while the audio channel is a 1D signal sampled at a much higher rate. Popular multimodal neural architectures generally consist of three stages: modality-specific encoders, modality fusion, and one or more task-specific decoders. To handle inputs of varying formats, a common design choice is to employ modality-specific encoders that project each modality into a learnable embedding space. This embedding space is structured to encourage cross-modal similarity and ease the subsequent fusion step. The modality fusion step, however, is tailored to the requirements of the specific downstream task. For instance, an audiovisual task that outputs an audio signal, such as audiovisual speech enhancement, may require high temporal resolution, whereas tasks that produce textual outputs, such as audiovisual automatic speech recognition, may prioritize different fusion strategies. In this thesis, we explore various design choices for two audiovisual tasks: audio-driven talking head synthesis and audiovisual target speaker extraction. Through extensive experimentation, we identify key considerations for adapting transformer-based and diffusion-based methods to multimodal scenarios.
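The abstract's three-stage design (modality-specific encoders, modality fusion, task-specific decoder) can be illustrated with a minimal PyTorch sketch. This is not the thesis code: all module names, dimensions, the strided-convolution encoders, and the choice of cross-attention for fusion are assumptions made for illustration only, using standard PyTorch components.

```python
# Minimal sketch (illustrative, not the thesis implementation) of the
# encoder -> fusion -> decoder pattern described in the abstract.
import torch
import torch.nn as nn


class AudioEncoder(nn.Module):
    """Encode a 1D waveform (B, samples) into frame-level embeddings (B, T_a, D)."""
    def __init__(self, dim=256):
        super().__init__()
        # Strided 1D convolutions downsample the high-rate audio signal.
        self.conv = nn.Sequential(
            nn.Conv1d(1, dim, kernel_size=400, stride=160),  # ~10 ms hop at 16 kHz (assumed rate)
            nn.ReLU(),
            nn.Conv1d(dim, dim, kernel_size=3, stride=2, padding=1),
            nn.ReLU(),
        )

    def forward(self, wav):
        x = self.conv(wav.unsqueeze(1))      # (B, D, T_a)
        return x.transpose(1, 2)             # (B, T_a, D)


class VideoEncoder(nn.Module):
    """Encode an RGB frame sequence (B, T_v, 3, H, W) into embeddings (B, T_v, D)."""
    def __init__(self, dim=256):
        super().__init__()
        self.backbone = nn.Sequential(
            nn.Conv2d(3, 64, kernel_size=7, stride=2, padding=3),
            nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),
        )
        self.proj = nn.Linear(64, dim)       # project into the shared embedding space

    def forward(self, frames):
        b, t, c, h, w = frames.shape
        x = self.backbone(frames.reshape(b * t, c, h, w)).flatten(1)  # (B*T_v, 64)
        return self.proj(x).reshape(b, t, -1)                         # (B, T_v, D)


class CrossModalFusion(nn.Module):
    """Fuse audio and video tokens with cross-attention (one possible fusion choice)."""
    def __init__(self, dim=256, heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, audio_tokens, video_tokens):
        # Audio queries attend to video keys/values, preserving audio's finer
        # temporal resolution -- relevant when the output is an audio signal.
        fused, _ = self.attn(audio_tokens, video_tokens, video_tokens)
        return fused + audio_tokens


class AudioVisualModel(nn.Module):
    def __init__(self, dim=256, out_dim=257):
        super().__init__()
        self.audio_enc = AudioEncoder(dim)
        self.video_enc = VideoEncoder(dim)
        self.fusion = CrossModalFusion(dim)
        self.decoder = nn.Linear(dim, out_dim)  # task-specific head, e.g. an enhancement mask

    def forward(self, wav, frames):
        a = self.audio_enc(wav)
        v = self.video_enc(frames)
        return self.decoder(self.fusion(a, v))


if __name__ == "__main__":
    wav = torch.randn(2, 16000)              # 1 s of 16 kHz audio (hypothetical shapes)
    frames = torch.randn(2, 25, 3, 96, 96)   # 1 s of 25 fps RGB face crops
    out = AudioVisualModel()(wav, frames)
    print(out.shape)                         # (2, T_a, out_dim)
```

The sketch keeps the fused sequence at the audio frame rate, reflecting the abstract's point that tasks producing an audio output, such as audiovisual speech enhancement or target speaker extraction, favor fusion strategies that retain high temporal resolution; a text-output task like audiovisual speech recognition could instead pool or downsample before fusion.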