Audiovisual processing for generation and enhancement
Fan, Xulin
Permalink
https://hdl.handle.net/2142/127369
Description
Title
Audiovisual processing for generation and enhancement
Author(s)
Fan, Xulin
Issue Date
2024-11-27
Director of Research (if dissertation) or Advisor (if thesis)
Hasegawa-Johnson, Mark Allan
Department of Study
Electrical and Computer Engineering
Discipline
Electrical and Computer Engineering
Degree Granting Institution
University of Illinois at Urbana-Champaign
Degree Name
M.S.
Degree Level
Thesis
Keyword(s)
Multimodal Signal Processing
Speech Processing
Speech Enhancement
Signal Processing
Abstract
With advances in deep neural networks and the increasing availability of computational power, machine learning researchers working on unimodal tasks, such as text, audio, or vision, have begun to explore methods that address multimodal problems, which involve inputs from multiple modalities. Each modality carries distinct types of information represented in different formats. For example, audiovisual tasks typically involve a video with temporally aligned audio and visual streams, where the visual channel is essentially a sequence of images represented as a 4D tensor (number of frames, RGB channels, height, width), while the audio channel is a 1D signal sampled at a much higher rate. Popular multimodal neural architectures generally consist of three stages: modality-specific encoders, modality fusion, and one or more task-specific decoders. To handle inputs of varying formats, a common design choice is to employ modality-specific encoders that project each modality into a learnable embedding space. This embedding space is structured to encourage cross-modal similarity and ease the subsequent fusion step. The modality fusion step, however, is tailored to the requirements of the specific downstream task. For instance, an audiovisual task that outputs an audio signal, such as audiovisual speech enhancement, may require high temporal resolution, whereas tasks that produce textual outputs, such as audiovisual automatic speech recognition, may prioritize different fusion strategies. In this thesis, we explore various design choices for two audiovisual tasks: audio-driven talking head synthesis and audiovisual target speaker extraction. Through extensive experimentation, we identify key considerations for adapting transformer-based and diffusion-based methods to multimodal scenarios.
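The abstract's three-stage design (modality-specific encoders, modality fusion, task-specific decoder) can be illustrated with a minimal PyTorch sketch. This is not the thesis code: all module names, dimensions, the strided-convolution encoders, and the choice of cross-attention for fusion are assumptions made for illustration only, using standard PyTorch components.

```python
# Minimal sketch (illustrative, not the thesis implementation) of the
# encoder -> fusion -> decoder pattern described in the abstract.
import torch
import torch.nn as nn


class AudioEncoder(nn.Module):
    """Encode a 1D waveform (B, samples) into frame-level embeddings (B, T_a, D)."""
    def __init__(self, dim=256):
        super().__init__()
        # Strided 1D convolutions downsample the high-rate audio signal.
        self.conv = nn.Sequential(
            nn.Conv1d(1, dim, kernel_size=400, stride=160),  # ~10 ms hop at 16 kHz (assumed rate)
            nn.ReLU(),
            nn.Conv1d(dim, dim, kernel_size=3, stride=2, padding=1),
            nn.ReLU(),
        )

    def forward(self, wav):
        x = self.conv(wav.unsqueeze(1))      # (B, D, T_a)
        return x.transpose(1, 2)             # (B, T_a, D)


class VideoEncoder(nn.Module):
    """Encode an RGB frame sequence (B, T_v, 3, H, W) into embeddings (B, T_v, D)."""
    def __init__(self, dim=256):
        super().__init__()
        self.backbone = nn.Sequential(
            nn.Conv2d(3, 64, kernel_size=7, stride=2, padding=3),
            nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),
        )
        self.proj = nn.Linear(64, dim)       # project into the shared embedding space

    def forward(self, frames):
        b, t, c, h, w = frames.shape
        x = self.backbone(frames.reshape(b * t, c, h, w)).flatten(1)  # (B*T_v, 64)
        return self.proj(x).reshape(b, t, -1)                         # (B, T_v, D)


class CrossModalFusion(nn.Module):
    """Fuse audio and video tokens with cross-attention (one possible fusion choice)."""
    def __init__(self, dim=256, heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, audio_tokens, video_tokens):
        # Audio queries attend to video keys/values, preserving audio's finer
        # temporal resolution -- relevant when the output is an audio signal.
        fused, _ = self.attn(audio_tokens, video_tokens, video_tokens)
        return fused + audio_tokens


class AudioVisualModel(nn.Module):
    def __init__(self, dim=256, out_dim=257):
        super().__init__()
        self.audio_enc = AudioEncoder(dim)
        self.video_enc = VideoEncoder(dim)
        self.fusion = CrossModalFusion(dim)
        self.decoder = nn.Linear(dim, out_dim)  # task-specific head, e.g. an enhancement mask

    def forward(self, wav, frames):
        a = self.audio_enc(wav)
        v = self.video_enc(frames)
        return self.decoder(self.fusion(a, v))


if __name__ == "__main__":
    wav = torch.randn(2, 16000)              # 1 s of 16 kHz audio (hypothetical shapes)
    frames = torch.randn(2, 25, 3, 96, 96)   # 1 s of 25 fps RGB face crops
    out = AudioVisualModel()(wav, frames)
    print(out.shape)                         # (2, T_a, out_dim)
```

The sketch keeps the fused sequence at the audio frame rate, reflecting the abstract's point that tasks producing an audio output, such as audiovisual speech enhancement or target speaker extraction, favor fusion strategies that retain high temporal resolution; a text-output task like audiovisual speech recognition could instead pool or downsample before fusion.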