Files in this item

File: QIAN-DISSERTATION-2020.pdf (4MB), Restricted to U of Illinois
Description: (no description provided)
Format: application/pdf

Description

Title: Deep generative models for speech editing
Author(s): Qian, Kaizhi
Director of Research: Hasegawa-Johnson, Mark A
Doctoral Committee Chair(s): Hasegawa-Johnson, Mark A
Doctoral Committee Member(s): Levinson, Steven E; Varshney, Lav R; Chang, Shiyu
Department / Program: Electrical & Computer Eng
Discipline: Electrical & Computer Engr
Degree Granting Institution: University of Illinois at Urbana-Champaign
Degree: Ph.D.
Genre: Dissertation
Subject(s): Generative model; Speech enhancement; Speech disentanglement
Abstract: Generative models are useful for generating and modifying natural-sounding speech in speech processing tasks such as speech synthesis, speech enhancement, and voice conversion. They can improve naturalness in two ways: first, by regularizing the speech editing process through defining the sample space of natural speech; and second, by permitting separable modification of the components of a hierarchical speech generative model, so that specified components of natural speech can be modified independently. Four research projects are introduced: the first two use WaveNet as a clean-speech generative model for single-channel and multi-channel speech enhancement, and the last two modify different speaking styles by modeling speech components with autoencoders.

Multi-channel speech enhancement with ad-hoc sensors has been a challenging task. Speech-model-guided beamforming algorithms can recover natural-sounding speech, but their speech models tend to be oversimplified to keep inference tractable. Deep-learning-based enhancement approaches, on the other hand, can learn complicated speech distributions and perform efficient inference, but they cannot handle a variable number of input channels and they introduce many errors, particularly in the presence of unseen noise types and settings. Therefore an enhancement framework called DeepBeam is proposed, which combines the two complementary classes of algorithms: a beamforming filter produces natural-sounding speech, but the filter coefficients are determined with the help of a WaveNet-based monaural speech enhancement model. Experiments on synthetic and real-world data show that DeepBeam can produce clean, dry, and natural-sounding speech, and is robust against unseen noise.
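The DeepBeam idea of fitting beamforming filter coefficients against a model-enhanced target can be illustrated with a minimal least-squares sketch. This is not the dissertation's algorithm: the enhancement-model output is replaced by a placeholder clean signal, and `estimate_beamformer`, the FIR filter length, and the toy data are all illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def estimate_beamformer(noisy, target, filt_len=8):
    """Least-squares fit of per-channel FIR beamforming filters.

    noisy  : (C, T) multichannel noisy recordings
    target : (T,)   estimate of the clean speech; in DeepBeam this role
             is played by the output of a WaveNet-based monaural
             enhancement model, here it is simply given
    """
    C, T = noisy.shape
    # Design matrix of delayed copies of every channel.
    cols = []
    for c in range(C):
        for d in range(filt_len):
            col = np.zeros(T)
            col[d:] = noisy[c, :T - d]
            cols.append(col)
    A = np.stack(cols, axis=1)                # (T, C * filt_len)
    w, *_ = np.linalg.lstsq(A, target, rcond=None)
    return w.reshape(C, filt_len), A @ w      # filters, beamformed output

# Toy example: two noisy copies of a sinusoidal "source".
T = 1000
source = np.sin(0.05 * np.arange(T))
noisy = np.stack([source + 0.3 * rng.standard_normal(T) for _ in range(2)])
filters, beamformed = estimate_beamformer(noisy, source)
```

Because the least-squares solution can at worst reproduce any single channel, the beamformed output has no higher residual error than the noisiest input, which is the sense in which a good target signal "guides" the spatial filter.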
For single-channel speech enhancement, existing deep-learning-based methods still have two limitations: many do not adopt a Bayesian framework, and most operate in the frequency domain of the noisy speech, such as on the spectrogram and its variations. A Bayesian speech enhancement framework called BaWN (Bayesian WaveNet) is proposed, which operates directly on raw audio samples. It uses WaveNet as the prior model to regularize the output to lie in the space of natural speech, thus improving performance. Experiments show that BaWN can recover clean and natural speech.

Non-parallel many-to-many voice conversion, as well as zero-shot voice conversion, remain under-explored areas. Deep style-transfer algorithms, such as generative adversarial networks (GANs) and conditional variational autoencoders (CVAEs), are popular solutions in this field. However, GAN training is sophisticated and difficult, and there is no strong evidence that its generated speech is of good perceptual quality; CVAE training is simple, but lacks the distribution-matching property of GANs. A new style-transfer scheme is proposed that involves only an autoencoder with a carefully designed bottleneck, and that achieves distribution-matching style transfer by training on a self-reconstruction loss alone. Based on this scheme, AutoVC is proposed, which achieves state-of-the-art results in many-to-many voice conversion with non-parallel data, and which is the first to perform zero-shot voice conversion.

Speech information can be roughly decomposed into four components: language content, timbre, pitch, and rhythm. Obtaining disentangled representations of these components is useful in many speech analysis and generation applications.
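The "carefully designed bottleneck" behind AutoVC can be sketched in terms of tensor shapes: the content encoder squeezes the input both in dimension and in time, and the decoder is conditioned on a target speaker embedding at every frame. The sketch below is a shape-level illustration only; the random projection stands in for a learned encoder, and `code_dim`, `stride`, and the embedding size are illustrative values, not the paper's.

```python
import numpy as np

rng = np.random.default_rng(1)

def content_bottleneck(features, code_dim=4, stride=8):
    """Toy AutoVC-style information bottleneck: project each frame to a
    low dimension and keep only every `stride`-th code, so the encoder
    cannot carry full speaker information through to the decoder."""
    T, D = features.shape
    proj = rng.standard_normal((D, code_dim)) / np.sqrt(D)  # stand-in for a learned encoder
    codes = features @ proj                  # (T, code_dim) dimension reduction
    return codes[::stride]                   # temporal downsampling

def decoder_input(codes, speaker_emb, stride=8):
    """Upsample the content codes back to frame rate and concatenate the
    target speaker embedding at every frame, mirroring how the decoder
    is conditioned on speaker identity."""
    up = np.repeat(codes, stride, axis=0)    # back to T frames
    emb = np.tile(speaker_emb, (up.shape[0], 1))
    return np.concatenate([up, emb], axis=1)

mel = rng.standard_normal((128, 80))         # stand-in mel-spectrogram, T=128 frames
emb = rng.standard_normal(16)                # stand-in speaker embedding
codes = content_bottleneck(mel)              # (16, 4): too narrow to keep timbre
dec_in = decoder_input(codes, emb)           # (128, 20): content codes + speaker identity
```

The design intuition is that when the bottleneck is just wide enough for content, the decoder is forced to take speaker identity from the embedding, so swapping in a different speaker's embedding at conversion time changes timbre while preserving content.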
Recently, state-of-the-art voice conversion systems have led to speech representations that disentangle speaker-dependent from speaker-independent information. However, these systems disentangle only timbre; information about pitch, rhythm, and content remains mixed. Further disentangling the remaining speech components is an under-determined problem in the absence of explicit annotations for each component, which are difficult and expensive to obtain. To address this problem, SpeechSplit is proposed, which can blindly decompose speech into its four components by introducing three carefully designed information bottlenecks. SpeechSplit is among the first algorithms that can separately perform style transfer on timbre, pitch, and rhythm without text labels.
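One of the bottleneck mechanisms associated with SpeechSplit-style disentanglement is random resampling along time, which corrupts rhythm in an encoder's input so that timing information must flow through a dedicated rhythm channel instead. The sketch below shows only that mechanism; the function name, segment lengths, and toy signal are illustrative assumptions, not the dissertation's implementation.

```python
import numpy as np

rng = np.random.default_rng(2)

def random_resample(x, min_len=2, max_len=5):
    """Toy random-resampling bottleneck: split a 1-D signal into short
    segments and stretch or squeeze each one by a random factor. The
    local content survives, but the original timing (rhythm) does not,
    so an encoder fed this signal cannot reliably encode rhythm."""
    out, i = [], 0
    T = len(x)
    while i < T:
        seg = x[i:i + rng.integers(min_len, max_len + 1)]
        new_len = rng.integers(min_len, max_len + 1)
        # Linearly interpolate the segment to its new random length.
        idx = np.linspace(0, len(seg) - 1, new_len)
        out.append(np.interp(idx, np.arange(len(seg)), seg))
        i += len(seg)
    return np.concatenate(out)

signal = np.sin(0.1 * np.arange(200))   # stand-in for a feature track
warped = random_resample(signal)        # same local shape, scrambled timing
```

With one such bottleneck per component, each encoder is starved of all but "its" component, which is how the decomposition can be learned blindly from a reconstruction loss without per-component labels.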
Issue Date: 2020-12-01
Type: Thesis
URI: http://hdl.handle.net/2142/109510
Rights Information: Copyright 2020 Kaizhi Qian
Date Available in IDEALS: 2021-03-05
Date Deposited: 2020-12

