Files in this item

File: 1_Kim_Lae-Hoon.pdf (7 MB)
Description: (no description provided)
Format: PDF (application/pdf)

Description

Title: Statistical Model Based Multi-Microphone Speech Processing: Toward Overcoming Mismatch Problem
Author(s): Kim, Lae-Hoon
Director of Research: Hasegawa-Johnson, Mark A.
Doctoral Committee Chair(s): Hasegawa-Johnson, Mark A.
Doctoral Committee Member(s): Levinson, Stephen E.; Do, Minh N.; Fleck, Margaret M.
Department / Program: Electrical & Computer Eng
Discipline: Electrical & Computer Engr
Degree Granting Institution: University of Illinois at Urbana-Champaign
Degree: Ph.D.
Genre: Dissertation
Subject(s): Independent component analysis; beamforming; expectation maximization beamforming (EMB); robust automatic speech recognition; missing feature
Abstract: In this thesis, a joint optimal method for clean speech estimation and automatic speech recognition (ASR) under mismatched conditions is described, using a unified speech model within a generalized expectation maximization (GEM) scheme. From this perspective, multi-microphone optimal speech estimation can be interpreted as pre-processing that increases the reliability of feature components before the actual speech recognition or model-based speech estimation is performed. Likewise, ideal binary mask (IBM) estimation in the context of the statistical model for ASR can be regarded as an initialization step that excludes the unreliable portion of the signal for ASR and increases estimation accuracy based only on the reliable components and the trained speech process model.

Optimal multi-microphone speech processing is performed in the short-time Fourier transform (STFT) domain, since atomic speech information can be meaningfully represented by a series of 10 to 30 ms short frames. Convolution in the time domain is formulated as filtering via a feed-forward network in the STFT domain, which is shown to be an appropriate representation under the overlap-add framework. With this structure in mind, sufficient statistics for estimating the target speech from the multi-microphone measurements are formulated, and realistic relaxations of them are discussed, since not only the target speech but also the room impulse responses (RIRs) must be estimated, and the RIRs carry unavoidable uncertainty due to the movement of speakers.

First, separation of reverberant speech mixtures with typical background noise is tackled. Standard adaptive independent component analysis (ICA), implemented with the natural gradient method, is extended into the STFT domain with regularized feed-forward ICA (RFFICA) and direction-per-frequency post-processing. This method showed up to almost an order of magnitude performance improvement (29 dB in C-weighting) over state-of-the-art methods.
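The natural-gradient ICA update that RFFICA builds on can be sketched for the simpler instantaneous-mixture case. This is a hypothetical illustration, not the thesis code: RFFICA additionally applies regularized feed-forward filtering per STFT frequency bin, while the sketch below separates a time-domain mixture of two super-Gaussian sources.

```python
import numpy as np

def natural_gradient_ica(x, n_iter=500, lr=0.05, seed=0):
    """Natural-gradient ICA sketch: x is (n_sources, n_samples) mixed data;
    returns a demixing matrix W so that W @ x approximates the sources."""
    rng = np.random.default_rng(seed)
    n = x.shape[0]
    W = np.eye(n) + 0.01 * rng.standard_normal((n, n))
    for _ in range(n_iter):
        y = W @ x                    # current source estimates
        phi = np.tanh(y)             # score function for super-Gaussian sources
        # Natural-gradient update: W += lr * (I - E[phi(y) y^T]) W
        W += lr * (np.eye(n) - (phi @ y.T) / x.shape[1]) @ W
    return W

# Toy demo: two Laplacian (super-Gaussian) sources, fixed mixing matrix.
rng = np.random.default_rng(1)
s = rng.laplace(size=(2, 5000))
A = np.array([[1.0, 0.6], [0.4, 1.0]])
W = natural_gradient_ica(A @ s)
# If separation succeeded, W @ A is close to a scaled permutation matrix.
P = W @ A
```

The `(I - E[phi(y) y^T]) W` form is the standard natural-gradient (relative-gradient) variant of the Infomax update; it avoids matrix inversion and converges at equivariant speed regardless of the conditioning of the mixing matrix.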
Second, the filters are updated quickly using a smaller amount of measured data that shares the same directional information about the target and interference locations. Expectation maximization beamforming (EMB) followed by minimum mean squared error (MMSE) post-filtering is proposed to reduce the number of filter taps to update. Because generative-model-based information about the target speech presence probability can be obtained per frequency bin and per frame, with enhanced robust direction-of-arrival (DOA) estimation capability, EMB can also replace the direction-per-frequency post-processing that had been applied independently after RFFICA.

Third, DOA-only beamforming is extended to early-response-based beamforming. The RIRs of the target and interference speech are estimated given the robust DOA estimates, and linearly constrained minimum variance (LCMV) beamforming is constructed, which extends easily within the EMB framework. Because a two-step approach is used (estimating the RIRs first, then applying a demixing filter) without introducing more taps in the frame for adaptation, good demixing and dereverberation results are obtained.

Finally, IBM estimation and ASR are jointly formulated under the GEM framework. Even with optimal front-end pre-processing, some portion of the signal always mismatches the statistical speech process model used for ASR. Identifying the corrupted portions and removing them from ASR, from the perspective of ASR itself, is therefore a necessary procedure. The cepstral-domain ASR models are transformed into the spectral domain without loss of information through a global tying process. The proposed algorithm achieved much higher absolute ASR accuracy (improvements ranging from 14.69% at 0 dB signal-to-noise ratio (SNR) to 40.10% at 15 dB SNR) than a conventional ASR method with optimal front-end processing in a highly non-stationary mismatch environment.
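The reliable-vs-unreliable split that the IBM performs on time-frequency cells can be illustrated with a minimal sketch. The function name and the 0 dB local-SNR threshold below are illustrative assumptions, not the thesis formulation, which estimates the mask jointly with ASR under the GEM framework rather than from oracle spectrograms.

```python
import numpy as np

def ideal_binary_mask(speech_spec, noise_spec, threshold_db=0.0):
    """IBM sketch: keep a time-frequency cell (mask = 1) when the local SNR
    exceeds threshold_db; other cells are treated as missing for ASR.
    Inputs are magnitude spectrograms of shape (n_freq, n_frames)."""
    eps = 1e-12  # avoid log(0)
    local_snr_db = 20.0 * np.log10((speech_spec + eps) / (noise_spec + eps))
    return (local_snr_db > threshold_db).astype(float)

# Toy example: speech dominates one cell per frame, noise dominates the rest.
speech = np.array([[10.0, 0.1],
                   [0.2,  8.0]])
noise = np.ones((2, 2))
mask = ideal_binary_mask(speech, noise)  # 1 where speech dominates
masked = mask * speech                   # unreliable cells zeroed out
```

In missing-feature ASR the zeroed cells are not simply discarded: they bound the clean-speech likelihood (marginalization or data imputation), which is why a good initial mask improves recognition in mismatched conditions.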
Issue Date:2010-08-20
URI:http://hdl.handle.net/2142/16839
Rights Information:Copyright 2010 Lae-Hoon Kim
Date Available in IDEALS:2010-08-20
Date Deposited:2010-08

