Files in this item



application/pdf: ECE499-Sp2017-xu.pdf (1MB), restricted to U of Illinois
(no description provided)


Title: Multimodal LSTM for audio-visual speech recognition
Author(s): Xu, Yijia
Contributor(s): Hasegawa-Johnson, Mark
Subject(s): audio-visual speech recognition
speech recognition
long short term memory
connectionist temporal classification
multi layer perceptron
multimodal fusion
deep neural network
phoneme recognition
Abstract: Automatic speech recognition (ASR) permits effective interaction between humans and machines in environments where typing is impossible. Some environments, however, are more difficult than others: acoustic noise disrupts ASR. This research focuses on audio-visual speech recognition (AVSR), which improves noise robustness by adding visual speech information from the speaker's mouth region. The research includes a lip tracking system and a system for extracting effective audio and visual features for building an AVSR system. A context-independent phoneme dictionary is also built to map 3896 triphone states (trained by Intel on Intel data) to their 42 corresponding phoneme labels. Two methods for audio-visual speech recognition are proposed and compared. The first method upsamples the visual frames so that they are force-aligned with the audio frames and with the context-independent phoneme labels. Unimodal deep networks are trained with LSTM separately for the audio and visual streams on the AVICAR dataset, and their posteriors are fused to obtain the multimodal speech recognition result. The second method uses the Connectionist Temporal Classification (CTC) objective function for the LSTM. It does not require strict alignment between the audio-visual frames and the target labels. Instead, it automatically labels the unsegmented sequence of audio and visual data and then trains a classification neural network using a training criterion based on that automatic alignment, which is revised during every training iteration. The trained neural network is then used to perform audio-visual phoneme recognition. Results include an overall accuracy of 48.91% on audio-only phoneme recognition and 38.57% on visual-only phoneme recognition, which outperforms the 24.39% phoneme recognition accuracy of traditional deep neural networks. Audio-visual phoneme recognition achieves 0.04% higher accuracy than audio-only recognition. The CTC loss function turns speech recognition into a convenient end-to-end process. It achieves relatively high recognition accuracy (72.97%) on classification with a small number of classes, as measured with best path decoding.
Issue Date: 2017-05
Date Available in IDEALS: 2017-08-30
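
Illustrative sketch (not taken from the thesis): the abstract names three steps that are easy to show in a few lines: upsampling visual frames to the audio frame rate, fusing the per-frame posteriors of the two unimodal networks, and CTC best path decoding. The Python below is a minimal sketch under stated assumptions. The function names, the nearest-neighbour upsampling, the log-linear fusion rule, the `audio_weight` value, and the choice of index 0 for the CTC blank are all hypothetical; the record does not say which upsampling or fusion rule the author actually used.

```python
def upsample_nearest(frames, target_len):
    """Repeat visual frames so the sequence has target_len entries.

    Assumption: nearest-neighbour repetition; the thesis may interpolate instead.
    """
    n = len(frames)
    return [frames[min(n - 1, i * n // target_len)] for i in range(target_len)]


def fuse_posteriors(audio_post, visual_post, audio_weight=0.7):
    """Late fusion of per-frame class posteriors from two unimodal networks.

    Assumption: weighted log-linear combination, renormalised per frame.
    Inputs are lists of frames, each frame a list of strictly positive
    class probabilities; both streams must have the same length.
    """
    fused = []
    for a, v in zip(audio_post, visual_post):
        scores = [(pa ** audio_weight) * (pv ** (1.0 - audio_weight))
                  for pa, pv in zip(a, v)]
        total = sum(scores)
        fused.append([s / total for s in scores])
    return fused


def best_path_decode(posteriors, blank=0):
    """CTC best path decoding.

    Take the most probable label in every frame, collapse consecutive
    repeats, then drop the blank label (assumed here to be index 0).
    """
    path = [max(range(len(p)), key=p.__getitem__) for p in posteriors]
    decoded = []
    prev = None
    for label in path:
        if label != prev and label != blank:
            decoded.append(label)
        prev = label
    return decoded
```

For example, two visual frames stretched over four audio frames become `upsample_nearest(["v0", "v1"], 4)`, and the fused posteriors can be passed straight to `best_path_decode`. Best path decoding is the simplest CTC decoder: it keeps only the single most probable frame-level path, so it can miss a labelling whose probability is spread over many paths.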
