Files in this item

WU-THESIS-2020.pdf (PDF, 509kB). Restricted to U of Illinois.

Title:Semi-supervised cycle-consistency training for end-to-end ASR using unpaired speech
Author(s):Wu, Ningkai
Advisor(s):Hasegawa-Johnson, Mark
Department / Program:Electrical & Computer Eng
Discipline:Electrical & Computer Engr
Degree Granting Institution:University of Illinois at Urbana-Champaign
Subject(s):Speech recognition; Semi-supervised training
Abstract:This thesis replicates the work of Takaaki Hori and colleagues (2019), which introduced a method for training end-to-end automatic speech recognition (ASR) models using unpaired speech. In general, large amounts of paired data (speech and text) are needed to train an end-to-end ASR system. To alleviate the problem of limited paired data, cycle-consistency losses have recently been proposed in areas such as machine translation and computer vision. In ASR, cycle-consistency training is achieved by building a reverse system, e.g., a text-to-speech system, and defining a loss that compares the reconstructed signal with the original one. However, applying cycle-consistency to ASR is not straightforward, because information is lost in the text bottleneck. Tomoki Hayashi et al. (2018) tackled this problem with a text-to-encoder (TTE) model, which predicts, from text input, the encoder states extracted by a pre-trained end-to-end ASR encoder. In this work, the TTE model was used as the reverse system, and the loss was defined by comparing the original ASR encoder states with the encoder states reconstructed by the TTE model. With encoder states rather than raw acoustic features as targets, the model can learn attention much faster and avoid modeling speaker dependencies. Our experimental results on the LibriSpeech corpus were similar to those of Hori et al. The initial ASR and TTE models were trained on the LibriSpeech 100-hour paired speech data. Applying the cycle-consistency loss and retraining the speech-to-text-to-encoder chain model on one third of the LibriSpeech 360-hour set, used as unpaired speech, reduced the ASR word error rate from 25.8% to 21.7% on the LibriSpeech 5-hour test data.
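The encoder-state comparison described in the abstract can be illustrated with a minimal sketch. The abstract does not give the exact form of the loss, so the distance used here (mean absolute error plus mean squared error over all frames and dimensions) is an assumption, and the function names `cycle_consistency_loss` and `relative_wer_reduction` are hypothetical. The sketch also leaves out how gradients are passed through the discrete text hypothesis, which a full training setup would have to handle.

```python
from typing import Sequence


def cycle_consistency_loss(
    original: Sequence[Sequence[float]],
    reconstructed: Sequence[Sequence[float]],
) -> float:
    """Distance between ASR encoder states and TTE-reconstructed states.

    Both inputs are (frames x dimensions) sequences of equal shape.
    ASSUMPTION: the distance is mean L1 plus mean squared error; the
    source abstract does not specify the exact loss.
    """
    if len(original) != len(reconstructed):
        raise ValueError("state sequences must have the same number of frames")

    total_abs = 0.0
    total_sq = 0.0
    count = 0
    for h, h_hat in zip(original, reconstructed):
        if len(h) != len(h_hat):
            raise ValueError("frames must have the same dimensionality")
        for a, b in zip(h, h_hat):
            diff = a - b
            total_abs += abs(diff)
            total_sq += diff * diff
            count += 1

    if count == 0:
        raise ValueError("state sequences must not be empty")
    return total_abs / count + total_sq / count


def relative_wer_reduction(wer_before: float, wer_after: float) -> float:
    """Relative reduction in word error rate, as a fraction of wer_before."""
    return (wer_before - wer_after) / wer_before


if __name__ == "__main__":
    # Toy encoder states: 2 frames, 2 dimensions each (illustrative values only).
    enc = [[0.0, 0.0], [1.0, 1.0]]
    tte = [[1.0, -1.0], [1.0, 1.0]]
    print(cycle_consistency_loss(enc, tte))

    # WER figures reported in the abstract: 25.8% before, 21.7% after.
    print(round(100 * relative_wer_reduction(25.8, 21.7), 1))
```

On the figures reported in the abstract, the drop from 25.8% to 21.7% is an absolute reduction of 4.1 points, or roughly a 15.9% relative reduction.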
Issue Date:2020-05-14
Rights Information:Copyright 2020 Ningkai Wu
Date Available in IDEALS:2020-08-26
Date Deposited:2020-05
