Towards unsupervised speech technology with fewer resources
Ni, Junrui
Permalink
https://hdl.handle.net/2142/129378
Description
- Title
- Towards unsupervised speech technology with fewer resources
- Author(s)
- Ni, Junrui
- Issue Date
- 2025-03-31
- Director of Research (if dissertation) or Advisor (if thesis)
- Hasegawa-Johnson, Mark
- Doctoral Committee Chair(s)
- Hasegawa-Johnson, Mark
- Committee Member(s)
- Schwing, Alexander
- Bhat, Suma
- Shomorony, Ilan
- Department of Study
- Electrical & Computer Eng
- Discipline
- Electrical & Computer Engr
- Degree Granting Institution
- University of Illinois Urbana-Champaign
- Degree Name
- Ph.D.
- Degree Level
- Dissertation
- Keyword(s)
- Unsupervised speech processing
- Automatic speech recognition
- Text-to-speech synthesis
- Abstract
- Recent advancements in supervised automatic speech recognition (ASR) and text-to-speech synthesis (TTS) have achieved remarkable performance, primarily driven by the increasing availability of large transcribed speech corpora. In this work, we propose an unsupervised TTS system and a whole-word unsupervised ASR system, aiming to advance fully unsupervised speech technology that can be developed with minimal supervision. An unsupervised TTS system learns to generate the speech waveform corresponding to any written sentence in a language by observing 1) a collection of untranscribed speech waveforms in that language and 2) a collection of texts written in that language, without access to any transcribed speech. Developing such a system can significantly improve the availability of speech technology for languages that lack a large amount of parallel speech and text data. This work proposes an unsupervised TTS system that trains on the pseudo-transcripts from an unsupervised ASR system. Our unsupervised system achieves performance comparable to the supervised system in seven languages with about 10-20 hours of speech each. A careful study of the effect of text units and vocoders has also been conducted to better understand which factors affect unsupervised TTS performance. We further tackle the existing challenge of developing ASR systems without paired speech and text corpora by proposing to remove the reliance on a phoneme lexicon. We explore a new research direction, word-level unsupervised ASR, and experimentally demonstrate that an unsupervised speech recognizer can emerge from joint speech-to-speech and text-to-text masked token-infilling. Using a curated speech corpus containing a fixed number of English words, our system iteratively refines the word segmentation structure and achieves a word error rate between 20% and 23%, depending on the vocabulary size, without parallel transcripts, oracle word boundaries, or a pronunciation lexicon. This model outperforms previous unsupervised ASR models in the lexicon-free setting.
- Graduation Semester
- 2025-05
- Type of Resource
- Thesis
- Handle URL
- https://hdl.handle.net/2142/129378
- Copyright and License Information
- Copyright 2025 Junrui Ni
Owning Collections
Graduate Dissertations and Theses at Illinois PRIMARY
Graduate Theses and Dissertations at Illinois