Data augmentation and data efficiency for low-resource language processing
Zhou, Jianing
Permalink
https://hdl.handle.net/2142/129524
Description
- Title
- Data augmentation and data efficiency for low-resource language processing
- Author(s)
- Zhou, Jianing
- Issue Date
- 2025-04-14
- Director of Research (if dissertation) or Advisor (if thesis)
- Bhat, Suma
- Doctoral Committee Chair(s)
- Bhat, Suma
- Committee Member(s)
- Han, Jiawei
- Zhai, Chengxiang
- Peng, Hao
- Bond, William
- Department of Study
- Computer Science
- Discipline
- Computer Science
- Degree Granting Institution
- University of Illinois Urbana-Champaign
- Degree Name
- Ph.D.
- Degree Level
- Dissertation
- Keyword(s)
- low-resource language
- data augmentation
- data efficiency
- multi-modal models
- Abstract
- Language processing in low-resource settings presents unique challenges due to the scarcity of labeled data, linguistic diversity, and the difficulty of learning robust representations under data constraints. This dissertation explores novel approaches to enhancing neural models in low-resource settings through data augmentation and data-efficient learning strategies. Specifically, we introduce methods for generating synthetic training data, improve data efficiency via contrastive and curriculum learning, and extend these techniques to modalities beyond text.

The thesis first investigates data augmentation, proposing the BART-IBT framework, which employs weakly supervised and unsupervised approaches to expand parallel datasets for idiomatic expression paraphrasing. A multi-view augmentation method is then introduced to inject idiomatic knowledge into pre-trained language models, improving their ability to process non-compositional expressions across tasks. The multi-view idea is further extended to multi-modal tasks to better process low-resource speech, including different dialects.

To address data efficiency, we develop CLCL, a novel framework that integrates contrastive and curriculum learning to enhance low-resource language understanding, particularly for non-compositional expressions such as idioms and metaphors. This approach dynamically adjusts training difficulty based on model performance, improving generalization and robustness. We then turn to a harder task, non-compositional expression generation in a naturally low-resource scenario, combining curriculum learning with continual learning to better utilize the available data. Furthermore, we extend these ideas to multi-modal scenarios, introducing CLASP, a framework for efficiently aligning modalities such as speech and text under low-resource conditions. CLASP leverages contrastive learning and modality-specific representations to improve cross-modal understanding. In addition, a multi-view contrastive learning strategy is proposed for dialectal speech processing, enabling efficient adaptation to new linguistic variations with minimal data.

Through extensive experiments across multiple tasks, the proposed methods demonstrate significant improvements over existing baselines. This thesis shows that (1) compared to traditional dataset-level augmentation, it is more effective and more efficient to augment data at the instance level; (2) curriculum learning helps use limited data efficiently, but it should be enhanced by connecting inter-dataset examples and by mitigating catastrophic forgetting; and (3) methods designed solely for text do not transfer perfectly to other modalities, which calls for modality-specific mechanisms. These findings advance low-resource language processing, offering scalable and efficient solutions applicable to diverse linguistic and multi-modal domains.
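To make the contrastive-plus-curriculum recipe described above concrete, here is a minimal sketch in PyTorch of the two ingredients: an in-batch InfoNCE contrastive loss and a competence-based curriculum that grows the training pool as the model improves. The function names, the linear competence schedule, and the use of per-example loss as a difficulty score are illustrative assumptions, not the thesis's actual CLCL implementation.

```python
# Minimal sketch: contrastive loss + competence-based curriculum (PyTorch).
# Names and the linear schedule are illustrative, not the CLCL implementation.
import torch
import torch.nn.functional as F

def info_nce(anchor, positive, temperature=0.07):
    """In-batch InfoNCE: the i-th anchor should match the i-th positive;
    all other examples in the batch act as negatives."""
    anchor = F.normalize(anchor, dim=-1)
    positive = F.normalize(positive, dim=-1)
    logits = anchor @ positive.t() / temperature                 # (B, B) similarities
    targets = torch.arange(anchor.size(0), device=anchor.device)
    return F.cross_entropy(logits, targets)

def curriculum_subset(difficulty, step, total_steps, min_frac=0.2):
    """Competence-based curriculum: train on the easiest fraction of the
    data, growing that fraction linearly from min_frac to 1.0."""
    competence = min(1.0, min_frac + (1.0 - min_frac) * step / total_steps)
    k = max(1, int(competence * difficulty.numel()))
    return torch.argsort(difficulty)[:k]                         # easiest-first indices
```

In practice, `difficulty` could be each example's running training loss, so the curriculum "dynamically adjusts training difficulty based on model performance" as the abstract puts it: as loss estimates change, examples migrate between the easy and hard ends of the ordering.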
Beyond the empirical findings, this thesis also offers practical and speculative recommendations: avoid over-reliance on large models without domain adaptation; incorporate curriculum or contrastive learning while addressing the shortcomings of their traditional forms; and integrate mechanisms for modality-aware alignment and context-specific adaptation to bridge the performance gap in low-resource scenarios.
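The speech-text alignment that CLASP performs, as described in the abstract, can likewise be approximated with a CLIP-style dual-encoder objective: modality-specific projection heads map speech and text features into a shared space, and a symmetric contrastive loss aligns paired examples. The dimensions, the learnable temperature, and the random features standing in for real encoder outputs are all assumptions for illustration, not the thesis's architecture.

```python
# Minimal sketch: CLIP-style speech-text contrastive alignment (PyTorch).
# Dimensions and projection heads are illustrative stand-ins.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SpeechTextAligner(nn.Module):
    def __init__(self, speech_dim=768, text_dim=768, shared_dim=256):
        super().__init__()
        # Modality-specific heads into a shared embedding space.
        self.speech_proj = nn.Linear(speech_dim, shared_dim)
        self.text_proj = nn.Linear(text_dim, shared_dim)
        self.log_temp = nn.Parameter(torch.zeros(()))  # learnable temperature

    def forward(self, speech_feats, text_feats):
        s = F.normalize(self.speech_proj(speech_feats), dim=-1)
        t = F.normalize(self.text_proj(text_feats), dim=-1)
        logits = (s @ t.t()) * self.log_temp.exp()     # (B, B) similarities
        labels = torch.arange(s.size(0), device=s.device)
        # Symmetric loss: speech->text and text->speech retrieval.
        return 0.5 * (F.cross_entropy(logits, labels) +
                      F.cross_entropy(logits.t(), labels))

# Random features stand in for pretrained speech/text encoder outputs.
aligner = SpeechTextAligner()
loss = aligner(torch.randn(8, 768), torch.randn(8, 768))
```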
- Graduation Semester
- 2025-05
- Type of Resource
- Thesis
- Handle URL
- https://hdl.handle.net/2142/129524
- Copyright and License Information
- Copyright 2025 Jianing Zhou
Owning Collections
Graduate Dissertations and Theses at Illinois (primary)