Data augmentation and data efficiency for low-resource language processing
Zhou, Jianing
Permalink
https://hdl.handle.net/2142/129524
Description
Title
Data augmentation and data efficiency for low-resource language processing
Author(s)
Zhou, Jianing
Issue Date
2025-04-14
Director of Research (if dissertation) or Advisor (if thesis)
Bhat, Suma
Doctoral Committee Chair(s)
Bhat, Suma
Committee Member(s)
Han, Jiawei
Zhai, Chengxiang
Peng, Hao
Bond, William
Department of Study
Computer Science
Discipline
Computer Science
Degree Granting Institution
University of Illinois Urbana-Champaign
Degree Name
Ph.D.
Degree Level
Dissertation
Keyword(s)
low-resource language
data augmentation
data efficiency
multi-modal models
Language
eng
Abstract
Language processing in low-resource settings presents unique challenges due to the scarcity of labeled data, linguistic diversity, and the complexity of learning robust representations under data constraints. This dissertation explores novel approaches to enhance neural models in low-resource settings through data augmentation and data-efficient learning strategies. Specifically, we introduce methods for generating synthetic training data, improving data efficiency via contrastive and curriculum learning, and extending these techniques to modalities beyond text. The thesis first investigates data augmentation techniques, proposing the BART-IBT framework, which employs weakly supervised and unsupervised approaches to expand parallel datasets for idiomatic expression paraphrasing. Additionally, a multi-view augmentation method is introduced to inject idiomatic knowledge into pre-trained language models, improving their ability to process non-compositional expressions across tasks. To further develop the idea of multi-view data augmentation, it is extended to multi-modal tasks to better process low-resource speech, including different dialects. To address data efficiency, we develop CLCL, a novel framework that integrates contrastive and curriculum learning to enhance low-resource language understanding, particularly for non-compositional expressions such as idioms and metaphors. This approach dynamically adjusts training difficulty based on model performance, leading to improved generalization and robustness. We then explore a harder task, non-compositional expression generation in a naturally low-resource scenario, combining curriculum learning and continual learning to better utilize the available data. Furthermore, we extend our methods to multi-modal scenarios, introducing CLASP, a framework for efficiently aligning different modalities, such as speech and text, under low-resource conditions.
This model leverages contrastive learning and modality-specific representations to improve cross-modal understanding. Additionally, a multi-view contrastive learning strategy is proposed for dialectal speech processing, allowing efficient adaptation to new linguistic variations with minimal data. Through extensive experiments across multiple tasks and areas, the proposed methods demonstrate significant improvements over existing baselines. This thesis shows that (1) compared to traditional dataset-level augmentation methods, it is better and more efficient to augment data at the instance level; (2) curriculum learning is beneficial for utilizing limited data efficiently, though it should be enhanced by connecting inter-dataset examples and by mitigating the catastrophic forgetting problem; and (3) methods designed solely for text do not transfer perfectly to other modalities, which calls for modality-specific mechanisms. The findings contribute to advancing low-resource language processing capabilities, offering scalable and efficient solutions applicable to diverse linguistic and multi-modal domains. Beyond the empirical findings, this thesis also offers practical and speculative recommendations, including the importance of avoiding over-reliance on large models without domain adaptation, the benefits of incorporating curriculum or contrastive learning while addressing the shortcomings of their traditional formulations, and the necessity of integrating mechanisms for modality-aware alignment and context-specific adaptation to bridge the performance gap in low-resource scenarios.
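To make the cross-modal alignment idea concrete, the following is a minimal sketch of an InfoNCE-style contrastive loss over paired speech and text embeddings. It illustrates only the general principle the abstract describes (matched pairs pulled together, in-batch mismatches pushed apart); the function name and all details are illustrative assumptions, not the actual CLASP objective.

```python
import numpy as np

def l2_normalize(x, axis=-1):
    """Scale each row to unit length so dot products become cosine similarities."""
    return x / np.linalg.norm(x, axis=axis, keepdims=True)

def contrastive_alignment_loss(speech_emb, text_emb, temperature=0.07):
    """InfoNCE-style loss: the i-th speech and i-th text embeddings are a
    positive pair; every other pairing in the batch serves as a negative.
    (Hypothetical sketch, not the dissertation's exact formulation.)"""
    s = l2_normalize(np.asarray(speech_emb, dtype=float))
    t = l2_normalize(np.asarray(text_emb, dtype=float))
    logits = s @ t.T / temperature          # (batch, batch) similarity matrix
    # Numerically stable log-softmax over each row; the diagonal entries
    # correspond to the matched (positive) pairs.
    logits -= logits.max(axis=1, keepdims=True)
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return float(-np.mean(np.diag(log_probs)))
```

Under this sketch, well-aligned speech/text pairs yield a loss near zero, while randomly paired embeddings yield a loss near log(batch size), which is what drives the two encoders toward a shared representation space.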