Data augmentation and data efficiency for low-resource language processing
Zhou, Jianing
Permalink
https://hdl.handle.net/2142/129524
Description
Title
Data augmentation and data efficiency for low-resource language processing
Author(s)
Zhou, Jianing
Issue Date
2025-04-14
Director of Research (if dissertation) or Advisor (if thesis)
Bhat, Suma
Doctoral Committee Chair(s)
Bhat, Suma
Committee Member(s)
Han, Jiawei
Zhai, Chengxiang
Peng, Hao
Bond, William
Department of Study
Computer Science
Discipline
Computer Science
Degree Granting Institution
University of Illinois Urbana-Champaign
Degree Name
Ph.D.
Degree Level
Dissertation
Keyword(s)
low-resource language
data augmentation
data efficiency
multi-modal models
Language
eng
Abstract
Language processing in low-resource settings presents unique challenges due to the scarcity of labeled data, linguistic diversity, and the complexity of learning robust representations under data constraints. This dissertation explores novel approaches to enhance neural models in low-resource settings through data augmentation and data-efficient learning strategies. Specifically, we introduce methods for generating synthetic training data, improving data efficiency via contrastive and curriculum learning, and extending these techniques to modalities beyond text. The thesis first investigates data augmentation techniques, proposing the BART-IBT framework, which employs weakly supervised and unsupervised approaches to expand parallel datasets for idiomatic expression paraphrasing. Additionally, a multi-view augmentation method is introduced to inject idiomatic knowledge into pre-trained language models, improving their ability to process non-compositional expressions across tasks. To further develop the idea of multi-view data augmentation, it is extended to multi-modal tasks to better process low-resource speech, including different dialects. To address data efficiency, we develop CLCL, a novel framework that integrates contrastive and curriculum learning to enhance low-resource language understanding, particularly for non-compositional expressions such as idioms and metaphors. This approach dynamically adjusts training difficulty based on model performance, leading to improved generalization and robustness. We then explore a harder task, non-compositional expression generation in a naturally low-resource scenario, combining curriculum learning and continual learning to better utilize the available data. Furthermore, we extend our methods to multi-modal scenarios, introducing CLASP, a framework for efficiently aligning different modalities, such as speech and text, under low-resource conditions.
This model leverages contrastive learning and modality-specific representations to improve cross-modal understanding. Additionally, a multi-view contrastive learning strategy is proposed for dialectal speech processing, allowing efficient adaptation to new linguistic variations with minimal data. Through extensive experiments across multiple tasks and areas, the proposed methods demonstrate significant improvements over existing baselines. This thesis shows that (1) compared to traditional dataset-level augmentation methods, it is better and more efficient to augment data at the instance level; (2) curriculum learning is beneficial for utilizing limited data efficiently, though it should be enhanced by connecting inter-dataset examples and by mitigating the catastrophic forgetting problem; and (3) methods designed solely for text do not transfer perfectly to other modalities, which calls for modality-specific mechanisms. The findings contribute to advancing low-resource language processing capabilities, offering scalable and efficient solutions applicable to diverse linguistic and multi-modal domains. Beyond the empirical findings, this thesis also offers practical and speculative recommendations, including the importance of avoiding over-reliance on large models without domain adaptation, the benefits of incorporating curriculum or contrastive learning while addressing the shortcomings of their traditional formulations, and the necessity of integrating mechanisms for modality-aware alignment and context-specific adaptation to bridge the performance gap in low-resource scenarios.
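To make the cross-modal alignment idea concrete, the following is a minimal sketch of an InfoNCE-style contrastive loss over paired speech and text embeddings. It illustrates only the general principle the abstract describes (matched pairs pulled together, in-batch mismatches pushed apart); the function name and all details are illustrative assumptions, not the actual CLASP objective.

```python
import numpy as np

def l2_normalize(x, axis=-1):
    """Scale each row to unit length so dot products become cosine similarities."""
    return x / np.linalg.norm(x, axis=axis, keepdims=True)

def contrastive_alignment_loss(speech_emb, text_emb, temperature=0.07):
    """InfoNCE-style loss: the i-th speech and i-th text embeddings are a
    positive pair; every other pairing in the batch serves as a negative.
    (Hypothetical sketch, not the dissertation's exact formulation.)"""
    s = l2_normalize(np.asarray(speech_emb, dtype=float))
    t = l2_normalize(np.asarray(text_emb, dtype=float))
    logits = s @ t.T / temperature          # (batch, batch) similarity matrix
    # Numerically stable log-softmax over each row; the diagonal entries
    # correspond to the matched (positive) pairs.
    logits -= logits.max(axis=1, keepdims=True)
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return float(-np.mean(np.diag(log_probs)))
```

Under this sketch, well-aligned speech/text pairs yield a loss near zero, while randomly paired embeddings yield a loss near log(batch size), which is what drives the two encoders toward a shared representation space.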