Knowledge-aware data generation
Jin, Xiaomeng
Permalink
https://hdl.handle.net/2142/129847
Description
- Title
- Knowledge-aware data generation
- Author(s)
- Jin, Xiaomeng
- Issue Date
- 2025-07-10
- Director of Research (if dissertation) or Advisor (if thesis)
- Ji, Heng
- Doctoral Committee Chair(s)
- Ji, Heng
- Committee Member(s)
- Zhai, Chengxiang
- Hakkani-Tur, Dilek
- Chang, Kai-Wei
- Dong, Xin (Luna)
- Department of Study
- Siebel School of Computing and Data Science
- Discipline
- Computer Science
- Degree Granting Institution
- University of Illinois Urbana-Champaign
- Degree Name
- Ph.D.
- Degree Level
- Dissertation
- Keyword(s)
- Data Generation
- Knowledge Base
- Data Augmentation
- Multimodality
- Adversarial Robustness
- Abstract
- Deep learning models have achieved remarkable success across a wide range of tasks. However, their training typically requires vast amounts of manually annotated data, which is both time-consuming and expensive to produce. To address this challenge, data generation has emerged as a widely adopted approach: by automatically synthesizing new labeled data at low cost, it expands training datasets and enhances model performance. Data generation also plays a crucial role in improving model robustness by producing adversarial examples, challenging data points that expose model vulnerabilities; analyzing these failure cases enables targeted improvements, leading to more robust and trustworthy models. While various data generation strategies have been proposed to improve model performance and robustness in both text and multimodal domains, existing methods still suffer from significant limitations, including low diversity, semantic inconsistencies, and high computational costs. These issues stem largely from a lack of structured guidance in the generation process. To overcome these challenges, we propose knowledge-guided data generation strategies that leverage structured knowledge to enhance the quality, reliability, and efficiency of generated data. Specifically, our work addresses three major challenges in data generation:

1. Enhancing diversity through schema-guided text augmentation. Generating diverse augmented data remains difficult because of the structural complexity of natural language and real-world scenarios. Traditional data augmentation techniques often rely on simple transformations, such as word masking, synonym replacement, and sentence shuffling for text, or random cropping and rotation for images. However, these approaches produce only semantically shallow variations and fail to introduce new contexts, which limits the diversity of the augmented datasets. To address this limitation, we propose a schema-based text data augmentation framework that utilizes structured knowledge in the form of event schema graphs, which encode common event structures, relationships, and constraints between entities and events in real-world scenarios. By leveraging this structured knowledge, our framework generates diverse, semantically coherent, and contextually rich texts. Applied to news generation, the method ensures that augmented articles follow realistic event structures, improving both model generalization and robustness (see the first sketch after this abstract).

2. Ensuring cross-modal consistency with knowledge-based multimodal augmentation. Consistency is a fundamental requirement in data generation: synthesized data must not introduce contradictions within or across modalities. However, existing multimodal augmentation methods often process modalities independently, transforming images and texts in isolation without considering their interdependencies. These augmentations often lead to semantic inconsistencies, where an image transformation no longer matches the corresponding textual description, introducing noise into training datasets and reducing model performance. To resolve this issue, we propose an attribute-based multimodal data augmentation approach that leverages a structured knowledge base of visual attributes. This knowledge base extracts semantic attributes from images (e.g., objects, colors, textures, actions) and ensures that transformations across modalities remain semantically aligned (see the second sketch below). By integrating structured knowledge into the augmentation process, our approach maintains cross-modal consistency, improves the quality and reliability of synthesized data, and enhances the robustness of multimodal learning models. Furthermore, we extend this structured knowledge approach with an open-world concept library framework that encodes relationships between concepts, their compositional parts, and their affordances. This ontology enables part-based reasoning, allowing models to infer relationships between parts and their affordances in different contexts, and it facilitates the generation of novel concepts by recombining affordances from existing ones.

3. Improving efficiency with knowledge-guided adversarial example generation. Efficiency is essential for scalable data generation, as computationally expensive approaches limit the practicality of producing large, diverse datasets. Existing adversarial data generation methods often rely on brute-force sampling: they generate a large pool of perturbed examples and then filter them based on model feedback, which incurs excessive computational costs and limits real-world applicability. To address this challenge, we develop a knowledge-aware adversarial example generation framework that significantly reduces computational costs while maintaining effectiveness. Specifically, we construct a structured knowledge base from Wikipedia and integrate it with word-attribution techniques and large language models to target critical perturbations precisely (see the third sketch below). Unlike traditional adversarial methods that rely on random perturbations or black-box attacks, our approach leverages structured knowledge to identify and generate adversarial examples that probe model weaknesses efficiently. By incorporating knowledge-driven strategies, our method keeps adversarial examples realistic, interpretable, and semantically meaningful while substantially improving the efficiency of adversarial robustness evaluation and model training.

By leveraging structured knowledge, our work directly addresses key challenges in data generation (diversity, consistency, and efficiency) while exploring the broader questions of model robustness, generalization, and AI-driven design. Our findings demonstrate that structured knowledge can guide data generation so that synthesized data align with real-world patterns, reduce noise, and enhance model performance. In future work, we aim to further develop knowledge-guided frameworks for open-world reasoning and concept synthesis, extend them to the recommendation-system domain, enable deeper generalization testing of AI systems, and explore data augmentation for LLM self-improvement.
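To make the schema-guided idea concrete, here is a minimal sketch, not the dissertation's actual schema-induction or generation pipeline: a toy event schema graph with temporal edges and entity roles is walked to produce a schema-consistent synthetic article. The schema, templates, and fillers are all hypothetical.

```python
# Minimal sketch of schema-guided text augmentation (toy data only).
import random

# Toy event schema graph: events connected by temporal edges, with
# entity-role constraints shared across the events they participate in.
SCHEMA_EDGES = [("protest", "dispersal"), ("dispersal", "arrest")]
TEMPLATES = {
    "protest":   "A {crowd} gathered in {location} to protest.",
    "dispersal": "{police} moved in and dispersed the {crowd}.",
    "arrest":    "{police} arrested a {suspect} at the scene.",
}
FILLERS = {
    "crowd":    ["large crowd", "group of students"],
    "location": ["the main square", "downtown"],
    "police":   ["Riot police", "Local officers"],
    "suspect":  ["demonstrator", "bystander"],
}

def sample_event_chain(start):
    """Walk temporal edges from a start event to build a coherent chain."""
    chain, current = [start], start
    while True:
        successors = [t for s, t in SCHEMA_EDGES if s == current]
        if not successors:
            return chain
        current = random.choice(successors)
        chain.append(current)

def augment():
    """Render one schema-consistent synthetic article."""
    slots = {role: random.choice(opts) for role, opts in FILLERS.items()}
    return " ".join(TEMPLATES[e].format(**slots)
                    for e in sample_event_chain("protest"))

print(augment())
```

Because the same slot fillers are reused across every event in the chain, the generated article stays internally consistent while still varying in entities and wording from sample to sample.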
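A second sketch illustrates the cross-modal consistency idea under stated assumptions: the `AttributeRecord` type and the `recolor_image` placeholder are inventions for illustration, not the thesis's API. The point is that a single attribute record drives both the image edit and the caption edit, so the two modalities cannot drift apart.

```python
# Sketch of attribute-aligned multimodal augmentation (hypothetical API).
from dataclasses import dataclass

@dataclass
class AttributeRecord:
    obj: str    # e.g., "car"
    attr: str   # e.g., "color"
    value: str  # e.g., "red"

def recolor_image(image, obj, new_value):
    """Placeholder: a real pipeline would segment `obj` and recolor it."""
    return image  # identity stand-in for the actual pixel edit

def augment_pair(image, caption, record, new_value):
    """Apply the SAME attribute change to the image and the caption."""
    new_image = recolor_image(image, record.obj, new_value)
    new_caption = caption.replace(f"{record.value} {record.obj}",
                                  f"{new_value} {record.obj}")
    return new_image, new_caption

record = AttributeRecord(obj="car", attr="color", value="red")
_, caption = augment_pair(None, "a red car parked on the street",
                          record, "blue")
print(caption)  # -> "a blue car parked on the street"
```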
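Finally, a toy sketch of the knowledge-guided adversarial idea: the victim model and the substitute table below are stand-ins (the thesis mines its knowledge base from Wikipedia and also uses large language models, neither of which is reproduced here). What it shows is the efficiency argument: leave-one-out attribution targets the single most influential token, rather than brute-force sampling many random perturbations.

```python
# Sketch of knowledge-guided adversarial perturbation (toy stand-ins).

def score(tokens):
    """Stand-in victim model: confidence in the 'positive' class."""
    weights = {"excellent": 0.6, "good": 0.3, "film": 0.05}
    return min(1.0, sum(weights.get(t, 0.0) for t in tokens))

# Stand-in knowledge base of semantically plausible substitutes.
KB_SUBSTITUTES = {"excellent": ["uneven"], "good": ["middling"]}

def attribution(tokens):
    """Leave-one-out attribution: score drop when a token is removed."""
    base = score(tokens)
    return [base - score(tokens[:i] + tokens[i + 1:])
            for i in range(len(tokens))]

def perturb(tokens):
    """Replace only the highest-attribution token with a KB substitute."""
    attr = attribution(tokens)
    for i in sorted(range(len(tokens)), key=lambda i: -attr[i]):
        if tokens[i] in KB_SUBSTITUTES:
            return tokens[:i] + KB_SUBSTITUTES[tokens[i]][:1] + tokens[i + 1:]
    return tokens  # no knowledge-backed perturbation found

x = "an excellent film".split()
print(score(x), perturb(x), score(perturb(x)))  # confidence drops sharply
```

One targeted, knowledge-backed edit flips the model's confidence, whereas a brute-force attack would have scored and filtered many candidate perturbations to find the same weakness.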
- Graduation Semester
- 2025-08
- Type of Resource
- Thesis
- Handle URL
- https://hdl.handle.net/2142/129847
- Copyright and License Information
- Copyright 2025 Xiaomeng Jin
Owning Collections
Graduate Dissertations and Theses at Illinois (primary)