Automated structuring of text space with minimal supervision
Zhang, Yunyi
Permalink
https://hdl.handle.net/2142/132485
Description
- Title
- Automated structuring of text space with minimal supervision
- Author(s)
- Zhang, Yunyi
- Issue Date
- 2025-11-13
- Director of Research (if dissertation) or Advisor (if thesis)
- Han, Jiawei
- Doctoral Committee Chair(s)
- Han, Jiawei
- Committee Member(s)
- Abdelzaher, Tarek
- Tong, Hanghang
- Dong, Xin Luna
- Department of Study
- Computer Science
- Discipline
- Computer Science
- Degree Granting Institution
- University of Illinois Urbana-Champaign
- Degree Name
- Ph.D.
- Degree Level
- Dissertation
- Keyword(s)
- Text Mining
- Natural Language Processing
- Information Retrieval
- Text Classification
- Large Language Models
- Abstract
- Our society is immersed in massive amounts of unstructured text data, posing great challenges for people who need to fetch relevant data, digest critical information, and derive actionable knowledge. The need for text space structuring has attracted substantial research, but we are still far from solving the real problem. Recent advances in deep learning and large pre-trained language models have brought great progress in natural language understanding. However, several major challenges remain: (1) text classification still relies on a substantial amount of labeled training data; (2) most current text classifiers are confined to a small number (e.g., fewer than 20) of single-layered, coarse-grained classes, whereas real needs are at a fine-grained level; (3) existing methods classify text along a single dimension, but real-world applications often involve multiple orthogonal dimensions, such as finding news articles by topic, location, and time simultaneously. Only by structuring text along each of these dimensions with a sufficiently detailed taxonomy can we generate ready-to-use structured knowledge from unstructured text data. To bridge this gap, this dissertation develops weakly supervised methods that structure the text space in a multi-granular and multi-aspect way. To accomplish this goal, the following tasks are studied.
  1. Taxonomy Construction and Enrichment. Constructing a hierarchical representation of knowledge from textual data is a crucial first step toward a structured text space. While existing works rely on substantial human effort to maintain such a taxonomic structure, I introduce several works that automatically build and enrich taxonomies, which serve as a preliminary for the later parts.
  2. Weakly-Supervised Text Classification. Given the class surface names as the only supervision, weakly-supervised text classification aims to train a text classifier that tags each document with one or more classes. I developed two methods in this direction: PIEClass, a weakly-supervised flat text classification method that proposes a noise-robust self-training procedure combining different fine-tuning strategies of pre-trained language models, and TELEClass, a weakly-supervised hierarchical text classification method that explores how large generative models can understand large hierarchical label spaces and combines their power with corpus-based knowledge.
  3. Text Classification with Temporal Information. Classifying text along one dimension is often insufficient for digesting the text space, so I further study text classification in the temporal dimension. The event discovery task aims to find clusters of news articles that are thematically similar and temporally close to each other, which likely indicate a real-world event. I introduce EvMine, which first identifies peak phrases in the temporal dimension as candidate key events and then classifies relevant documents for each event.
  4. Application of Structured Text Spaces: Scientific Paper Retrieval. After constructing a structured text space, we study how it can benefit an essential downstream application, scientific paper retrieval. I propose SemRank, a plug-and-play ranking method that combines large language models with structured text corpora in the form of semantic indexes to improve the retrieval performance of base retrievers.
  Overall, these components collectively contribute to an automated framework for structuring text spaces with minimal human supervision in the era of large language models.
- Graduation Semester
- 2025-12
- Type of Resource
- Thesis
- Handle URL
- https://hdl.handle.net/2142/132485
- Copyright and License Information
- Copyright 2025 Yunyi Zhang
Owning Collections
Graduate Dissertations and Theses at Illinois PRIMARY