Automated structuring of text space with minimal supervision
Zhang, Yunyi
Permalink
https://hdl.handle.net/2142/132485
Description
- Title
- Automated structuring of text space with minimal supervision
- Author(s)
- Zhang, Yunyi
- Issue Date
- 2025-11-13
- Director of Research (if dissertation) or Advisor (if thesis)
- Han, Jiawei
- Doctoral Committee Chair(s)
- Han, Jiawei
- Committee Member(s)
- Abdelzaher, Tarek
- Tong, Hanghang
- Dong, Xin Luna
- Department of Study
- Computer Science
- Discipline
- Computer Science
- Degree Granting Institution
- University of Illinois Urbana-Champaign
- Degree Name
- Ph.D.
- Degree Level
- Dissertation
- Keyword(s)
- Text Mining
- Natural Language Processing
- Information Retrieval
- Text Classification
- Large Language Models
- Abstract
- Our society is immersed in massive amounts of unstructured text data, posing great challenges for people who need to fetch relevant data, digest critical information, and derive actionable knowledge. The need for text space structuring has attracted substantial research, but we are still far from solving the real problem. Recent advances in deep learning and large pre-trained language models have brought great progress in natural language understanding. However, several major challenges remain: (1) text classification still relies on a substantial amount of labeled training data; (2) most current text classifiers are confined to a small number (e.g., fewer than 20) of single-layered, coarse-grained classes, whereas real needs are at a fine-grained level; (3) existing methods classify text along a single dimension, but real-world applications often involve multiple orthogonal dimensions, such as finding news articles by topic, location, and time simultaneously. Only by structuring text along each of these dimensions with a sufficiently detailed taxonomy can we generate ready-to-use structured knowledge from unstructured text data. To bridge this gap, this dissertation develops weakly supervised methods that structure the text space in a multi-granular and multi-aspect way. To accomplish this goal, the following tasks are studied.
  1. Taxonomy Construction and Enrichment. Constructing a hierarchical representation of knowledge from textual data is a crucial first step toward a structured text space. While existing works rely on substantial human effort to maintain such a taxonomic structure, I introduce several works that automatically build and enrich taxonomies, which serve as a preliminary for the later parts.
  2. Weakly-Supervised Text Classification. Given the class surface names as the only supervision, weakly-supervised text classification aims to train a text classifier that tags each document with one or more classes. I developed two methods in this direction: PIEClass, a weakly-supervised flat text classification method that proposes a noise-robust self-training procedure combining different fine-tuning strategies of pre-trained language models, and TELEClass, a weakly-supervised hierarchical text classification method that explores how large generative models can understand large hierarchical label spaces and combines their power with corpus-based knowledge.
  3. Text Classification with Temporal Information. Classifying text along one dimension is often insufficient for digesting the text space, so I further study text classification in the temporal dimension. The event discovery task aims to find clusters of news articles that are thematically similar and temporally close to each other, which likely indicate a real-world event. I introduce EvMine, which first identifies peak phrases in the temporal dimension as candidate key events and then classifies relevant documents for each event.
  4. Application of Structured Text Spaces: Scientific Paper Retrieval. After constructing a structured text space, we study how it can benefit an essential downstream application, scientific paper retrieval. I propose SemRank, a plug-and-play ranking method that combines large language models with structured text corpora in the form of semantic indexes to improve the retrieval performance of base retrievers.
  Overall, these components collectively contribute to an automated framework for structuring text spaces with minimal human supervision in the era of large language models.
- Graduation Semester
- 2025-12
- Type of Resource
- Thesis
- Handle URL
- https://hdl.handle.net/2142/132485
- Copyright and License Information
- Copyright 2025 Yunyi Zhang
Owning Collections
Graduate Dissertations and Theses at Illinois PRIMARY