An ontology-guided, language model-assisted approach for theme-specific information extraction
Xiao, Jinfeng
Permalink
https://hdl.handle.net/2142/129450
Description
- Title
- An ontology-guided, language model-assisted approach for theme-specific information extraction
- Author(s)
- Xiao, Jinfeng
- Issue Date
- 2025-04-25
- Director of Research (if dissertation) or Advisor (if thesis)
- Han, Jiawei
- Doctoral Committee Chair(s)
- Han, Jiawei
- Committee Member(s)
- Zhai, ChengXiang
- Ji, Heng
- Elkaref, Mohab
- Department of Study
- Computer Science
- Discipline
- Computer Science
- Degree Granting Institution
- University of Illinois Urbana-Champaign
- Degree Name
- Ph.D.
- Degree Level
- Dissertation
- Keyword(s)
- Large language models
- Information extraction
- Information retrieval
- Ontology
- Taxonomy
- Abstract
- In an era of information explosion, text data are generated so quickly that digesting raw text becomes increasingly time-consuming. The research field of text mining aims to automatically extract structured knowledge from massive unstructured text data. An essential step in converting text to knowledge is information extraction (IE), which extracts information about entities and relations from text. In recent years, the development of large language models (LLMs) has opened new possibilities for better IE under label-sparse settings, which are hard, if not impossible, for traditional methods because the human supervision needed for model training is lacking. On the other hand, LLM-assisted domain-specific IE applications often face long-tail challenges: performance is suboptimal because many important signals (e.g., entities and relations) in the target domain are sparse in the pre-training corpora and are thus not handled well by pre-trained LLMs. Such long-tail problems become more severe when we move one step further from domains into highly specialized themes. My dissertation research explores the use of theme-specific entity type ontologies to improve the long-tail robustness of LLMs for IE tasks. I use such ontologies to 1) retrieve theme-relevant documents from large corpora based on the distribution of query entities and relations on the ontology, 2) find entities in large corpora that have fine-grained entity types relevant to the query entities, 3) retrieve relevant external, non-parametric knowledge in a structured manner to augment LLM queries, and 4) organize the internal, parametric knowledge of LLMs into relation structures for robust open relation extraction. While the first and third components focus more on information retrieval (IR), all the components together form an end-to-end framework for corpus-level theme-specific IE.
Extensive experiments demonstrate that my solutions achieve state-of-the-art performance with remarkable long-tail robustness. Another notable advantage of my approach is that it works purely at the LLM inference stage: no extra pre-training or post-training is required, so it needs no additional annotation effort or heavy resource consumption. All my methods are easy to deploy on a single graphics processing unit (GPU) with any recent LLM, and thus my solutions are highly accessible to ordinary people. In addition, my design essentially separates data from models, which ensures there is technically zero risk of leaking private user data. Thus, my approach is highly suitable for sensitive domain applications such as medical care and legal advising.
- Graduation Semester
- 2025-05
- Type of Resource
- Thesis
- Handle URL
- https://hdl.handle.net/2142/129450
- Copyright and License Information
- Copyright 2025 Jinfeng Xiao
Owning Collections
Graduate Dissertations and Theses at Illinois PRIMARY
Graduate Theses and Dissertations at Illinois