Completing the knowledge lifecycle for language models
Zhang, Zixuan
Description
- Title
- Completing the knowledge lifecycle for language models
- Author(s)
- Zhang, Zixuan
- Issue Date
- 2024-07-08
- Director of Research (if dissertation) or Advisor (if thesis)
- Ji, Heng
- Doctoral Committee Chair(s)
- Ji, Heng
- Committee Member(s)
- Zhai, Chengxiang
- Tong, Hanghang
- Small, Kevin
- Yih, Scott Wen-tau
- Department of Study
- Siebel Computing & Data Science
- Discipline
- Computer Science
- Degree Granting Institution
- University of Illinois at Urbana-Champaign
- Degree Name
- Ph.D.
- Degree Level
- Dissertation
- Keyword(s)
- Language model
- Knowledge and language
- Abstract
- In recent years, language models (LMs) have achieved remarkable success, driven by the highly effective Transformer architecture. Supported by in-context learning and language model scaling laws, nearly all natural language processing (NLP) tasks can be cast as language modeling problems and solved with exceptionally large and powerful language models. However, current language models still face critical challenges, such as hallucination and the difficulty of adapting trained models to frequent updates in knowledge. In this dissertation, we identify two root causes of these challenges: implicit knowledge representation and sparse knowledge distribution. First, knowledge is represented implicitly as parameters in language models, which makes it challenging to identify, locate, and edit a specific piece of knowledge inside a model. Second, real-world new knowledge is distributed sparsely across huge amounts of unstructured data, making it difficult to extract and consolidate useful knowledge without an effective knowledge extraction system. In short, significant flaws remain despite the great success of language model research. In this dissertation, we aim to tackle these challenges by completing the knowledge lifecycle: establishing a knowledge-oriented updating cycle that enables existing language models to continuously improve by extracting new knowledge, eliciting existing internal knowledge, and subsequently integrating the updates back into the model. We propose three steps to enable this self-improvement ability: Knowledge Extraction and Summarization, Knowledge Elicitation from Language Models, and Knowledge Update and Integration. First, we develop a powerful extraction system capable of extracting and summarizing useful knowledge from vast amounts of real-world data. Next, we create algorithms that efficiently draw out the implicit knowledge already embedded within language models. Finally, we merge the elicited knowledge with the newly extracted information, manage updates and resolve conflicts, and integrate the refined knowledge back into the model to complete an updating cycle (a minimal sketch of this cycle follows the description below). Toward the goal of building such a self-updating cycle, we make a number of novel contributions. For extracting structured knowledge from textual documents, we build RESIN, the first cross-lingual, multi-document, multimedia information extraction system, which can process hundreds of multimedia document clusters at scale and generate high-quality event graphs. For knowledge elicitation, we propose sparse latent typing (SLT), a pre-training objective designed to encourage language models to develop a structured understanding of the pre-training text. We also study the knowledge update problem in both semi-parametric (retrieval-augmented generation, RAG) and fully parametric settings, and propose effective methods that improve the model's generalizability upon knowledge updates by mitigating knowledge over-memorization. Extensive experiments demonstrate that language models can be significantly improved through this knowledge updating cycle.
- Graduation Semester
- 2024-08
- Type of Resource
- Thesis
- Handle URL
- https://hdl.handle.net/2142/125564
- Copyright and License Information
- Copyright 2024 Zixuan Zhang
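
The extract-elicit-integrate cycle from the abstract can be made concrete with a small sketch. The Python below is a hypothetical illustration only: the `Fact` triple, the pattern-matching extractor, and the dict-backed "model" are stand-ins invented for this example, not the dissertation's actual RESIN, SLT, or knowledge-editing interfaces.

```python
# Toy sketch of the three-step knowledge lifecycle: (1) extract new facts
# from unstructured text, (2) elicit what the model currently believes,
# (3) resolve conflicts and integrate updates back into the model.
# All names here are illustrative, not the dissertation's real APIs.
from dataclasses import dataclass


@dataclass(frozen=True)
class Fact:
    """A knowledge triple: (subject, relation, object)."""
    subject: str
    relation: str
    obj: str


def extract_knowledge(documents: list[str]) -> set[Fact]:
    """Step 1: extract structured facts from unstructured text.

    A real pipeline (e.g. multi-document information extraction, as
    RESIN does for events) does far more; this stub matches one pattern.
    """
    facts = set()
    for doc in documents:
        parts = doc.rstrip(".").split(" is the capital of ")
        if len(parts) == 2:
            city, country = parts
            facts.add(Fact(country, "has_capital", city))
    return facts


def elicit_knowledge(model: dict) -> set[Fact]:
    """Step 2: elicit the model's current (implicit) knowledge.

    The "model" is simulated as a lookup table; in practice this step
    would prompt or probe an actual language model.
    """
    return {Fact(s, r, o) for (s, r), o in model.items()}


def integrate(model: dict, new_facts: set[Fact]) -> None:
    """Step 3: prefer newer evidence on conflict and write the result
    back into the model, closing the updating cycle."""
    for f in new_facts:
        model[(f.subject, f.relation)] = f.obj


if __name__ == "__main__":
    model = {("Australia", "has_capital"): "Sydney"}      # stale belief
    docs = ["Canberra is the capital of Australia."]      # new evidence

    new = extract_knowledge(docs)
    old = elicit_knowledge(model)
    conflicts = {f for f in old
                 if any(f.subject == g.subject and f.relation == g.relation
                        and f.obj != g.obj for g in new)}
    print("conflicting beliefs:", conflicts)
    integrate(model, new)
    print("updated model:", model)  # now maps Australia -> Canberra
```

In the dissertation's setting, step 3 is the hard part: the "model" is not a dict but billions of parameters, which is why the work studies both semi-parametric (RAG) and fully parametric update strategies.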
Owning Collections
Graduate Dissertations and Theses at Illinois (Primary)
Graduate Theses and Dissertations at Illinois