Weakly supervised text mining with text-rich networks
Zhang, Xinyang
Permalink
https://hdl.handle.net/2142/125539
Description
- Title
- Weakly supervised text mining with text-rich networks
- Author(s)
- Zhang, Xinyang
- Issue Date
- 2024-06-27
- Director of Research (if dissertation) or Advisor (if thesis)
- Han, Jiawei
- Doctoral Committee Chair(s)
- Han, Jiawei
- Committee Member(s)
- Sundaram, Hari
- Tong, Hanghang
- Dong, Xin Luna
- Department of Study
- Computer Science
- Discipline
- Computer Science
- Degree Granting Institution
- University of Illinois at Urbana-Champaign
- Degree Name
- Ph.D.
- Degree Level
- Dissertation
- Keyword(s)
- Language models
- text mining
- text-rich networks
- weak supervision
- Abstract
- The advent of the information age has brought about an unprecedented surge in the volume of data available at our fingertips. A significant portion of this data is manifested in the form of semi-structured text corpora, which are characterized by the interplay between unstructured text and structured metadata. These corpora are ubiquitous across various domains, presenting unique challenges for knowledge discovery and data mining. For instance, social media platforms are a complex blend of users, user-generated content, and the intricate web of interactions between them. Similarly, academic publication databases comprise a rich tapestry of published content, author information, and venue details. The sheer scale and heterogeneity of these semi-structured text corpora necessitate the development of intelligent and efficient techniques to extract, organize, and summarize the valuable knowledge embedded within them. Traditional approaches to text mining often rely heavily on extensive human annotation or manually curated knowledge bases, which are not only time-consuming and expensive but also difficult to adapt to new domains. To fully harness the potential of these vast information resources, we propose a novel approach that leverages the inherent structure of the data itself. We introduce a framework based on text-rich networks, which effectively integrates unstructured documents with structured metadata to create a comprehensive representation for data mining. By interconnecting documents based on shared attributes, such as linking research papers authored by the same individual, we construct a unified network that captures both the textual content and the relationships between entities. This approach opens up new possibilities for knowledge discovery, enabling us to uncover valuable patterns and insights that may be difficult to detect using traditional methods. 
The text-rich network provides a natural encoding of document relevance, allowing us to develop weakly supervised text mining applications that require minimal human intervention. This methodology reduces the reliance on large amounts of labeled data, making it more scalable and adaptable to various domains. Through the use of text-rich networks, we aim to enhance the efficiency and effectiveness of knowledge discovery in semi-structured text corpora, facilitating data-driven insights and innovation in the field of text mining.
My research consists of two main areas of investigation: (1) construction and consolidation of a text-rich network, and (2) mining of a text-rich network. In the first area, we focus on building a robust text-rich network structure that relies on the availability of structured information alongside text. When such information is unavailable or incomplete, we investigate weakly supervised methods to extract structured information.
1. Open-World Attribute Value Extraction: We propose a novel approach for open-world attribute value extraction, which enables us to identify entities and attributes that facilitate the construction of a comprehensive text-rich network. This method addresses the challenges of incomplete or missing structured information in semi-structured text corpora.
In the second area, we explore various methods built on top of a fully constructed text-rich network for fundamental text representation and weakly supervised text mining applications.
2. Language Model Pre-training with Text-Rich Networks: We introduce a novel language model pre-training approach that utilizes the text-rich network structure to capture more meaningful representations of text data. By incorporating the network’s contextual information during the pre-training process, we aim to improve the effectiveness of downstream natural language processing tasks.
3. Text-Rich Network-Driven Weakly Supervised Text Classification: We develop a framework that leverages the rich structural information embedded in the text-rich network to enhance the performance of text classification tasks. This approach reduces the reliance on large amounts of labeled data, making it more adaptable to real-world scenarios.
Together, these components create a cohesive framework for weakly supervised text mining with text-rich networks. Our open-world attribute value extraction method strengthens the construction and consolidation of text-rich networks, while our text classification and language model pre-training approaches demonstrate the power of mining these networks for various applications. By integrating these techniques, we establish a comprehensive methodology that enables efficient and effective knowledge discovery in semi-structured text corpora, reducing the reliance on human annotation and making it more adaptable to real-world scenarios.
- Graduation Semester
- 2024-08
- Type of Resource
- Thesis
- Handle URL
- https://hdl.handle.net/2142/125539
- Copyright and License Information
- Copyright 2024, Xinyang Zhang
Owning Collections
Graduate Dissertations and Theses at Illinois PRIMARY
Graduate Theses and Dissertations at Illinois