Constructing and mining structured heterogeneous information networks from massive text corpora

Shang, Jingbo

Permalink

https://hdl.handle.net/2142/106218

Description

Title

Constructing and mining structured heterogeneous information networks from massive text corpora

Author(s)

Shang, Jingbo

Issue Date

2019-11-26

Director of Research (if dissertation) or Advisor (if thesis)

Han, Jiawei

Doctoral Committee Chair(s)

Han, Jiawei

Committee Member(s)

Abdelzaher, Tarek
Peng, Jian
Korn, Flip

Department of Study

Computer Science

Discipline

Computer Science

Degree Granting Institution

University of Illinois at Urbana-Champaign

Degree Name

Ph.D.

Degree Level

Dissertation

Date of Ingest

2020-03-02T21:58:17Z

Keyword(s)

AutoNet
Phrase Mining
Entity Recognition
Topic Taxonomy
Automated
Distant Supervision

Abstract

In today's information society, we are inundated with natural-language text data, ranging from news articles and social media posts to research literature, medical records, and corporate reports. A grand challenge for data miners is to develop effective and scalable methods that mine such massive unstructured text corpora to discover hidden structures and generate structured heterogeneous information networks, from which actionable knowledge can be derived according to users' needs. Three major questions arise. Can machines automatically "digest" a given (domain-specific) corpus and identify the real-world entities and relations mentioned in it? Can human experts efficiently understand and consume the sophisticated, gigantic structured networks constructed by machines? Can such machine-extracted information benefit downstream applications in various fields? The massive and messy nature of text data poses significant challenges to creating techniques for automatic processing and algorithmic analysis of content that scale with text volume. State-of-the-art information extraction approaches rely on heavy task-specific annotations (e.g., annotating terrorist attack-related entities in web forum posts written in Arabic) to build (deep) machine learning models. In contrast, our research harnesses "the power of massive data" and develops a family of data-driven approaches for automatic knowledge discovery. To alleviate the need for heavy human annotation, our methods utilize distant supervision from existing, open knowledge bases and statistical signals (e.g., frequency and pointwise mutual information) computed from massive corpora. Such approaches are therefore general and extensible to text corpora in multiple languages and across multiple domains. The goal of our research is to create general data-driven methods that transform text data of various kinds into structured databases of human knowledge.
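As an illustration of the kind of statistical signal mentioned above, the following is a minimal sketch of pointwise mutual information (PMI) for adjacent word pairs. It is not the thesis's implementation: the function name `bigram_pmi`, the toy corpus, and the plain maximum-likelihood probability estimates are all assumptions made for this example.

```python
import math
from collections import Counter


def bigram_pmi(tokens):
    """Return {(w1, w2): PMI} for every adjacent word pair in tokens.

    Hypothetical helper for illustration only. Probabilities are simple
    relative frequencies: unigrams over the token count, bigrams over the
    number of adjacent pairs.
    """
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    n_uni = len(tokens)
    n_bi = max(len(tokens) - 1, 1)
    scores = {}
    for (w1, w2), count in bigrams.items():
        p_xy = count / n_bi
        p_x = unigrams[w1] / n_uni
        p_y = unigrams[w2] / n_uni
        # PMI(x, y) = log( p(x, y) / (p(x) * p(y)) )
        scores[(w1, w2)] = math.log(p_xy / (p_x * p_y))
    return scores


# Toy corpus (an assumption for this sketch, not data from the thesis).
corpus = "data mining finds patterns and data mining scales to big data".split()
scores = bigram_pmi(corpus)
```

A positive score means the pair co-occurs more often than independence would predict. PMI alone tends to favor rare pairs, which may be one reason to combine it with frequency and other signals, as the abstract describes.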
This thesis outlines an automated framework, AutoNet, which focuses on automatically extracting structured networks of entities and relations embedded in a large-scale text corpus. In addition, it constructs a high-quality topic taxonomy for more efficient human exploration. The key philosophy of "automatic" here is to extract high-quality structured knowledge and insights with little human effort. Specifically, AutoNet first identifies corpus-wide high-quality phrases. From these phrases, we distinguish typed entities and relational phrases, and further connect entities by relational phrases. In this way, we organize the entities and relations as heterogeneous information networks. Such networks extracted from massive text corpora are typically gigantic: millions of nodes and billions of edges are common. Therefore, after the networks are constructed, we propose to construct a topic taxonomy to make human exploration more efficient. The topic taxonomy organizes the network in a structured way, so human experts can have a bird's-eye view of the whole network and easily drill down to the particular sub-network of interest. We attempt to make the whole AutoNet framework automated (i.e., it saves human annotation effort), robust (i.e., it is effective across multiple languages and domains), and scalable (i.e., it works for web-scale input). In this thesis, we mainly cover automated, robust, and scalable models and real-world applications for three main problems.

(1) Mining High-Quality Phrases. We first show how to unify multiple statistical signals to estimate phrase quality using a classifier based on weak supervision. We then improve the accuracy of quality estimation by rectifying the frequencies based on phrasal segmentation results. We further demonstrate that some public knowledge bases (e.g., Wikipedia) can replace the weak supervision and even lead to better results. Such phrase mining methods are purely data-driven, and thus domain-agnostic and language-independent.

(2) Recognizing Named Entities. Recent advances in deep neural models for named entity recognition have freed human effort from handcrafting features. Moving one step further, we show that using existing entity dictionaries (i.e., entity type, entity name, and some synonyms) can achieve entity recognition performance competitive with state-of-the-art supervised methods. We believe such distantly supervised entity recognition models can serve as initial deployments in various applications and provide a solid foundation for active learning and further human annotation, which could save tremendous human effort.

(3) Building Topic Taxonomies. Unlike existing methods that use text data or (extracted) network structures separately, we propose to let the text collaborate with network structures. Specifically, we combine these two types of data as text-rich networks, and then construct a topic taxonomy to obtain a holistic view of all data. We employ motif patterns (i.e., subgraph patterns at the type schema level) to represent information from networks, and further conduct an instance-level selection to choose relevant information. Based on textual contexts and selected motif instances, we learn term embeddings jointly from text and network, and then obtain term clusters as taxonomy nodes. As a result, the constructed taxonomy is more accurate than taxonomies built from text or network data alone.
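To make the idea of dictionary-based distant supervision in problem (2) more concrete, here is a minimal sketch that turns an entity dictionary into token-level BIO labels by greedy longest matching. It is an illustration only, not the models described in the thesis: the function `dictionary_labels`, the `max_len` parameter, the toy dictionary, and the exact-match strategy are all assumptions. Labels produced this way are typically noisy and incomplete, and would serve as training signal rather than as final predictions.

```python
def dictionary_labels(tokens, dictionary, max_len=5):
    """Tag tokens with BIO labels by greedy longest match against a dictionary.

    Hypothetical helper for illustration only. `dictionary` maps a tuple of
    lowercased tokens to an entity type; `max_len` caps the span length tried.
    """
    labels = ["O"] * len(tokens)
    i = 0
    while i < len(tokens):
        matched = False
        # Try the longest candidate span first, then shrink.
        for n in range(min(max_len, len(tokens) - i), 0, -1):
            span = tuple(t.lower() for t in tokens[i:i + n])
            entity_type = dictionary.get(span)
            if entity_type is not None:
                labels[i] = "B-" + entity_type
                for j in range(i + 1, i + n):
                    labels[j] = "I-" + entity_type
                i += n
                matched = True
                break
        if not matched:
            i += 1
    return labels


# Toy dictionary and sentence (assumptions for this sketch).
toy_dictionary = {
    ("prostaglandin",): "Chemical",
    ("heart", "failure"): "Disease",
}
tokens = "Prostaglandin may worsen heart failure".split()
labels = dictionary_labels(tokens, toy_dictionary)
```

Tokens that match no dictionary entry are tagged "O" here; in practice some of them may be entities that the dictionary simply lacks, which is one source of noise a distantly supervised model has to tolerate.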

Graduation Semester

2019-12

Type of Resource

text

Owning Collections

Graduate Dissertations and Theses at Illinois PRIMARY

Graduate Theses and Dissertations at Illinois

Dissertations and Theses - Computer Science

Dissertations and Theses from the Siebel School of Computer Science
