Files in this item



application/pdf SHANG-DISSERTATION-2019.pdf (8MB)


Title:Constructing and mining structured heterogeneous information networks from massive text corpora
Author(s):Shang, Jingbo
Director of Research:Han, Jiawei
Doctoral Committee Chair(s):Han, Jiawei
Doctoral Committee Member(s):Abdelzaher, Tarek; Peng, Jian; Korn, Flip
Department / Program:Computer Science
Discipline:Computer Science
Degree Granting Institution:University of Illinois at Urbana-Champaign
Subject(s):Phrase Mining; Entity Recognition; Topic Taxonomy; Distant Supervision
Abstract:In today's information society, we are inundated with natural-language text data, ranging from news articles and social media posts to research literature, medical records, and corporate reports. A grand challenge for data miners is to develop effective and scalable methods that mine such massive unstructured text corpora, discover hidden structures, and generate structured heterogeneous information networks from which actionable knowledge can be derived according to users' needs. Three major questions arise. Can machines automatically "digest" a given (domain-specific) corpus and identify the real-world entities and relations mentioned in it? Can human experts efficiently understand and consume the sophisticated, gigantic structured networks constructed by machines? Can such machine-extracted information benefit downstream applications in various fields? The massive and messy nature of text data poses significant challenges to creating techniques for automatic processing and algorithmic analysis of content that scale with text volume. State-of-the-art information extraction approaches rely on heavy task-specific annotation (e.g., annotating terrorist attack-related entities in web forum posts written in Arabic) to build (deep) machine learning models. In contrast, our research harnesses "the power of massive data" and develops a family of data-driven approaches for automatic knowledge discovery. To alleviate the need for heavy human annotation, our methods utilize distant supervision from existing, open knowledge bases and statistical signals (e.g., frequency and point-wise mutual information) computed from massive corpora. Such approaches are therefore general and extensible to text corpora in multiple languages and across multiple domains. The goal of our research is to create general data-driven methods that transform text data of various kinds into structured databases of human knowledge.
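As a rough illustration of the statistical signals mentioned above, the following sketch computes point-wise mutual information (PMI) for adjacent word pairs in a toy corpus. It is not the thesis's implementation: the function name `bigram_pmi`, the toy corpus, and the simple relative-frequency estimates are all assumptions made for this example.

```python
import math
from collections import Counter


def bigram_pmi(tokens):
    """Return {(w1, w2): PMI} for every adjacent word pair in tokens.

    PMI(w1, w2) = log2( P(w1, w2) / (P(w1) * P(w2)) ), with probabilities
    estimated by relative frequency (an assumption of this sketch).
    """
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    n_uni = len(tokens)
    n_bi = max(len(tokens) - 1, 1)
    scores = {}
    for (w1, w2), count in bigrams.items():
        p_xy = count / n_bi
        p_x = unigrams[w1] / n_uni
        p_y = unigrams[w2] / n_uni
        scores[(w1, w2)] = math.log2(p_xy / (p_x * p_y))
    return scores


# Hypothetical toy corpus; a real corpus would contain millions of tokens.
corpus = ("support vector machine is a model and "
          "support vector machine is popular").split()
scores = bigram_pmi(corpus)
```

A positive score means the pair co-occurs more often than independence would predict. PMI alone favors rare pairs, which is one reason to combine it with frequency, as the abstract suggests.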
This thesis outlines an automated framework, AutoNet, which automatically extracts structured networks of entities and relations embedded in a large-scale text corpus. In addition, it constructs a high-quality topic taxonomy for more efficient human exploration. The key philosophy of "automatic" here is to extract high-quality structured knowledge and insights with little human effort. Specifically, the framework first identifies corpus-wide high-quality phrases. From these phrases, we distinguish typed entities and relational phrases, and then connect entities by relational phrases. In this way, we organize the entities and relations as heterogeneous information networks. Networks extracted from massive text corpora are typically gigantic: millions of nodes and billions of edges are common. Therefore, after the networks are constructed, we propose to construct a topic taxonomy to make human exploration more efficient. The topic taxonomy organizes the network in a structured way, so human experts can have a bird's-eye view of the whole network and easily drill down to the particular sub-network of interest. We attempt to make the whole AutoNet framework automated (i.e., it saves human annotation effort), robust (i.e., it is effective across multiple languages and domains), and scalable (i.e., it works for web-scale input). In this thesis, we mainly cover automated, robust, and scalable models and real-world applications for three main problems:
(1) Mining High-Quality Phrases. We first show how to unify multiple statistical signals to estimate phrase quality using a classifier based on weak supervision. We then improve the accuracy of quality estimation by rectifying the frequencies based on phrasal segmentation results. We further demonstrate that some public knowledge bases (e.g., Wikipedia) can replace the weak supervision and even lead to better results. Such phrase mining methods are purely data-driven and are thus domain-agnostic and language-independent.
(2) Recognizing Named Entities. Recent advances in deep neural models for named entity recognition have freed humans from handcrafting features. Moving one step further, we show that using existing entity dictionaries (i.e., entity type, entity name, and some synonyms) can achieve entity recognition performance competitive with state-of-the-art supervised methods. We believe such distantly supervised entity recognition models can serve as initial deployments in various applications and provide a solid foundation for active learning and further human annotation. This could save tremendous human effort.
(3) Building Topic Taxonomies. Unlike existing methods, which use text data or (extracted) network structures separately, we propose to let the text collaborate with network structures. Specifically, we combine these two types of data as text-rich networks, and then construct a topic taxonomy to obtain a holistic view of all data. We employ motif patterns (i.e., subgraph patterns at the type schema level) to represent information from networks, and further conduct an instance-level selection to choose relevant information. Based on textual contexts and selected motif instances, we learn term embeddings jointly from text and network, and then obtain term clusters as taxonomy nodes. As a result, the constructed taxonomy is more accurate than those built from text or network alone.
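To make the dictionary-based distant supervision described under (2) more concrete, the sketch below shows one simple way an entity dictionary could produce noisy token-level labels without human annotation. The abstract does not specify the matching or tagging scheme, so the longest-match lookup, the BIO tags, the function name `distant_labels`, and the toy dictionary are all assumptions of this example rather than the thesis's model.

```python
def distant_labels(tokens, dictionary, max_len=3):
    """Assign BIO tags by longest-match lookup against an entity dictionary.

    dictionary maps a lowercased surface form to an entity type.
    Tokens not covered by any dictionary entry are tagged "O".
    """
    tags = ["O"] * len(tokens)
    i = 0
    while i < len(tokens):
        matched = False
        # Try the longest candidate span first, then shorter ones.
        for n in range(min(max_len, len(tokens) - i), 0, -1):
            surface = " ".join(tokens[i:i + n]).lower()
            if surface in dictionary:
                etype = dictionary[surface]
                tags[i] = "B-" + etype
                for j in range(i + 1, i + n):
                    tags[j] = "I-" + etype
                i += n
                matched = True
                break
        if not matched:
            i += 1
    return tags


# Hypothetical dictionary and sentence, for illustration only.
toy_dictionary = {"aspirin": "CHEMICAL", "heart attack": "DISEASE"}
sentence = "Aspirin may reduce the risk of heart attack".split()
tags = distant_labels(sentence, toy_dictionary)
```

Labels produced this way are incomplete and noisy (entities missing from the dictionary are tagged "O"), so they would normally serve as training signal for a model rather than as final output.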
Issue Date:2019-11-26
Rights Information:Copyright 2019 Jingbo Shang
Date Available in IDEALS:2020-03-02
Date Deposited:2019-12
