Files in this item



application/pdf: SHEN-DISSERTATION-2021.pdf (6MB), Restricted to U of Illinois


Title: Automated taxonomy discovery and exploration
Author(s): Shen, Jiaming
Director of Research: Han, Jiawei
Doctoral Committee Chair(s): Han, Jiawei
Doctoral Committee Member(s): Ji, Heng; Zhai, ChengXiang; Vanni, Michelle T.
Department / Program: Computer Science
Discipline: Computer Science
Degree Granting Institution: University of Illinois at Urbana-Champaign
Keyword(s): Data Mining; Natural Language Processing
Abstract: In an era of information explosion, people are inundated with vast amounts of text data. Every day, there are thousands of scientific papers, tens of thousands of news articles, corporate reports, and millions of social media posts produced and shared worldwide. Turning those massive text data into actionable knowledge is an essential research issue in data science and lays the foundation for realizing machine intelligence. The goal of my research is to unleash hidden knowledge buried in unstructured text. To bring this vision to reality, I propose to first structure raw text using taxonomies and then analyze structured text in a more fine-grained and semantic way.

Due to the diversity of application scenarios, different corpora or different use cases may call for different taxonomies. For example, one analyst aiming to find experts in different scientific areas may want a field-of-study taxonomy, while another analyst who studies the technology readiness may call for a taxonomy capturing technology dependencies. Moreover, even within one taxonomy, we also enable users to organize concepts at their will, such as with different levels containing concepts of different categories. For instance, in a computer science taxonomy, top levels could be about the field of studies, intermediate levels may discuss research tasks, and the bottom levels can cover evaluation metrics. Asking human experts to manually curate those taxonomies, one for every possible application, is time-consuming, costly, and unscalable. Therefore, we propose to automatically discover and explore taxonomies based on the datasets and applications, with critical but minimal human guidance.

This thesis outlines a data-driven approach that automatically constructs, enriches, and applies taxonomies for unleashing knowledge from massive unstructured text. Particularly, we investigate four areas of research, including:

(1) Identifying Concept Sets. To obtain concept nodes in the taxonomy, we first develop a collection of concept set expansion methods [1, 2] to extract concepts from text corpora by expanding a small set of seed concepts into a complete list of concepts that belong to the same semantic class.
(2) Recognizing Taxonomic Relations. To organize the above-identified concepts into a hierarchical structure, we propose a set of taxonomy construction methods [3, 4] to discover taxonomic relations among concepts by analyzing example relation instances (i.e., concept pairs indicating the target relation semantics) and utilizing distant supervision from existing, open-domain knowledge bases.
(3) Enriching Existing Taxonomies. As human knowledge is constantly growing, a static taxonomy may fail to capture emerging user needs. Thus, a taxonomy enrichment step would be essential to keep our taxonomies up-to-date in real-world applications. We facilitate this process by expanding the taxonomy to incorporate new concepts [5, 6, 7].
(4) Empowering Knowledge-centric Applications. After an up-to-date taxonomy is obtained, we develop principled methods to distill knowledge from taxonomies for downstream applications such as text categorization [8, 9] and intelligent literature search [10, 11].

Finally, we explore how to incorporate event knowledge into the taxonomy by automatically detecting event types from a given corpus. Together, these pieces constitute an integrated framework for leveraging taxonomies to convert massive text data into actionable knowledge.
Issue Date: 2021-12-02
Rights Information: Copyright 2021 Jiaming Shen
Date Available in IDEALS: 2022-04-29
Date Deposited: 2021-12
