Files in this item



application/pdf: Marina_Danilevsky.pdf (4MB)


Title: Discovering latent topical phrases in document collections and networks with text components: Leveraging text mining and information network analysis for human oriented applications
Author(s): Danilevsky, Marina
Director of Research: Han, Jiawei
Doctoral Committee Chair(s): Han, Jiawei
Doctoral Committee Member(s): Zhai, ChengXiang; Hockenmaier, Julia C.; Koh, Eunyee
Department / Program: Computer Science
Discipline: Computer Science
Degree Granting Institution: University of Illinois at Urbana-Champaign
Subject(s): topical phrases; topical hierarchy; mining topical keyphrases; topical community discovery
Abstract: One of the major challenges of mining topics from a large corpus is the quality of the constructed topics. While phrase-generating approaches generally produce high-quality output, they do not scale well with the size of the data. Thus, state-of-the-art solutions usually either rely on scalable unigram-generating methods, which do not produce high-quality human-readable topics, or are forced to use external knowledge bases. Furthermore, while document collections naturally contain topics at different levels of granularity (general vs. specific), very few traditional methods focus on generating high-quality hierarchical topic structures. This dissertation presents a series of approaches that directly address these challenges of generating high-quality phrase-based topics, both as a flat set and organized as a hierarchy, as well as some potential applications. First, we describe a framework that generates high-quality topics represented by integrated lists of mixed-length phrases. The key is adapting a phrase-centric view towards the construction and ranking of topical phrases. The approach is domain-independent, and requires neither expert supervision nor an external knowledge base. The framework is initially constructed to work on collections of short texts, such as titles of scientific documents. We then show how the framework can be easily and robustly extended to work on collections of longer texts, and demonstrate its applicability to human needs with a task-centric evaluation. The dissertation then addresses the need to move beyond generating a flat set of topics, and presents an approach to constructing hierarchical topics, which extends the phrase-centric approach to create high-quality phrases at varying levels of granularity. Another application of this technique is then presented: the task of entity role discovery.
By tying entities in a community to topical phrases, users are able to explicitly understand both how and why individual entities are ranked within a specific community. Finally, an extension is described: a combined approach to constructing the hierarchy that uses entity link information to improve hierarchy quality.
Issue Date: 2014-05-30
Rights Information: Copyright 2014 Marina Danilevsky
Date Available in IDEALS: 2014-05-30
Date Deposited: 2014-05
