Files in this item

MASSUNG-DISSERTATION-2017.pdf (application/pdf, 529 kB)
Title: Beyond topic-based representations for text mining
Author(s): Massung, Sean Alexander
Director of Research: Zhai, ChengXiang
Doctoral Committee Chair(s): Zhai, ChengXiang
Doctoral Committee Member(s): Hockenmaier, Julia; Roth, Dan; LaValle, Steve; Mei, Qiaozhu
Department / Program: Computer Science
Discipline: Computer Science
Degree Granting Institution: University of Illinois at Urbana-Champaign
Subject(s): Text mining; Text representation; Feature representation; Natural language processing
Abstract: A massive amount of online information is natural language text: newspapers, blog articles, forum posts and comments, tweets, scientific literature, government documents, and more. While all kinds of online information are useful in general, textual information is especially important—it is the most natural, most common, and most expressive form of information. Text representation plays a critical role in application tasks like classification and information retrieval, since the quality of the underlying feature space directly impacts each task's performance. Because of this importance, many different approaches have been developed for generating text representations. By far the most common way to generate features is to segment text into words and record their n-grams. While simple term features perform relatively well in topic-based tasks, not all downstream applications are topical in nature, nor can they be captured by words alone. For example, determining the native language of an English essay writer depends on more than just word choice. Methods that compete with topic-based representations (such as neural networks) are often not interpretable or rely on massive amounts of training data. This thesis proposes three novel contributions to generate and analyze a large space of non-topical features. First, structural parse tree features are based solely on the structural properties of a parse tree, ignoring all of the syntactic categories in the tree. An important advantage of these "skeletons" over regular syntactic features is that they can capture global tree structures without causing problems of data sparseness or overfitting. Second, SyntacticDiff explicitly captures differences in a text document with respect to a reference corpus, creating features that are easily explained as weighted word edit differences. These edit features are especially useful since they are derived from information not present in the current document, capturing a type of comparative feature. Third, Cross-Context Lexical Analysis (CCLA) is a general framework for analyzing similarities and differences in both term meaning and representation with respect to different, potentially overlapping partitions of a text collection. The representations analyzed by CCLA are not limited to topic-based features.
Issue Date: 2017-04-19
Rights Information: Copyright 2017 Sean Massung
Date Available in IDEALS: 2017-08-10
Date Deposited: 2017-05