Files in this item:
TAO-DISSERTATION-2017.pdf (8 MB, PDF)

Title: Text cube: construction, summarization and mining
Author(s): Tao, Fangbo
Director of Research: Han, Jiawei
Doctoral Committee Chair(s): Han, Jiawei
Doctoral Committee Member(s): Zhai, ChengXiang; Peng, Jian; Wang, Haixun
Department / Program: Computer Science
Discipline: Computer Science
Degree Granting Institution: University of Illinois at Urbana-Champaign
Subject(s): Text cube
Data cube
Data mining
Natural language processing
Text classification
Abstract: A large portion of real-world data is either text or structured (e.g., relational) data, and such data objects are often linked together (e.g., structured product information linked with product descriptions and customer reviews). To systematically analyze large collections of such documents, it is often desirable to manage the text data together with the associated structured data in a multi-dimensional space (hence "text cube"). This thesis studies the multi-dimensional representation of large textual data. Since Jim Gray introduced the concept of the "data cube," the data cube, together with online analytical processing (OLAP), has become a driving engine of the data warehouse industry. By modeling a large textual corpus as a "cube," i.e., a multi-dimensional and hierarchical structure, we bridge the power of traditional OLAP with Information Retrieval and Natural Language Processing techniques. In particular, this thesis focuses on two lines of work: constructing a multi-dimensional text cube from raw text data with limited user guidance, and developing effective summarization and mining techniques tailored to multi-dimensional queries on text cubes.

In the first part of the thesis, the problem of dimension-based structure creation is studied. We propose an end-to-end framework that extracts a multi-dimensional structure from a corpus, taking a domain-specific corpus and a limited set of seeds as input and producing high-quality dimension values as output. We introduce the novel concept of the Semantic Pattern Graph, which leverages web signals to understand the underlying semantics of lexical patterns, improves pattern evaluation using the mined semantics, and yields a more accurate and complete structure. Experiments show the effectiveness of our approach.

In the second part, with all dimensions discovered, we study the problem of cell-based document allocation: linking the created dimensions with the text data to construct a multi-dimensional text cube by allocating each document to the correct multi-dimensional subset, i.e., a cell. Traditional approaches to this task may require substantial labeling from the user. Instead, we propose a model that requires no training data beyond the given (label) name of each cube dimension as weak supervision. With such weak supervision, we develop a dimension-aware joint embedding framework that learns joint representations for terms, documents, and labels. In the joint embedding process, our method iteratively learns dimension-aware document representations by selectively focusing on discriminative keywords for different dimensions. Furthermore, it alleviates label sparsity by leveraging label representations to enrich the labeled term set. Numerical experiments corroborate the effectiveness of our solution.

In the third part, we introduce the concept of Context-Aware Semantic Online Analytical Processing (CASeOLAP) in text cubes and use top-k representative phrases to represent the semantics of the document subset in a text cube cell. By ranking phrases with a newly proposed measure based on three criteria (integrity, popularity, and distinctiveness), we identify phrases that digest the main content of a subset of documents of interest and contrast it with neighboring subsets. Our experiments on a large news dataset demonstrate the effectiveness of the proposed ranking measure in finding representative phrases, as well as its efficiency in both query processing time and storage cost. The approach has also been applied successfully to clinical biomarker analysis and protest news analysis.

In the last part, the EventCube system is proposed to support the end-to-end text cube pipeline in an informative, interactive, and user-friendly manner. The system serves as a general platform for construction, search, summarization, OLAP, and data mining on integrated text and structured data. It is a growing testbed for text-cube-based research and has been successfully applied at NASA for aviation safety report analysis and at the Army Research Lab for counter-terrorism report analysis. To summarize, this thesis provides important results on the construction and consumption of multi-dimensional text cubes and demonstrates their power in tackling real-world text analysis tasks.
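To make the cell-based view concrete, the following is a minimal sketch of contrastive phrase ranking in the spirit the abstract describes: given the documents of a target cell and those of its neighboring (sibling) cells, score each candidate phrase by combining popularity (relative frequency inside the cell) with distinctiveness (contrast against the siblings). All names and the scoring formula here are hypothetical simplifications for illustration, not the ranking measure defined in the thesis, which additionally accounts for phrase integrity.

```python
from collections import Counter

def rank_phrases(target_docs, sibling_docs, k=5):
    """Toy contrastive ranking: score = popularity * distinctiveness.

    Each document is represented as a list of already-extracted phrases.
    This is an illustrative simplification; the thesis's actual measure
    also incorporates phrase integrity.
    """
    target = Counter(p for doc in target_docs for p in doc)
    sibling = Counter(p for doc in sibling_docs for p in doc)
    total_t = sum(target.values())
    total_s = sum(sibling.values())

    scores = {}
    for phrase, tf in target.items():
        popularity = tf / total_t                         # frequency within the target cell
        background = sibling[phrase] / total_s if total_s else 0.0
        distinctiveness = popularity / (popularity + background)  # contrast with sibling cells
        scores[phrase] = popularity * distinctiveness
    return sorted(scores, key=scores.get, reverse=True)[:k]
```

For example, a phrase that is frequent in the target cell but absent from sibling cells gets distinctiveness 1.0 and ranks above an equally frequent phrase that also dominates the siblings.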
Issue Date: 2017-12-06
Rights Information: Copyright 2017 Fangbo Tao
Date Available in IDEALS: 2018-03-13
Date Deposited: 2017-12
