Files in this item



EL-KISHKY-DISSERTATION-2020.pdf (1 MB, application/pdf). Restricted to U of Illinois.


Title: Text mining at multiple granularity: leveraging subwords, words, phrases, and sentences
Author(s): El-Kishky, Ahmed Hassan
Director of Research: Han, Jiawei
Doctoral Committee Chair(s): Han, Jiawei; Zhai, ChengXiang; Abdelzaher, Tarek; Zhang, Joy
Department / Program: Computer Science
Discipline: Computer Science
Degree Granting Institution: University of Illinois at Urbana-Champaign
Keyword(s): data mining; nlp; cross-lingual
Abstract: With the rapid digitization of information, large quantities of text-heavy data are constantly generated in many languages and across domains such as web documents, research papers, business reviews, news, and social posts. Efficiently and effectively searching, organizing, and extracting meaningful information from these massive unstructured corpora lays the foundation for many downstream text mining and natural language processing (NLP) tasks. Traditionally, NLP and text mining techniques are applied to raw text while treating individual words as the base semantic unit. However, the assumption that individual word tokens are the correct semantic granularity does not hold for many tasks and can lead to poor task performance. To address this, this work introduces techniques for identifying and utilizing text at different semantic granularities to solve a variety of text mining and NLP tasks. The general idea is to take a text object, such as a document, and decompose it into many levels of semantic granularity such as sentences, phrases, words, or subword structures. Once the text is represented at different levels of semantic granularity, we demonstrate techniques that leverage the properly encoded text to solve a variety of NLP tasks. Specifically, this study focuses on three levels of semantic granularity: (1) subword segmentation, with an application to enriching word embeddings to address word sparsity; (2) phrase mining, with an application to phrase-based topic modeling; and (3) sentence-level granularity, with an application to finding parallel cross-lingual data.

The first granularity we study is the subword level. We introduce a subword mining problem that aims to segment individual word tokens into smaller subword structures. The motivation is that individual words are often too coarse a granularity and need to be supplemented by a finer one. Operating on these fine-grained subwords addresses important problems in NLP, namely the long-tail data-sparsity problem, whereby most words in a corpus are infrequent, and the more severe out-of-vocabulary problem. To mine these subword structures effectively and efficiently, we propose an unsupervised segmentation algorithm based on a novel objective: transition entropy. We use ground-truth segmentations to assess the quality of the segmented words and further demonstrate the benefit of jointly leveraging words and subwords for distributed word representations.

The second granularity we study is the phrase level, where the phrase mining task transforms raw unstructured text from a fine-grained sequence of words into a coarser-granularity sequence of single- and multi-word phrases. The motivation is that human language often contains idiomatic multi-word expressions for which individual words fail to capture the right semantic granularity; proper phrasal segmentation can recover the appropriate semantic units. To address this problem, we propose an unsupervised phrase mining algorithm based on frequent, significant, contiguous text patterns. We use human evaluation to assess the quality of the mined phrases and demonstrate the benefit of pre-mining phrases on a downstream topic-modeling task.

The third granularity we study is the sentence level. We motivate the need for sentence-level granularity to capture longer, semantically complete spans of text. We introduce several downstream tasks that leverage sentence representations in conjunction with finer-grained units in a cross-lingual text mining setting. We experimentally show how sentence-level cross-lingual embeddings can be used to identify cross-lingual document pairs and parallel sentences – data necessary for training machine translation models.
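The abstract does not spell out the transition-entropy objective itself. As a rough illustration of the general idea behind entropy-based unsupervised segmentation (not the dissertation's actual algorithm), the sketch below computes the entropy of the next-character distribution after each word prefix and cuts a word where that entropy spikes: inside a morpheme the next character is predictable (low entropy), while at a morpheme boundary many continuations are possible (high entropy). The function names and the threshold are hypothetical choices for this sketch.

```python
import math
from collections import defaultdict

def branching_entropy(vocab):
    """For each prefix seen in the vocabulary, compute the entropy of the
    distribution over the character that follows it (frequency-weighted)."""
    nxt = defaultdict(lambda: defaultdict(int))
    for word, count in vocab.items():
        for i in range(len(word)):
            nxt[word[:i]][word[i]] += count
    ent = {}
    for prefix, chars in nxt.items():
        total = sum(chars.values())
        ent[prefix] = -sum((c / total) * math.log2(c / total)
                           for c in chars.values())
    return ent

def segment(word, ent, threshold=1.0):
    """Split a word at positions where the prefix's branching entropy
    exceeds the threshold, i.e. where the continuation is unpredictable."""
    cuts = [0]
    for i in range(1, len(word)):
        if ent.get(word[:i], 0.0) > threshold:
            cuts.append(i)
    cuts.append(len(word))
    return [word[a:b] for a, b in zip(cuts, cuts[1:]) if a < b]

# Toy corpus: after the prefix "play" the next character varies
# (i/e/s), so entropy is high there and the word splits into a
# stem and a suffix.
vocab = {"playing": 5, "played": 5, "player": 5, "plays": 5,
         "walking": 3, "walked": 3}
ent = branching_entropy(vocab)
print(segment("playing", ent))  # ['play', 'ing']
```

In practice an approach like this would operate over a large corpus and combine forward and backward entropies; the toy vocabulary here is only large enough to make the boundary after "play" visible.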
Issue Date: 2020-05-06
Rights Information: Copyright 2020 Ahmed El-Kishky
Date Available in IDEALS: 2020-08-26
Date Deposited: 2020-05
