Files in this item

LOURENTZOU-DISSERTATION-2019.pdf (application/pdf, 4 MB), Restricted to U of Illinois (no description provided)
Title: Data quality in the deep learning era: Active semi-supervised learning and text normalization for natural language understanding
Author(s): Lourentzou, Ismini
Director of Research: Zhai, ChengXiang
Doctoral Committee Chair(s): Zhai, ChengXiang
Doctoral Committee Member(s): Hockenmaier, Julia; Peng, Jian; Gruhl, Daniel
Department / Program: Computer Science
Discipline: Computer Science
Degree Granting Institution: University of Illinois at Urbana-Champaign
Subject(s): deep learning
machine learning
active learning
semi-supervised learning
text normalization
lexical normalization
sequence to sequence
relation extraction
neural networks
natural language processing
data quality
self-paced learning
Abstract: Deep Learning, a growing sub-field of machine learning, has been applied with tremendous success in a variety of domains, opening opportunities for achieving human-level performance in many applications. However, Deep Learning methods depend on large quantities of data with millions of annotated instances. While well-formed academic datasets have helped advance supervised learning research, in the real world we are daily deluged by massive amounts of unstructured data that remain unusable for current supervised learning approaches, as only a small portion is labeled, cleaned, or structured. For a machine learning model to be effective, volume is not the only necessary data dimension. Quality is equally important and has proven to be a critical factor for the success of industrial applications of machine learning. According to IBM, poor data quality can cost more than 3 trillion US dollars per year for the US market alone. Inspired by the need for advanced methods that can efficiently address such bottlenecks, we develop machine learning techniques that improve data quality in both data-related dimensions: the input and the output space. Having a set of labeled examples that captures the task characteristics is one of the most important prerequisites for successfully applying machine learning. As such, we first focus on minimizing the annotation effort for any arbitrary user-defined task by exploring active learning methods. We show that the best-performing active learning strategy depends on the task at hand, and we propose a combination of active learners that maximizes annotation performance early in the process. We demonstrate the viability of the approach on several relation extraction tasks. Next, we observe that even though our method can speed up the collection of labeled training data, the rest of the data will remain unlabeled and thus unexploited.
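The active learning setting described above selects the most informative unlabeled examples for human annotation. As a minimal sketch of one standard strategy such a combined learner could draw on, the snippet below implements least-confidence (uncertainty) sampling; the function name and toy data are illustrative, not from the dissertation.

```python
import numpy as np

def least_confidence_sample(probs: np.ndarray, k: int) -> np.ndarray:
    """Pick the k unlabeled examples whose top predicted probability
    is lowest, i.e. where the model is least confident."""
    confidence = probs.max(axis=1)     # top class probability per example
    return np.argsort(confidence)[:k]  # indices of the k least confident

# Toy predicted class distributions for 4 unlabeled examples.
probs = np.array([[0.90, 0.10],   # confident
                  [0.55, 0.45],   # most uncertain
                  [0.60, 0.40],   # somewhat uncertain
                  [0.99, 0.01]])  # very confident
print(least_confidence_sample(probs, 2))  # → [1 2]
```

In practice, the confidence scores would come from the model being trained, and (as the abstract notes) the best criterion is task-dependent, which motivates combining several such strategies.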
Semi-supervised learning methods proposed in the literature can utilize additional unlabeled data; however, they are typically compared on computer vision datasets such as CIFAR-10. Here, we perform a systematic exploration of several semi-supervised methods on three sequence labeling tasks and two classification tasks. Additionally, most methods make assumptions that are less suitable for realistic scenarios. For example, methods proposed in the recent literature treat all unlabeled examples equally. Yet, in many cases we would like to sort out examples that might be less useful or confusing, particularly in noisy settings where examples with low training loss or high confidence are more likely to be clean. In addition, most methods assume that the unlabeled data can be classified into the same classes as the labeled data. This does not take into consideration the very possible scenario of out-of-class instances. For example, our classifier may be distinguishing cats from dogs, but the unlabeled examples may contain additional classes, such as shells, butterflies, etc. To this end, we design methods that mitigate these issues, with a re-weighting mechanism that can be incorporated into any consistency-based regularizer. Both active and semi-supervised learning methods aim to reduce labeling effort by either automatically expanding the training set or selecting the most informative examples for human annotation. However, bootstrapping approaches often have negative effects on NLP tasks due to the addition of falsely labeled instances. We address the challenge of producing good-quality proxy labels by leveraging the continuously growing stream of human annotations. We introduce a calibration of semi-supervised active learning in which the confidence of the classifier is weighted by an auxiliary neural model that removes incorrectly labeled instances and dynamically adjusts the number of proxy labels included in each iteration.
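To make the re-weighting idea concrete, the sketch below shows one way per-example weights could attach to a consistency regularizer (the penalty on prediction disagreement between a clean and a perturbed view of the same unlabeled input). The confidence-based weighting scheme here is an illustrative stand-in, not the dissertation's exact mechanism.

```python
import numpy as np

def confidence_weights(p_clean: np.ndarray, threshold: float = 0.8) -> np.ndarray:
    """Down-weight unlabeled examples whose top predicted probability
    is low, a simple proxy for 'possibly noisy or out-of-class'."""
    conf = p_clean.max(axis=1)
    return np.where(conf >= threshold, 1.0, conf / threshold)

def weighted_consistency_loss(p_clean: np.ndarray,
                              p_noisy: np.ndarray,
                              weights: np.ndarray) -> float:
    """Squared-difference consistency loss between the two views,
    re-weighted per example so doubtful instances contribute less."""
    per_example = ((p_clean - p_noisy) ** 2).sum(axis=1)
    return float((weights * per_example).mean())
```

Because the weights multiply the per-example loss terms, the scheme plugs into any consistency-based regularizer without changing the rest of the training objective, which is the property the abstract emphasizes.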
Experimental results show that our strategy outperforms baselines that combine traditional active learning with self-training. We have explored various ways to improve the output space of examples, but the input representation is equally important. Particularly for social media (the most abundant source of raw data nowadays), informal writing can cause several bottlenecks. For example, most Information Extraction (IE) tools rely on accurate understanding of text and struggle with the noisy and informal nature of social media due to high out-of-vocabulary (OOV) word rates. In this work, we design a hybrid word-character attention-based encoder-decoder model for social media text normalization that can serve as a pre-processing step for any off-the-shelf NLP tool, adapting it to noisy social media text. Our model surpasses baseline neural models designed for text normalization and achieves performance comparable to state-of-the-art related work. Although we evaluate on NLP tasks, all methods developed are fairly general and can be applied to other supervised machine learning tasks in need of techniques that create meaningful data representations and simultaneously reduce the burden and cost of human annotation.
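The hybrid word-character design can be illustrated with a toy sketch: known tokens are rewritten at the word level, while OOV tokens fall back to a character-level transformation. In the dissertation the character-level component is a neural attention-based encoder-decoder; here a simple regex and a hand-made lexicon stand in for both, purely for illustration.

```python
import re

# Toy lookup table for in-vocabulary slang (illustrative only).
LEXICON = {"c": "see", "u": "you", "2morrow": "tomorrow"}

def collapse_repeats(token: str) -> str:
    """Toy character-level fallback: collapse runs of 3+ repeated
    characters ('soooo' -> 'so'). A neural character-level decoder
    would take this role in the actual model."""
    return re.sub(r"(.)\1{2,}", r"\1", token)

def normalize(tokens, word_lexicon, char_model):
    """Hybrid normalization: known tokens are rewritten at the word
    level; OOV tokens fall back to the character-level model."""
    return [word_lexicon.get(t, char_model(t)) for t in tokens]

print(normalize("c u 2morrow soooo excited".split(), LEXICON, collapse_repeats))
# → ['see', 'you', 'tomorrow', 'so', 'excited']
```

The normalized output can then be fed to any off-the-shelf NLP tool, which is the pre-processing role the abstract describes.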
Issue Date: 2019-12-05
Rights Information: Copyright 2019 Ismini Lourentzou
Date Available in IDEALS: 2020-03-02
Date Deposited: 2019-12
