Files in this item

File: GUPTA-THESIS-2019.pdf (3MB), restricted to U of Illinois
Description: (no description provided)
Format: PDF (application/pdf)

Description

Title: The surprising effectiveness of explicit semantic analysis in dataless classification
Author(s): Gupta, Shashank
Advisor(s): Roth, Dan
Department / Program: Computer Science
Discipline: Computer Science
Degree Granting Institution: University of Illinois at Urbana-Champaign
Degree: M.S.
Genre: Thesis
Subject(s): ESA
Dataless Classification
Embeddings
Unsupervised Learning
EntityESA
Entity2Vec
Topic2Vec
Word2Concept
Abstract: Organizing textual content under broad labels is one of the most basic tasks that some people carry out on a regular basis. This simple task helps people navigate large document collections by exposing the labels of the documents, which can then be used to select the documents of interest. Currently, the most popular techniques for providing this basic functionality are supervised: someone has to annotate a collection of documents with the labels of interest. However, it is not always possible to create a sizeable labeled dataset for every scenario or domain of interest. Techniques such as “Dataless Classification” have therefore been proposed that bootstrap the creation of a classifier by requiring only semantic descriptions of the labels. Despite the encouraging performance of Dataless Classification on text classification tasks, there is still considerable room for improvement. In this thesis, we identify the limitations of ESA-driven Dataless Classification and systematically design techniques for addressing each limitation. In the process, we develop four new embeddings: EntityESA, Entity2Vec, Topic2Vec and Word2Concept. However, despite our best efforts, we found it difficult to outperform the original Dataless Classification system. For some of the techniques we provide an explanation for this observed behavior; we also attribute some of these observations to the datasets used for evaluation. We then propose a way to create a new dataset that can be used for future Dataless evaluations. The new embedding methods proposed in this work are generic enough to be of independent interest as well.
Issue Date: 2019-07-16
Type: Text
URI: http://hdl.handle.net/2142/105826
Rights Information: Copyright 2019 Shashank Gupta
Date Available in IDEALS: 2019-11-26
Date Deposited: 2019-08

