Files in this item



MISHRA-DISSERTATION-2020.pdf (application/pdf, 6MB)


Title: Information extraction from digital social trace data with applications to social media and scholarly communication data
Author(s): Mishra, Shubhanshu
Director of Research: Diesner, Jana
Doctoral Committee Chair(s): Diesner, Jana
Doctoral Committee Member(s): Torvik, Vetle I; Karahalios, Karrie G; Brunner, Robert J
Department / Program: Information Sciences
Discipline: Library & Information Science
Degree Granting Institution: University of Illinois at Urbana-Champaign
Subject(s): Social Media Analysis
Machine Learning
Data Mining
Scholarly Data Analysis
Digital Libraries
Computer Science
Information Science
Open Source
Multi task learning
Deep Learning
Active Learning
Natural Language Processing
Big Data Analysis
Abstract: Information extraction (IE) aims to extract structured data from unstructured or semi-structured data. The thesis starts by identifying social media data and scholarly communication data as special cases of digital social trace data (DSTD). This identification allows us to utilize the graph structure of the data (e.g., user connected to a tweet, author connected to a paper, author connected to other authors) for developing new information extraction tasks. The thesis focuses on information extraction from DSTD, first using only the text data from tweets and scholarly paper abstracts, and then using the full graph structure of Twitter and scholarly communication datasets. This thesis makes three major contributions. First, new IE tasks based on the DSTD representation of the data are introduced. For scholarly communication data, methods are developed to identify article-level and author-level novelty and expertise. Furthermore, interfaces for examining the extracted information are introduced. A social communication temporal graph (SCTG) is introduced for comparing different kinds of communication data, such as tweets tagged with sentiment, tweets about a search query, and Facebook group posts. For social media, new text classification categories are introduced, with the aim of identifying enthusiastic and supportive users via their tweets. Additionally, the correlation between sentiment classes and Twitter metadata in public corpora is analyzed, leading to the development of a better model for sentiment classification. Second, methods are introduced for extracting information from social media and scholarly data. For scholarly data, a semi-automatic method is introduced for the construction of a large-scale taxonomy of computer science concepts. The method relies on the Wikipedia category tree. The constructed taxonomy is used for identifying key computer science phrases in scholarly papers and tracking their evolution over time.
Similarly, for social media data, machine learning models based on human-in-the-loop learning, semi-supervised learning, and multi-task learning are introduced for identifying sentiment, named entities, part-of-speech tags, phrase chunks, and super-sense tags. The machine learning models are developed with a focus on leveraging all available data. The multi-task models presented here achieve competitive performance against other methods for most of the tasks, while reducing inference-time computational costs. Finally, this thesis has resulted in the creation of multiple open source tools and public data sets, which can be utilized by the research community. The thesis aims to act as a bridge between research questions and techniques used in DSTD from different domains. The methods and tools presented here can help advance work in the areas of social media and scholarly data analysis. All resources related to this thesis are available at
Issue Date: 2020-05-03
Rights Information: Copyright 2020 by Shubhanshu Mishra. All rights reserved.
Date Available in IDEALS: 2020-08-26
Date Deposited: 2020-05