Multi-dimensional mining of unstructured data with limited supervision

Zhang, Chao

Permalink

https://hdl.handle.net/2142/102465

Description

Title

Multi-dimensional mining of unstructured data with limited supervision

Author(s)

Zhang, Chao

Issue Date

2018-12-03

Director of Research (if dissertation) or Advisor (if thesis)

Han, Jiawei

Doctoral Committee Chair(s)

Han, Jiawei

Committee Member(s)

Zhai, ChengXiang
Abdelzaher, Tarek
Mei, Qiaozhu

Department of Study

Computer Science

Discipline

Computer Science

Degree Granting Institution

University of Illinois at Urbana-Champaign

Degree Name

Ph.D.

Degree Level

Dissertation

Date of Ingest

2019-02-06T19:36:25Z

Keyword(s)

data mining
multi-dimensional analysis
less supervision

Abstract

As one of the most important data forms, unstructured text data plays a crucial role in data-driven decision making in domains ranging from social networking and information retrieval to healthcare and scientific research. In many emerging applications, people's information needs from text data are becoming multi-dimensional: they demand useful insights for multiple aspects from the given text corpus. However, turning massive text data into multi-dimensional knowledge remains a challenge that cannot be readily addressed by existing data mining techniques.
In this thesis, we propose algorithms that turn unstructured text data into multi-dimensional knowledge with limited supervision. We investigate two core questions:
1. How to identify task-relevant data with declarative queries in multiple dimensions?
2. How to distill knowledge from data in a multi-dimensional space?
To address the above questions, we propose an integrated cube construction and exploitation framework. First, we develop a cube construction module that organizes unstructured data into a cube structure, by discovering latent multi-dimensional and multi-granular structure from the unstructured text corpus and allocating documents into the structure. Second, we develop a cube exploitation module that models multiple dimensions in the cube space, thereby distilling multi-dimensional knowledge from data to provide insights along multiple dimensions. Together, these two modules constitute an integrated pipeline: leveraging the cube structure, users can perform multi-dimensional, multi-granular data selection with declarative queries; and with cube exploitation algorithms, users can make accurate cross-dimension predictions or extract multi-dimensional patterns for decision making.
The proposed framework has two distinctive advantages when turning text data into multi-dimensional knowledge: flexibility and label-efficiency. First, it enables acquiring multi-dimensional knowledge flexibly, as the cube structure allows users to easily identify task-relevant data along multiple dimensions at varied granularities and further distill multi-dimensional knowledge. Second, the algorithms for cube construction and exploitation require little supervision; this makes the framework appealing for many applications where labeled data are expensive to obtain.
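
To make the idea of multi-dimensional, multi-granular data selection concrete, the following is a minimal sketch of a toy text cube. It is not the dissertation's algorithm: the class `TextCube`, the helper `rolls_up_to`, the `PARENT` hierarchies, and the document labels are all hypothetical, and the labels are assigned by hand here, whereas the thesis discovers the structure and allocates documents with limited supervision.

```python
from collections import defaultdict

# Hypothetical dimension hierarchies: each fine-grained label maps to its parent.
PARENT = {
    "topic": {"basketball": "sports", "tennis": "sports", "election": "politics"},
    "location": {"Chicago": "USA", "New York": "USA", "Paris": "France"},
}


def rolls_up_to(dim, label, target):
    """Return True if label equals target or has target as an ancestor in dim."""
    while label is not None:
        if label == target:
            return True
        label = PARENT[dim].get(label)  # climb one level; None at the root
    return False


class TextCube:
    """Toy cube: each cell is keyed by one fine-grained label per dimension."""

    def __init__(self, dims):
        self.dims = tuple(dims)
        self.cells = defaultdict(list)  # cell key -> list of document ids

    def allocate(self, doc_id, **labels):
        # Place a document into the cell given by its label in every dimension.
        key = tuple(labels[d] for d in self.dims)
        self.cells[key].append(doc_id)

    def query(self, **constraints):
        # Declarative selection: keep cells whose labels roll up to every
        # constraint; unconstrained dimensions match any label.
        selected = []
        for key, docs in self.cells.items():
            if all(
                rolls_up_to(d, key[i], constraints[d])
                for i, d in enumerate(self.dims)
                if d in constraints
            ):
                selected.extend(docs)
        return sorted(selected)


# Hand-labeled example documents (assumed labels, for illustration only).
cube = TextCube(["topic", "location"])
cube.allocate("d1", topic="basketball", location="Chicago")
cube.allocate("d2", topic="tennis", location="Paris")
cube.allocate("d3", topic="election", location="New York")

coarse = cube.query(topic="sports")                  # coarse granularity, one dimension
mixed = cube.query(topic="sports", location="USA")   # two dimensions at once
fine = cube.query(location="Chicago")                # fine granularity
```

The same `query` call works at any level of a hierarchy because a constraint is matched against a cell label and all of its ancestors, which is one simple way to express the "varied granularities" mentioned in the abstract.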

Graduation Semester

2018-12

Type of Resource

text

Owning Collections

Graduate Dissertations and Theses at Illinois (primary collection)

Graduate Theses and Dissertations at Illinois

Dissertations and Theses - Computer Science

Dissertations and Theses from the Siebel School of Computer Science
