Files in this item



application/pdf: XU-DISSERTATION-2020.pdf (4MB)


Title: New capabilities for large-scale exploratory data analysis
Author(s): Xu, Liqi
Director of Research: Parameswaran, Aditya
Doctoral Committee Chair(s): Parameswaran, Aditya
Doctoral Committee Member(s): Zhai, ChengXiang; Tao, Xie; Cole, Richard L.
Department / Program: Computer Science
Discipline: Computer Science
Degree Granting Institution: University of Illinois at Urbana-Champaign
Subject(s): Exploratory Data Analysis; Data Management
Abstract: The ever-rising diversity of data generated, manipulated, and analyzed every day engenders a variety of data formats, ranging from one fixed dataset to multiple versions of a dataset stored across multiple data sources. This variety of formats has led to substantial challenges in data exploration. Existing systems do not effectively support querying capabilities across these formats: (i) Browsing: When exploring a single dataset, data scientists often need to examine a collection of records that satisfy arbitrary predicates. However, current exploratory data analysis tools mainly focus on visual summarization over browsing. (ii) Versioning: With the proliferation of dataset versions generated during different stages of exploration, exploratory data analysis is no longer just about exploring one static dataset. Instead, data scientists need to keep track of massive numbers of versions, as well as search for versions with specific criteria. (iii) Integrating: Nowadays, datasets are collected and stored at multiple sources (e.g., as part of the IoT). When exploring data, data scientists often need to query and join data across databases at disparate locations. In this dissertation, we propose systems that enable query capabilities to efficiently and effectively fulfill these new demands in data exploration. (i) For browsing, we develop NEEDLETAIL, a data exploration engine that employs a light-weight indexing structure along with efficient algorithms to retrieve any-k valid records for arbitrary queries as quickly as possible. (ii) For versioning, we implement and open-source ORPHEUSDB, a dataset version control system that can efficiently track and query across dataset versions. Since versioning queries in ORPHEUSDB take advantage of array operators in relational database systems, we also conduct an extensive experimental study on understanding array implementations in modern database systems. (iii) For integrating, we leverage machine learning techniques to optimize federated query processing and eventually improve the interactivity of data exploration across disparate databases.
Issue Date: 2020-05-04
Rights Information: Copyright 2020 Liqi Xu
Date Available in IDEALS: 2020-08-26
Date Deposited: 2020-05
