Self-tuning data exploration checkpoint

Chockchowwat, Supawit

Permalink

https://hdl.handle.net/2142/129394

Description

Title

Self-tuning data exploration checkpoint

Author(s)

Chockchowwat, Supawit

Issue Date

2025-04-14

Director of Research (if dissertation) or Advisor (if thesis)

Park, Yongjoo

Doctoral Committee Chair(s)

Park, Yongjoo

Committee Member(s)

Sundaram, Hari
Gupta, Indranil
Özcan, Fatma

Department of Study

Siebel School of Computing and Data Science

Discipline

Computer Science

Degree Granting Institution

University of Illinois Urbana-Champaign

Degree Name

Ph.D.

Degree Level

Dissertation

Keyword(s)

Information Retrieval
Inverted Index
IoU Sketch
Multi-Layer Hash Table
Physical Database Design
Separation of Compute and Storage
Sketch Data Structure
Index Tuning
Hierarchical Index Design
I/O-Aware Indexing
Storage-Aware Optimization
Graph-Based Optimization
End-to-End Latency Minimization
Learned Indexes
Storage Profiling
Data Management
Transactional Python
Chipmink
Airphant
AirIndex
Data Exploration
Exploratory Data Analysis
Interactive Data Exploration
Data Science
Checkpoint
Checkpointing
Restoration
Object Store
Object Databases
Graph-Based Versioning
Object-Level Versioning
Fine-Grained Version Control
Podding
Delta Encoding
Mutable Object Storage
Data Lineage
Semantic Versioning
Version Navigation
Temporal Database
Branching
In-Memory State Management
Immutable Snapshots
Time-Travel Queries
Storage Efficiency
Incremental Checkpointing
Version Indexing
Change Tracking
Non-Linear Histories
Lightweight Snapshotting
Data Reproducibility
Cloud Storage
Database as a Service
Indexing

Language

eng

Abstract

Interactive data exploration has become a cornerstone of modern data science, empowering analysts and researchers to iteratively develop insights using computational notebooks such as Jupyter. However, these exploratory workflows often lack robust mechanisms to persist and restore program state, which leads to state loss, redundant recomputation, and inefficient trial-and-error cycles. Current data exploration tools provide limited support for systematic state checkpointing and restoration, and naive implementations impose heavy computational and storage costs. This thesis argues that it is possible to efficiently checkpoint and restore exploration states with self-tuning data systems. We identify two key challenges in this domain: (1) the high overhead and interruptions caused by capturing and storing exploration states, and (2) the delays and downtimes incurred when restoring fragmented checkpoint data from diverse storage environments. To address these challenges, this thesis presents three core systems that together enable practical and performant checkpointing. First, we introduce Chipmink, a delta object store that leverages graph-based dirty object identification to capture fine-grained state changes with minimal overhead. Chipmink introduces techniques such as podding, learned volatility models, and asynchronous checkpointing, reducing checkpointing time to seconds, compared with minutes for existing solutions, and storage consumption to gigabytes at most, compared with hundreds of gigabytes. Second, we present Airphant, an automatically tuned filtering system that accelerates fragment retrieval in high-latency cloud storage. By using a novel IoU Sketch filter, Airphant supports concurrent batched I/O operations, reducing data retrieval latency to within hundreds of milliseconds even on cloud storage. Third, we develop AirIndex, an automatically tuned hierarchical index that learns from data and I/O characteristics to optimize lookup paths dynamically. AirIndex formulates an index search problem over a large design space and leverages a purpose-built search algorithm to select optimal index configurations. It delivers faster lookup performance than traditional and learned indexes. Together, these systems demonstrate that self-tuning data structures can overcome the performance and cost barriers of checkpointing and restoration in interactive data exploration. This thesis contributes new methods and system designs that enable resilient, efficient, and user-transparent state management for computational notebooks, paving the way toward more interactive, fault-tolerant, and productive data science workflows.

Graduation Semester

2025-05

Type of Resource

Thesis

Handle URL

https://hdl.handle.net/2142/129394

Owning Collections

Graduate Dissertations and Theses at Illinois (primary)

Graduate Theses and Dissertations at Illinois

Dissertations and Theses - Computer Science

Dissertations and Theses from the Siebel School of Computer Science
