Self-tuning data exploration checkpoint
Chockchowwat, Supawit
Permalink
https://hdl.handle.net/2142/129394
Description
- Title
- Self-tuning data exploration checkpoint
- Author(s)
- Chockchowwat, Supawit
- Issue Date
- 2025-04-14
- Director of Research (if dissertation) or Advisor (if thesis)
- Park, Yongjoo
- Doctoral Committee Chair(s)
- Park, Yongjoo
- Committee Member(s)
- Sundaram, Hari
- Gupta, Indranil
- Özcan, Fatma
- Department of Study
- Siebel School Comp & Data Sci
- Discipline
- Computer Science
- Degree Granting Institution
- University of Illinois Urbana-Champaign
- Degree Name
- Ph.D.
- Degree Level
- Dissertation
- Keyword(s)
- Transactional Python
- Chipmink
- Airphant
- AirIndex
- Data Exploration
- Exploratory Data Analysis
- Interactive Data Exploration
- Data Science
- Checkpoint
- Checkpointing
- Restoration
- Object Store
- Object Databases
- Graph-Based Versioning
- Object-Level Versioning
- Fine-Grained Version Control
- Podding
- Delta Encoding
- Mutable Object Storage
- Data Lineage
- Semantic Versioning
- Version Navigation
- Temporal Database
- Branching
- In-Memory State Management
- Immutable Snapshots
- Time-Travel Queries
- Storage Efficiency
- Incremental Checkpointing
- Version Indexing
- Change Tracking
- Non-Linear Histories
- Lightweight Snapshotting
- Data Reproducibility
- Cloud Storage
- Database as a Service
- Indexing
- Information Retrieval
- Inverted Index
- IoU Sketch
- Multi-Layer Hash Table
- Physical Database Design
- Separation of Compute and Storage
- Sketch Data Structure
- Index Tuning
- Hierarchical Index Design
- I/O-Aware Indexing
- Storage-Aware Optimization
- Graph-Based Optimization
- End-to-End Latency Minimization
- Learned Indexes
- Storage Profiling
- Data Management
- Abstract
- Interactive data exploration has become a cornerstone of modern data science, empowering analysts and researchers to iteratively develop insights using computational notebooks such as Jupyter. However, these exploratory workflows often suffer from a lack of robust mechanisms to persist and restore program state, leading to risks of state loss, redundant recomputation, and inefficient trial-and-error cycles. Current data exploration tools provide limited support for systematic state checkpointing and restoration, imposing heavy computational and storage costs when naively implemented. This thesis argues that it is possible to efficiently checkpoint and restore exploration states with self-tuning data systems. We identify two key challenges in this domain: (1) the high overhead and interruptions caused by capturing and storing exploration states, and (2) the delays and downtimes incurred when restoring fragmented checkpoint data from diverse storage environments. To address these challenges, this thesis presents three core systems that together enable practical and performant checkpointing. First, we introduce Chipmink, a delta object store that leverages graph-based dirty object identification to capture fine-grained state changes with minimal overhead. Chipmink introduces techniques such as podding, learned volatility models, and asynchronous checkpointing that significantly reduce checkpointing time to seconds, compared to minutes for existing solutions, and storage consumption to at most GBs, compared to hundreds of GBs. Second, we present Airphant, an automatically tuned filtering system that accelerates fragment retrieval in high-latency cloud storage. By using a novel IoU Sketch filter, Airphant supports concurrent batched I/O operations, reducing data retrieval latency to within hundreds of milliseconds even on cloud storage.
Third, we develop AirIndex, an automatically tuned hierarchical index that learns from data and I/O characteristics to optimize lookup paths dynamically. AirIndex formulates an index search problem over a large design space and leverages a purpose-built search algorithm to select optimal index configurations. It delivers faster lookup performance than traditional and learned indexes. Together, these systems demonstrate that self-tuning data structures can overcome the performance and cost barriers of checkpointing and restoration in interactive data exploration. This thesis contributes new methods and system designs that enable resilient, efficient, and user-transparent state management for computational notebooks, paving the way toward more interactive, fault-tolerant, and productive data science workflows.
- Graduation Semester
- 2025-05
- Type of Resource
- Thesis
- Handle URL
- https://hdl.handle.net/2142/129394
- Copyright and License Information
- Copyright 2025 Supawit Chockchowwat
Owning Collections
Graduate Dissertations and Theses at Illinois PRIMARY
Graduate Theses and Dissertations at Illinois