Efficient data reconfiguration for today's cloud systems

Ghosh, Mainak

Permalink

https://hdl.handle.net/2142/102420

Description

Title

Efficient data reconfiguration for today's cloud systems

Author(s)

Ghosh, Mainak

Issue Date

2018-11-12

Director of Research (if dissertation) or Advisor (if thesis)

Gupta, Indranil

Doctoral Committee Chair(s)

Gupta, Indranil

Committee Member(s)

Vaidya, Nitin
Olson, Luke
Elmore, Aaron

Department of Study

Computer Science

Discipline

Computer Science

Degree Granting Institution

University of Illinois at Urbana-Champaign

Degree Name

Ph.D.

Degree Level

Dissertation

Date of Ingest

2019-02-06T19:32:48Z

Keyword(s)

reconfiguration
partitioning
replication
prefetching
compaction
nosql databases
interactive analytics engines
tiered storage systems

Abstract

The performance of big data systems relies largely on efficient data reconfiguration techniques. Data reconfiguration operations change configuration parameters that affect data layout in a system. They can be user-initiated, such as changing the shard key or block size in NoSQL databases, or system-initiated, such as changing replication in a distributed interactive analytics engine. Current data reconfiguration schemes are heuristics at best and often do not scale well as data volume grows. As a result, system performance suffers. In this thesis, we show that data reconfiguration mechanisms can be done in the background by using new optimal or near-optimal algorithms and coupling them with performant system designs. We explore four different data reconfiguration operations affecting three popular types of systems: storage, real-time analytics, and batch analytics. In NoSQL databases (storage), we explore new strategies for changing table-level configuration and for compaction, as they improve read/write latencies. In distributed interactive analytics engines, a good replication algorithm can save costs by judiciously using only as much memory as is sufficient to provide the highest throughput and low latency for queries. Finally, in batch processing systems, we explore prefetching and caching strategies that can increase the number of production jobs meeting their SLOs. All these operations happen in the background without affecting the fast path. Our contributions to each of the problems are two-fold: 1) we model the problem and design algorithms inspired by well-known theoretical abstractions, and 2) we design and build a system on top of popular open-source systems used in companies today. Finally, using real-life workloads, we evaluate the efficacy of our solutions. Morphus and Parqua provide several 9s of availability while changing table-level configuration parameters in databases. By halving memory usage in a distributed interactive analytics engine, Getafix reduces the cost of deploying the system by 10 million dollars annually and improves query throughput. We are the first to model the problem of compaction and provide formal bounds on its runtime. Finally, NetCachier helps 30% more production jobs meet their SLOs compared to the existing state of the art.

Graduation Semester

2018-12

Type of Resource

text

Copyright and License Information

2018 Mainak Ghosh

Owning Collections

Graduate Dissertations and Theses at Illinois PRIMARY

Graduate Theses and Dissertations at Illinois

Dissertations and Theses - Computer Science

Dissertations and Theses from the Siebel School of Computer Science
