Cloud systems management with efficient and robust online learning

Qiu, Haoran

Permalink

https://hdl.handle.net/2142/125536

Description

Title

Cloud systems management with efficient and robust online learning

Author(s)

Qiu, Haoran

Issue Date

2024-06-25

Director of Research (if dissertation) or Advisor (if thesis)

Iyer, Ravishankar K

Doctoral Committee Chair(s)

Iyer, Ravishankar K

Committee Member(s)

Başar, Tamer
Nahrstedt, Klara
Gupta, Indranil
Mutlu, Onur

Department of Study

Computer Science

Discipline

Computer Science

Degree Granting Institution

University of Illinois at Urbana-Champaign

Degree Name

Ph.D.

Degree Level

Dissertation

Keyword(s)

Cloud Computing
Distributed Systems
Machine Learning Systems
Resource Management
Quality-of-Service
Reliability
Microservices
Serverless Computing

Abstract

Large-scale cloud computing systems rely heavily on decision-making algorithms for critical system management tasks such as resource allocation, job scheduling, and power management. Manually crafted heuristics for these algorithms become increasingly untenable given the complexity and heterogeneity of modern cloud environments: the intricate interactions across diverse workloads, hardware platforms, and operating conditions make it exceedingly difficult to devise fixed heuristics that work well across all scenarios. Although machine learning techniques have been proposed to automatically learn optimized system management policies, existing approaches face practical limitations and lack the robustness necessary for production-grade cloud systems. This dissertation pioneers a novel abstraction-driven paradigm of efficient and robust online learning to fundamentally transform cloud systems management at scale. We develop a general framework that leverages deep reinforcement learning with system domain knowledge at its core to discover optimized management policies by continuously exploring and refining them through in situ interactions with cloud environments. To practically and robustly apply learning in cloud systems at scale, we further design two key abstractions: (1) a virtual agent abstraction that coordinates the distributed learned policies to resolve multi-agent interferences and stably converge on system-wide objectives, and (2) a meta learner abstraction that extracts generalizable policy embeddings that can rapidly adapt across the breadth of heterogeneous cloud applications and platforms. This abstraction-driven approach provides practical and extensible support for a wide range of online learning algorithms and diverse systems management tasks.
Instantiated in systems like FIRM (for microservices), SIMPPO (for serverless computing), and μ-Serve (for deep learning model serving), our innovative framework delivers order-of-magnitude improvements in resource efficiency, performance isolation, power optimization, and generalization compared to traditional heuristic-driven approaches. More profoundly, it establishes the foundations for practical and robust autonomous cloud systems management. Our contributions span the full stack, from mathematical models and optimizations to system design, implementation, and deployment.

Graduation Semester

2024-08

Type of Resource

Thesis

Handle URL

https://hdl.handle.net/2142/125536

Owning Collections

Graduate Dissertations and Theses at Illinois (primary)

Graduate Theses and Dissertations at Illinois
