Cloud systems management with efficient and robust online learning
Qiu, Haoran
Permalink
https://hdl.handle.net/2142/125536
Description
- Title
- Cloud systems management with efficient and robust online learning
- Author(s)
- Qiu, Haoran
- Issue Date
- 2024-06-25
- Director of Research (if dissertation) or Advisor (if thesis)
- Iyer, Ravishankar K
- Doctoral Committee Chair(s)
- Iyer, Ravishankar K
- Committee Member(s)
- Başar, Tamer
- Nahrstedt, Klara
- Gupta, Indranil
- Mutlu, Onur
- Department of Study
- Computer Science
- Discipline
- Computer Science
- Degree Granting Institution
- University of Illinois at Urbana-Champaign
- Degree Name
- Ph.D.
- Degree Level
- Dissertation
- Keyword(s)
- Cloud Computing
- Distributed Systems
- Machine Learning Systems
- Resource Management
- Quality-of-Service
- Reliability
- Microservices
- Serverless Computing
- Abstract
- Large-scale cloud computing systems rely heavily on decision-making algorithms for critical system management tasks such as resource allocation, job scheduling, and power management. Manually crafted heuristics for these algorithms become increasingly untenable given the complexity and heterogeneity of modern cloud environments because the intricate interactions across diverse workloads, hardware platforms, and operating conditions make it exceedingly difficult to devise fixed heuristics that work well across all scenarios. Although machine learning techniques have been proposed to automatically learn optimized system management policies, existing approaches face practical limitations and lack the robustness necessary for production-grade cloud systems. This dissertation pioneers a novel abstraction-driven paradigm of efficient and robust online learning to fundamentally transform cloud systems management at scale. We develop a general framework that leverages deep reinforcement learning with system domain knowledge at its core to discover optimized management policies by continuously exploring and refining them through in situ interactions with cloud environments. To practically and robustly apply learning in cloud systems at scale, we further design two key abstractions: (1) A virtual agent abstraction that coordinates the distributed learned policies to resolve multi-agent interferences and stably converge on system-wide objectives, and (2) A meta learner abstraction that extracts generalizable policy embeddings that can rapidly adapt across the breadth of heterogeneous cloud applications and platforms. This abstraction-driven approach provides practical and extensible support for a wide range of online learning algorithms and diverse systems management tasks. 
Instantiated in systems like FIRM (for microservices), SIMPPO (for serverless computing), and μ-Serve (for deep learning model serving), our innovative framework delivers order-of-magnitude improvements in resource efficiency, performance isolation, power optimization, and generalization compared to traditional heuristic-driven approaches. More profoundly, it establishes the foundations for practical and robust autonomous cloud systems management. Our contributions span the full stack, from mathematical models and optimizations to system design, implementation, and deployment.
- Graduation Semester
- 2024-08
- Type of Resource
- Thesis
- Handle URL
- https://hdl.handle.net/2142/125536
- Copyright and License Information
- Copyright 2024 Haoran Qiu
Owning Collections
Graduate Dissertations and Theses at Illinois PRIMARY
Graduate Theses and Dissertations at Illinois