SLO-aware optimization and stateful orchestration for LLM systems
Wu, Zhiyu
Permalink
https://hdl.handle.net/2142/132593
Description
- Title
- SLO-aware optimization and stateful orchestration for LLM systems
- Author(s)
- Wu, Zhiyu
- Issue Date
- 2025-12-09
- Advisor
- Lai, Fan
- Department of Study
- Siebel School of Computing and Data Science
- Discipline
- Computer Science
- Degree Granting Institution
- University of Illinois Urbana-Champaign
- Degree Name
- M.S.
- Degree Level
- Thesis
- Keyword(s)
- Large Language Models
- Multi-Agent Systems
- Agentic Workflow
- SLO-Aware
- Scheduling
- JVM
- Abstract
- The rapid evolution of Large Language Models (LLMs) has shifted the focus of AI infrastructure from simple text generation to complex, multi-turn agentic workflows. As these applications become increasingly sensitive to latency and dependencies, existing serving systems—which primarily optimize for aggregate throughput—fail to meet application-specific Service Level Objectives (SLOs). Furthermore, as workloads evolve into multi-agent systems (MAS), the lack of robust state management and error recovery in current runtimes creates a bottleneck for reliable orchestration. This thesis addresses these challenges by proposing a comprehensive optimization of the LLM runtime stack. First, we present \name, an SLO-aware serving system designed to maximize service "goodput" (the rate of requests served within strict performance goals) under imprecise request information. \name employs a novel iterative scheduling algorithm and Criticality-Aware Length Matching (CALM) to dynamically refine resource allocation as generation progresses. Evaluation across diverse realistic workloads, including chat, deep research, and agentic pipelines, demonstrates that \name improves service goodput by 1.4×–6.3× and achieves 28.5%–83.2% resource savings compared to state-of-the-art designs. Building upon this optimized serving layer, the thesis concludes by exploring the future of Stateful Agent Orchestration. We propose the design of an ML Agent Compiler, a runtime environment akin to a JVM for agents. This proposed framework addresses the limitations of current stateless orchestration by introducing graph-based checkpointing, forking engines, and deduplication of partial executions. Together, these works chart a path toward a unified, efficient, and fault-tolerant infrastructure for the next generation of AI applications.
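The abstract defines service "goodput" as the rate of requests served within strict performance goals. As a minimal sketch of that metric (not code from the thesis — the `Request` record, field names, and `goodput` helper are illustrative assumptions), one might compute it from completed requests and their per-request SLOs:

```python
from dataclasses import dataclass

@dataclass
class Request:
    latency_ms: float  # observed end-to-end latency for this request
    slo_ms: float      # this request's latency objective (SLO)

def goodput(completed: list[Request], window_s: float) -> float:
    """Requests per second that finished within their SLO.

    Unlike raw throughput (len(completed) / window_s), requests that
    violate their SLO contribute nothing to goodput.
    """
    met = sum(1 for r in completed if r.latency_ms <= r.slo_ms)
    return met / window_s

# Example: three requests complete in a 1-second window;
# two meet their SLO, so goodput is 2.0 req/s while throughput is 3.0.
reqs = [Request(180, 200), Request(250, 200), Request(90, 100)]
print(goodput(reqs, window_s=1.0))  # → 2.0
```

An SLO-aware scheduler in the spirit described above would then allocate resources to maximize this quantity rather than aggregate throughput, deprioritizing requests that can no longer meet their objectives.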
- Graduation Semester
- 2025-12
- Type of Resource
- Thesis
- Copyright and License Information
- Copyright 2025 Zhiyu Wu
Owning Collections
Graduate Dissertations and Theses at Illinois (primary)