SLO-aware optimization and stateful orchestration for LLM systems
Wu, Zhiyu
Permalink
https://hdl.handle.net/2142/132593
Description
- Title
- SLO-aware optimization and stateful orchestration for LLM systems
- Author(s)
- Wu, Zhiyu
- Issue Date
- 2025-12-09
- Advisor
- Lai, Fan
- Department of Study
- Siebel School of Computing and Data Science
- Discipline
- Computer Science
- Degree Granting Institution
- University of Illinois Urbana-Champaign
- Degree Name
- M.S.
- Degree Level
- Thesis
- Keyword(s)
- Large Language Models
- Multi-Agent Systems
- Agentic Workflow
- SLO-Aware
- Scheduling
- JVM
- Abstract
- The rapid evolution of Large Language Models (LLMs) has shifted the focus of AI infrastructure from simple text generation to complex, multi-turn agentic workflows. As these applications become increasingly sensitive to latency and dependencies, existing serving systems—which primarily optimize for aggregate throughput—fail to meet application-specific Service Level Objectives (SLOs). Furthermore, as workloads evolve into multi-agent systems (MAS), the lack of robust state management and error recovery in current runtimes creates a bottleneck for reliable orchestration. This thesis addresses these challenges by proposing a comprehensive optimization of the LLM runtime stack. First, we present \name, an SLO-aware serving system designed to maximize service "goodput" (the rate of requests served within strict performance goals) under imprecise request information. \name employs a novel iterative scheduling algorithm and Criticality-Aware Length Matching (CALM) to dynamically refine resource allocation as generation progresses. Evaluation across diverse realistic workloads, including chat, deep research, and agentic pipelines, demonstrates that \name improves service goodput by 1.4×–6.3× and achieves 28.5%–83.2% resource savings compared to state-of-the-art designs. Building upon this optimized serving layer, the thesis concludes by exploring the future of Stateful Agent Orchestration. We propose the design of an ML Agent Compiler, a runtime environment akin to a JVM for agents. This proposed framework addresses the limitations of current stateless orchestration by introducing graph-based checkpointing, forking engines, and deduplication of partial executions. Together, these works chart a path toward a unified, efficient, and fault-tolerant infrastructure for the next generation of AI applications.
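The abstract defines service "goodput" as the rate of requests served within strict performance goals. As a minimal sketch of that metric (not code from the thesis — the `Request` record, field names, and `goodput` helper are illustrative assumptions), one might compute it from completed requests and their per-request SLOs:

```python
from dataclasses import dataclass

@dataclass
class Request:
    latency_ms: float  # observed end-to-end latency for this request
    slo_ms: float      # this request's latency objective (SLO)

def goodput(completed: list[Request], window_s: float) -> float:
    """Requests per second that finished within their SLO.

    Unlike raw throughput (len(completed) / window_s), requests that
    violate their SLO contribute nothing to goodput.
    """
    met = sum(1 for r in completed if r.latency_ms <= r.slo_ms)
    return met / window_s

# Example: three requests complete in a 1-second window;
# two meet their SLO, so goodput is 2.0 req/s while throughput is 3.0.
reqs = [Request(180, 200), Request(250, 200), Request(90, 100)]
print(goodput(reqs, window_s=1.0))  # → 2.0
```

An SLO-aware scheduler in the spirit described above would then allocate resources to maximize this quantity rather than aggregate throughput, deprioritizing requests that can no longer meet their objectives.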
- Graduation Semester
- 2025-12
- Type of Resource
- Thesis
- Copyright and License Information
- Copyright 2025 Zhiyu Wu
Owning Collections
Graduate Dissertations and Theses at Illinois (primary)