Towards efficient and reliable infrastructure for machine learning

Patke, Archit

Permalink

https://hdl.handle.net/2142/132496

Description

Title

Towards efficient and reliable infrastructure for machine learning

Author(s)

Patke, Archit

Issue Date

2025-12-05

Director of Research (if dissertation) or Advisor (if thesis)

Iyer, Ravishankar K

Doctoral Committee Chair(s)

Iyer, Ravishankar K

Committee Member(s)

Kim, Nam Sung
Huang, Jian
Srivatsa, Mudhakar

Department of Study

Electrical & Computer Eng

Discipline

Electrical & Computer Engr

Degree Granting Institution

University of Illinois Urbana-Champaign

Degree Name

Ph.D.

Degree Level

Dissertation

Keyword(s)

Machine learning infrastructure
Resource disaggregation
Network congestion control
Large language model serving
GPU reliability
Autoscaling
Page migration
High-performance computing

Language

eng

Abstract

The rapid advancement of machine learning, particularly large language and generative models, has enabled transformative applications, but at immense computational cost. Training and serving these models require tens of thousands of specialized accelerators that consume megawatts of power and cost hundreds of millions of dollars. The efficiency of the underlying compute infrastructure is therefore critical for making machine learning more accessible and sustainable. This thesis addresses infrastructure efficiency challenges along three interconnected dimensions: core infrastructure optimization, workload adaptation, and reliability enhancement. We present five contributions that introduce novel methodologies and system designs for improving efficiency, resource utilization, and reliability. First, in core infrastructure optimization, we address resource fragmentation by enabling resource disaggregation and resolving the network contention that arises in disaggregated systems. INDIGO addresses memory disaggregation challenges through network-aware page migration. The system uses contextual multi-armed bandits trained on historical application data to make migration decisions that account for both network transfer costs and memory access locality benefits. Netscope introduces a delay sensitivity-driven congestion mitigation framework that uses probabilistic regression models to quantify how applications are affected by network congestion. The framework dynamically adjusts congestion control parameters based on estimated delay sensitivity, selectively throttling applications with low sensitivity while protecting delay-sensitive ones. Second, in workload adaptation, we address the unique characteristics of modern ML workloads. QLM improves the efficiency of distributed inference for large language models by multiplexing interactive and batch requests. QLM leverages statistical properties of continuous batching to estimate request waiting times in queues, and it groups requests with similar performance characteristics to enable efficient decision-making. It also orchestrates request pulling, eviction, load balancing, and model swapping operations. Complementary to QLM, Chiron introduces hierarchical autoscaling that employs multi-level backpressure mechanisms that distinguish between interactive and batch requests. The framework dynamically adjusts batch sizes at the local level using reactive backpressure and makes global scaling decisions based on request waiting time estimation in order to meet service-level objectives while maximizing efficiency. Third, in reliability enhancement, we conduct a comprehensive characterization of GPU failures in modern AI accelerators through analysis of failure data from large-scale production clusters. Our methodology examines failure patterns across hardware components, failure types, propagation mechanisms, and system-wide impacts, thus providing insights into the interplay between hardware failures, resilience mechanisms, and application-level fault tolerance. Together, these contributions provide a holistic approach towards making AI infrastructure more efficient, adaptive, and resilient.

Graduation Semester

2025-12

Type of Resource

Thesis

Handle URL

https://hdl.handle.net/2142/132496

Owning Collections

Graduate Dissertations and Theses at Illinois (primary)

Graduate Theses and Dissertations at Illinois

Dissertations and Theses - Electrical and Computer Engineering

Dissertations and Theses in Electrical and Computer Engineering
