Towards efficient and reliable infrastructure for machine learning
Patke, Archit
Permalink
https://hdl.handle.net/2142/132496
Description
- Title
- Towards efficient and reliable infrastructure for machine learning
- Author(s)
- Patke, Archit
- Issue Date
- 2025-12-05
- Director of Research (if dissertation) or Advisor (if thesis)
- Iyer, Ravishankar K
- Doctoral Committee Chair(s)
- Iyer, Ravishankar K
- Committee Member(s)
- Kim, Nam Sung
- Huang, Jian
- Srivatsa, Mudhakar
- Department of Study
- Electrical & Computer Eng
- Discipline
- Electrical & Computer Engr
- Degree Granting Institution
- University of Illinois Urbana-Champaign
- Degree Name
- Ph.D.
- Degree Level
- Dissertation
- Keyword(s)
- Machine learning infrastructure
- Resource disaggregation
- Network congestion control
- Large language model serving
- GPU reliability
- Autoscaling
- Page migration
- High-performance computing
- Abstract
- The rapid advancement of machine learning, particularly large language and generative models, has enabled transformative applications but at immense computational cost. Training and serving these models requires tens of thousands of specialized accelerators that consume megawatts of power and cost hundreds of millions of dollars. The efficiency of the underlying compute infrastructure is, therefore, critical for making machine learning more accessible and sustainable. This thesis addresses infrastructure efficiency challenges along three interconnected dimensions: core infrastructure optimization, workload adaptation, and reliability enhancement. We present five contributions that introduce novel methodologies and system designs for improving efficiency, resource utilization, and reliability. First, in core infrastructure optimization, we address resource fragmentation by enabling resource disaggregation and resolving the network contention that arises in disaggregated systems. INDIGO addresses memory disaggregation challenges through network-aware page migration. The system uses contextual multi-armed bandits trained on historical application data to make migration decisions that account for network transfer costs and memory access locality benefits. Netscope introduces a delay sensitivity-driven congestion mitigation framework that quantifies how applications are affected by network congestion using probabilistic regression models. The framework dynamically adjusts congestion control parameters based on estimated delay sensitivity, selectively throttling applications with low sensitivity while protecting delay-sensitive ones. Second, in workload adaptation, we address the unique characteristics of modern ML workloads. QLM improves the efficiency of distributed inference for large language models by multiplexing interactive and batch requests. QLM leverages statistical properties of continuous batching to estimate request waiting times in queues, groups requests with similar performance characteristics to enable efficient decision-making, and orchestrates request pulling, eviction, load balancing, and model swapping operations. Complementary to QLM, Chiron introduces hierarchical autoscaling that employs multi-level backpressure mechanisms that distinguish between interactive and batch requests. The framework dynamically adjusts batch sizes at the local level using reactive backpressure and makes global scaling decisions based on request waiting time estimation to meet service-level objectives while maximizing efficiency. Third, in reliability enhancement, we conduct a comprehensive characterization of GPU failures in modern AI accelerators through analysis of failure data from large-scale production clusters. Our methodology examines failure patterns across hardware components, failure types, propagation mechanisms, and system-wide impacts, thus providing insights into the interplay between hardware failures, resilience mechanisms, and application-level fault tolerance. Together, these contributions provide a holistic approach towards making AI infrastructure more efficient, adaptive, and resilient.
- Graduation Semester
- 2025-12
- Type of Resource
- Thesis
- Handle URL
- https://hdl.handle.net/2142/132496
- Copyright and License Information
- Copyright 2025 Archit Patke
Owning Collections
Graduate Dissertations and Theses at Illinois PRIMARY
Graduate Theses and Dissertations at Illinois