A large-scale model serving framework
Zhou, Qinren
Permalink
https://hdl.handle.net/2142/124583
Description
- Title
- A large-scale model serving framework
- Author(s)
- Zhou, Qinren
- Issue Date
- 2024-04-30
- Director of Research (if dissertation) or Advisor (if thesis)
- Kindratenko, Volodymyr
- Department of Study
- Electrical & Computer Eng
- Discipline
- Electrical & Computer Engr
- Degree Granting Institution
- University of Illinois at Urbana-Champaign
- Degree Name
- M.S.
- Degree Level
- Thesis
- Keyword(s)
- HPC
- LLM
- Language
- eng
- Abstract
- Recent advancements in large-scale models have significantly enhanced the capabilities of artificial intelligence. For instance, generative large language models (LLMs) have broadly transformed our interactions with websites, devices, and information in general. These models are increasingly valuable across various domains but pose substantial deployment challenges. With tens or hundreds of billions of parameters, they demand intensive computation and high-end GPUs with large memory capacities. High-performance computing (HPC) clusters, typically equipped with considerable GPU resources, are suitable for serving large-scale models. Nevertheless, unlike the scalable and nearly limitless resources of cloud services, their resources are finite and may become inadequate if user requests surpass the cluster’s capacity. Moreover, HPC clusters lack autoscaling capabilities, which makes it harder to accommodate dynamic user demand and optimize resource utilization. This thesis presents a model-serving framework to optimize resource utilization and reduce model inference latency. By leveraging the Ray Serve library, the framework automatically manages the deployment and reclamation of models, thereby enabling efficient concurrent service to multiple users. The system incorporates a model-switching mechanism and a Slurm-compatible autoscaler. These features are specifically designed to address the finite resources and missing autoscaling of traditional HPC clusters and to significantly improve system responsiveness and user experience in AI applications.
- Graduation Semester
- 2024-05
- Type of Resource
- Text
- Handle URL
- https://hdl.handle.net/2142/124583
- Copyright and License Information
- Copyright 2024 Qinren Zhou
Owning Collections
- Graduate Dissertations and Theses at Illinois (primary)
- Dissertations and Theses - Electrical and Computer Engineering