LPC: Lossless parameter compression for deploying large language model inference on edge devices
Wang, Nachuan
Permalink
https://hdl.handle.net/2142/127242
Description
- Title
- LPC: Lossless parameter compression for deploying large language model inference on edge devices
- Author(s)
- Wang, Nachuan
- Issue Date
- 2024-12-02
- Director of Research (if dissertation) or Advisor (if thesis)
- Kim, Nam Sung
- Department of Study
- Electrical & Computer Engineering
- Discipline
- Electrical & Computer Engineering
- Degree Granting Institution
- University of Illinois at Urbana-Champaign
- Degree Name
- M.S.
- Degree Level
- Thesis
- Keyword(s)
- Large Language Models
- Model Compression
- Deep Learning Accelerator
- Abstract
- The deployment of large language models (LLMs) with billions of parameters, particularly on edge systems, is challenging because of the limited memory capacity of edge accelerators, necessitating offloading the storage of parameters to the host memory. This leads to the frequent transfer of parameters over low-bandwidth PCIe, hindering the system’s ability to meet latency and throughput requirements for inference. Many lossy model compression techniques have been proposed to facilitate inference in resource-constrained systems, but they compromise model performance and require time-consuming, model-specific tuning. To tackle this challenge, we present a Lossless Parameter Compression (LPC) technique that exploits the unique numerical characteristics of LLM parameters. Using simple bit-level manipulation of parameters with the popular and hardware complexity-efficient LZ4 algorithm, LPC achieves high compression ratios and high decompression throughput. To demonstrate the efficiency of LPC, we redesign the hardware and software stack of an open-source CNN accelerator to accelerate PyTorch-based LLMs in a host-device full-system environment and integrate it with LPC. By transferring compressed parameters over PCIe and decompressing them on-the-fly within the accelerator, LPC alleviates the PCIe bottlenecks. In evaluations with the OPT-13B and Llama2-13B models, LPC reduces parameter sizes by 26.0–26.2% for BF16 and 19.4–41.7% for INT8 formats, translating into equivalent PCIe bandwidth savings and inference latency speedups of 1.31–1.32× and 1.19–1.43×, respectively.
- Graduation Semester
- 2024-12
- Type of Resource
- Thesis
- Handle URL
- https://hdl.handle.net/2142/127242
- Copyright and License Information
- Copyright 2024 Nachuan Wang
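
The abstract describes the technique only at a high level: simple bit-level manipulation of parameters followed by LZ4 compression, exploiting the numerical characteristics of LLM weights. The exact transform is part of the thesis itself, so the snippet below is only a minimal, hypothetical sketch of the general idea, assuming one common lossless trick for floating-point weights: regrouping the bytes of BF16 values into separate planes so that a byte-oriented compressor such as LZ4 finds longer matches. The function name `compress_bf16_byteplanes` and the byte-plane layout are illustrative assumptions, not taken from the thesis.

```python
# Hypothetical sketch only: byte-plane regrouping of BF16 weights before LZ4.
# The actual LPC bit-level manipulation is not specified in the abstract.
import numpy as np
import lz4.frame  # third-party package: pip install lz4


def compress_bf16_byteplanes(params_u16: np.ndarray) -> bytes:
    """Losslessly compress a flat array of BF16 parameters stored as uint16."""
    # Reinterpret each 16-bit value as two bytes (little-endian: low, high).
    as_bytes = params_u16.view(np.uint8).reshape(-1, 2)
    # High bytes hold the sign and most of the exponent; across trained LLM
    # weights they cluster around a few values, so grouping them into one
    # plane lets LZ4 find much longer repeated byte sequences.
    hi_plane = as_bytes[:, 1].tobytes()
    lo_plane = as_bytes[:, 0].tobytes()
    return lz4.frame.compress(hi_plane + lo_plane)


# Example: weights drawn from a narrow Gaussian, truncated to BF16
# (keep the upper 16 bits of the float32 representation).
w = np.random.normal(0.0, 0.02, size=1 << 20).astype(np.float32)
bf16 = ((w.view(np.uint32) >> 16).astype(np.uint16))
blob = compress_bf16_byteplanes(bf16)
print(len(blob) / bf16.nbytes)  # compressed-to-original size ratio
```

Because the transform is a pure byte permutation, decompression plus the inverse regrouping recovers the parameters bit-exactly, which is what makes this style of compression lossless and suitable for on-the-fly decompression on the accelerator side of the PCIe link.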
Owning Collections
Graduate Dissertations and Theses at Illinois (Primary)