LPC: Lossless parameter compression for deploying large language model inference on edge devices
Wang, Nachuan
Permalink
https://hdl.handle.net/2142/127242
Description
Issue Date
2024-12-02
Director of Research (if dissertation) or Advisor (if thesis)
Kim, Nam Sung
Department of Study
Electrical & Computer Engineering
Discipline
Electrical & Computer Engineering
Degree Granting Institution
University of Illinois at Urbana-Champaign
Degree Name
M.S.
Degree Level
Thesis
Keyword(s)
Large Language Models
Model Compression
Deep Learning Accelerator
Language
eng
Abstract
Deploying large language models (LLMs) with billions of parameters, particularly on edge systems, is challenging because the limited memory capacity of edge accelerators necessitates offloading parameter storage to host memory. This leads to frequent parameter transfers over low-bandwidth PCIe, hindering the system's ability to meet latency and throughput requirements for inference. Many lossy model compression techniques have been proposed to facilitate inference in resource-constrained systems, but they compromise model performance and require time-consuming, model-specific tuning. To tackle this challenge, we present a Lossless Parameter Compression (LPC) technique that exploits the unique numerical characteristics of LLM parameters. By combining simple bit-level manipulation of parameters with the popular, hardware-efficient LZ4 algorithm, LPC achieves high compression ratios and high decompression throughput. To demonstrate the efficiency of LPC, we redesign the hardware and software stack of an open-source CNN accelerator to accelerate PyTorch-based LLMs in a host-device full-system environment and integrate it with LPC. By transferring compressed parameters over PCIe and decompressing them on-the-fly within the accelerator, LPC alleviates the PCIe bottleneck. In evaluations with the OPT-13B and Llama2-13B models, LPC reduces parameter sizes by 26.0–26.2% for BF16 and 19.4–41.7% for INT8 formats, translating into equivalent PCIe bandwidth savings and inference latency speedups of 1.31–1.32× and 1.19–1.43×, respectively.
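The abstract does not spell out the bit-level transformation, but one common approach consistent with its description is to regroup the bytes of BF16 parameters so that the sign/exponent bytes, which cluster tightly in trained LLM weights, sit contiguously before LZ4 sees them. The sketch below is illustrative only: the function name bf16_byte_split, the synthetic Gaussian weights, and the byte-grouping scheme are assumptions, not the thesis's exact method.

import numpy as np
import lz4.frame  # pip install lz4

def bf16_byte_split(weights_f32: np.ndarray) -> bytes:
    """Round float32 weights to BF16 (keep the upper 16 bits), then group
    the high bytes (sign + exponent) and low bytes (mantissa) so the
    highly repetitive exponent stream is contiguous for LZ4.
    NOTE: an assumed transformation, not LPC's documented one."""
    u32 = weights_f32.view(np.uint32)
    bf16 = (u32 >> 16).astype(np.uint16)        # truncate float32 -> BF16
    raw = bf16.view(np.uint8).reshape(-1, 2)    # little-endian byte pairs
    hi = raw[:, 1].tobytes()                    # sign + exponent bytes
    lo = raw[:, 0].tobytes()                    # mantissa bytes
    return hi + lo

# Synthetic stand-in for LLM weights: small, roughly Gaussian values.
w = (np.random.randn(1 << 20) * 0.02).astype(np.float32)
plain = ((w.view(np.uint32) >> 16).astype(np.uint16)).tobytes()
split = bf16_byte_split(w)

# Interleaved bytes compress poorly; grouped exponent bytes typically
# compress noticeably better because they take only a few values.
print(len(lz4.frame.compress(plain)) / len(plain))
print(len(lz4.frame.compress(split)) / len(plain))

On data like this, most of the saving comes from the exponent-byte stream, while the near-random mantissa bytes stay essentially incompressible; that pattern is consistent with the ~26% BF16 reduction the abstract reports, though the actual LPC pipeline may differ.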