LPC: Lossless parameter compression for deploying large language model inference on edge devices
Wang, Nachuan
Permalink
https://hdl.handle.net/2142/127242
Description
Issue Date
2024-12-02
Director of Research (if dissertation) or Advisor (if thesis)
Kim, Nam Sung
Department of Study
Electrical & Computer Engineering
Discipline
Electrical & Computer Engineering
Degree Granting Institution
University of Illinois at Urbana-Champaign
Degree Name
M.S.
Degree Level
Thesis
Keyword(s)
Large Language Models
Model Compression
Deep Learning Accelerator
Language
eng
Abstract
Deploying large language models (LLMs) with billions of parameters, particularly on edge systems, is challenging because the limited memory capacity of edge accelerators necessitates offloading parameter storage to host memory. This leads to frequent parameter transfers over low-bandwidth PCIe, hindering the system's ability to meet latency and throughput requirements for inference. Many lossy model compression techniques have been proposed to facilitate inference in resource-constrained systems, but they compromise model performance and require time-consuming, model-specific tuning. To tackle this challenge, we present a Lossless Parameter Compression (LPC) technique that exploits the unique numerical characteristics of LLM parameters. By combining simple bit-level manipulation of parameters with the popular, hardware-efficient LZ4 algorithm, LPC achieves high compression ratios and high decompression throughput. To demonstrate the efficiency of LPC, we redesign the hardware and software stack of an open-source CNN accelerator to accelerate PyTorch-based LLMs in a host-device full-system environment and integrate it with LPC. By transferring compressed parameters over PCIe and decompressing them on-the-fly within the accelerator, LPC alleviates the PCIe bottleneck. In evaluations with the OPT-13B and Llama2-13B models, LPC reduces parameter sizes by 26.0–26.2% for BF16 and 19.4–41.7% for INT8 formats, translating into equivalent PCIe bandwidth savings and inference latency speedups of 1.31–1.32× and 1.19–1.43×, respectively.
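The abstract does not spell out the bit-level transformation, but one common approach consistent with its description is to regroup the bytes of BF16 parameters so that the sign/exponent bytes, which cluster tightly in trained LLM weights, sit contiguously before LZ4 sees them. The sketch below is illustrative only: the function name bf16_byte_split, the synthetic Gaussian weights, and the byte-grouping scheme are assumptions, not the thesis's exact method.

import numpy as np
import lz4.frame  # pip install lz4

def bf16_byte_split(weights_f32: np.ndarray) -> bytes:
    """Round float32 weights to BF16 (keep the upper 16 bits), then group
    the high bytes (sign + exponent) and low bytes (mantissa) so the
    highly repetitive exponent stream is contiguous for LZ4.
    NOTE: an assumed transformation, not LPC's documented one."""
    u32 = weights_f32.view(np.uint32)
    bf16 = (u32 >> 16).astype(np.uint16)        # truncate float32 -> BF16
    raw = bf16.view(np.uint8).reshape(-1, 2)    # little-endian byte pairs
    hi = raw[:, 1].tobytes()                    # sign + exponent bytes
    lo = raw[:, 0].tobytes()                    # mantissa bytes
    return hi + lo

# Synthetic stand-in for LLM weights: small, roughly Gaussian values.
w = (np.random.randn(1 << 20) * 0.02).astype(np.float32)
plain = ((w.view(np.uint32) >> 16).astype(np.uint16)).tobytes()
split = bf16_byte_split(w)

# Interleaved bytes compress poorly; grouped exponent bytes typically
# compress noticeably better because they take only a few values.
print(len(lz4.frame.compress(plain)) / len(plain))
print(len(lz4.frame.compress(split)) / len(plain))

On data like this, most of the saving comes from the exponent-byte stream, while the near-random mantissa bytes stay essentially incompressible; that pattern is consistent with the ~26% BF16 reduction the abstract reports, though the actual LPC pipeline may differ.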