Pushing the limits of long context LLM inference via KV cache compression
Sharma, Akshat
Permalink
https://hdl.handle.net/2142/129176
Description
Issue Date
2025-04-20
Advisor
Zhang, Minjia
Department of Study
Siebel School of Computing and Data Science
Discipline
Computer Science
Degree Granting Institution
University of Illinois Urbana-Champaign
Degree Name
M.S.
Degree Level
Thesis
Keyword(s)
Systems for LLM
LLM Inference
Language
eng
Abstract
Efficiently deploying and serving large language models (LLMs) has become remarkably challenging due to their substantial memory and computational requirements. A critical bottleneck in LLM inference is the memory footprint of the Key-Value (KV) cache, particularly in tasks involving long-context understanding and generation. To address these challenges, we introduce MiniKV, a hybrid KV cache optimization technique that compresses the KV cache by combining token eviction with 2-bit quantization. Our approach significantly reduces memory usage while maintaining high accuracy on downstream tasks such as question answering, summarization, code generation, and retrieval. Our evaluations demonstrate that MiniKV achieves an 86% reduction in KV cache size while recovering over 98.5% accuracy across downstream tasks, setting a new state of the art in balancing accuracy and compression, with notable improvements in inference latency and throughput.
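To make the hybrid scheme in the abstract concrete, the sketch below combines attention-based token eviction with per-channel 2-bit quantization in PyTorch. This is a minimal illustration under stated assumptions, not the thesis implementation: the names compress_kv_cache, quantize_2bit, and dequantize_2bit are hypothetical, cumulative attention scores are only one possible importance signal, and keep_ratio=0.14 is chosen solely to mirror the reported 86% cache reduction.

```python
import torch

def compress_kv_cache(keys, values, attn_scores, keep_ratio=0.14):
    """Hypothetical hybrid KV compression: evict low-importance tokens,
    then quantize the survivors to 2 bits (per-channel, asymmetric).

    keys, values: [seq_len, num_heads, head_dim]
    attn_scores:  [seq_len] cumulative attention each token has received
    """
    seq_len = keys.shape[0]
    keep = max(1, int(seq_len * keep_ratio))
    # Token eviction: retain only the most-attended tokens, in original order.
    idx = torch.topk(attn_scores, keep).indices.sort().values
    k, v = keys[idx], values[idx]

    def quantize_2bit(x):
        # Map each head-dim vector onto 4 levels {0, 1, 2, 3}.
        xmin = x.amin(dim=-1, keepdim=True)
        xmax = x.amax(dim=-1, keepdim=True)
        scale = (xmax - xmin).clamp(min=1e-8) / 3  # 2 bits -> 4 levels
        q = ((x - xmin) / scale).round().clamp(0, 3).to(torch.uint8)
        # A real system would bit-pack the codes four to a byte;
        # uint8 storage here is for readability only.
        return q, scale, xmin

    return quantize_2bit(k), quantize_2bit(v), idx

def dequantize_2bit(q, scale, xmin):
    # Recover an approximation of the original tensor for attention.
    return q.float() * scale + xmin

# Toy usage on random tensors.
L, H, D = 1024, 8, 64
keys, values = torch.randn(L, H, D), torch.randn(L, H, D)
scores = torch.rand(L)  # stand-in for accumulated attention weights
(qk, k_scale, k_min), (qv, v_scale, v_min), kept = compress_kv_cache(keys, values, scores)
k_approx = dequantize_2bit(qk, k_scale, k_min)  # [143, 8, 64] after eviction
```

In this toy setup the cache shrinks along both axes the abstract names: eviction drops roughly 86% of the tokens, and quantization cuts each surviving entry from 16-bit floats to 2-bit codes plus small per-channel scale and offset tensors.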