Pushing the limits of long context LLM inference via KV cache compression
Sharma, Akshat
Description
- Title: Pushing the limits of long context LLM inference via KV cache compression
- Author(s): Sharma, Akshat
- Issue Date: 2025-04-20
- Director of Research (if dissertation) or Advisor (if thesis): Zhang, Minjia
- Department of Study: Siebel School Comp & Data Sci
- Discipline: Computer Science
- Degree Granting Institution: University of Illinois Urbana-Champaign
- Degree Name: M.S.
- Degree Level: Thesis
- Keyword(s): Systems for LLM, LLM Inference
- Abstract: Efficiently deploying and serving LLMs has become remarkably challenging due to their excessive memory and computational requirements. A critical bottleneck in LLM inference is the memory footprint of the Key-Value (KV) cache, particularly in tasks involving long-context understanding and generation. To address these challenges, we introduce MiniKV, a hybrid KV cache optimization technique that compresses the KV cache by combining token eviction with 2-bit quantization. Our approach aims to significantly reduce memory usage while maintaining high accuracy on downstream tasks such as question answering, summarization, code generation, and retrieval. Our evaluations demonstrate that MiniKV achieves an 86% reduction in KV cache size while recovering over 98.5% accuracy across downstream tasks. This sets a new state of the art in balancing accuracy and compression, with notable improvements in inference latency and throughput. (A hedged sketch of the eviction-plus-quantization idea follows the record details below.)
- Graduation Semester: 2025-05
- Type of Resource: Thesis
- Handle URL: https://hdl.handle.net/2142/129176
- Copyright and License Information: Copyright 2025 Akshat Sharma
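
The abstract above describes combining token eviction with 2-bit quantization of the KV cache. The sketch below is a rough, hypothetical illustration of that combination, not the thesis's actual implementation: it keeps the tokens with the highest importance scores (a heavy-hitter-style eviction criterion is assumed) and then applies group-wise asymmetric 2-bit quantization to the surviving keys and values. The function names, `keep_ratio`, and `group_size` are illustrative assumptions; MiniKV's real selection policy, quantization scheme, and storage layout may differ.

```python
# Hypothetical sketch (not the thesis code): compress a KV cache by
# (1) evicting low-importance tokens and (2) 2-bit quantizing what remains.
import torch

def evict_tokens(keys, values, attn_scores, keep_ratio=0.5):
    """Keep the tokens with the highest accumulated attention mass.

    keys, values: [seq_len, head_dim]; attn_scores: [seq_len] importance
    (e.g. column sums of the attention matrix, as in heavy-hitter eviction).
    """
    keep = max(1, int(keys.shape[0] * keep_ratio))
    idx = torch.topk(attn_scores, keep).indices.sort().values  # keep original token order
    return keys[idx], values[idx]

def quantize_2bit(x, group_size=64):
    """Asymmetric 2-bit quantization over contiguous groups of elements.

    Quantized values are held in uint8 here for simplicity; a real kernel
    would pack four 2-bit values per byte.
    """
    flat = x.reshape(-1, group_size)
    lo = flat.min(dim=1, keepdim=True).values
    hi = flat.max(dim=1, keepdim=True).values
    scale = (hi - lo).clamp(min=1e-8) / 3.0          # 2 bits -> 4 levels (0..3)
    q = ((flat - lo) / scale).round().clamp(0, 3).to(torch.uint8)
    return q, scale, lo

def dequantize_2bit(q, scale, lo, shape):
    return (q.float() * scale + lo).reshape(shape)

# Toy usage: 128 cached tokens, head dim 64; keep 50% of tokens, then quantize.
torch.manual_seed(0)
k = torch.randn(128, 64)
v = torch.randn(128, 64)
importance = torch.rand(128)                 # stand-in for accumulated attention
k_kept, v_kept = evict_tokens(k, v, importance, keep_ratio=0.5)
qk, sk, zk = quantize_2bit(k_kept)
k_restored = dequantize_2bit(qk, sk, zk, k_kept.shape)
print(k_kept.shape, float((k_kept - k_restored).abs().mean()))
```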
Owning Collections
Graduate Dissertations and Theses at Illinois (Primary)