Efficient LLM training and inference with contextual sparsity
Ge, Suyu
Permalink
https://hdl.handle.net/2142/129584
Description
- Title
- Efficient LLM training and inference with contextual sparsity
- Author(s)
- Ge, Suyu
- Issue Date
- 2025-04-25
- Director of Research (if dissertation) or Advisor (if thesis)
- Han, Jiawei
- Peng, Hao
- Doctoral Committee Chair(s)
- Han, Jiawei
- Peng, Hao
- Committee Member(s)
- Hakkani-Tür, Dilek
- Gao, Jianfeng
- Department of Study
- Siebel School of Computing and Data Science
- Discipline
- Computer Science
- Degree Granting Institution
- University of Illinois Urbana-Champaign
- Degree Name
- Ph.D.
- Degree Level
- Dissertation
- Keyword(s)
- Large Language Model Efficiency
- Abstract
- Large Language Models (LLMs) have demonstrated impressive capabilities across a wide range of natural language tasks. However, their computational demands present significant challenges, particularly regarding memory consumption and processing speed for long contexts. This thesis addresses these challenges through three interconnected research directions that collectively form a comprehensive framework for efficient transformer-based models:
- FastGen: an adaptive KV cache compression technique that analyzes intrinsic attention structures and dynamically retains only essential key-value pairs during inference. By recognizing that different attention heads exhibit distinct patterns (e.g., focusing on local context, on special tokens, or broadly attending to all tokens), FastGen reduces KV cache memory consumption by up to 40% with negligible impact on generation quality (a minimal illustrative sketch of this style of head-adaptive eviction follows the metadata below).
- LongGen: a hybrid architecture that enables efficient long-context processing by attending to partial contexts. This approach finetunes pretrained LLMs into an efficient architecture during context-length extension, integrating sparse attention in strategic layers while maintaining full attention where needed. LongGen achieves a 36% reduction in training time and a 62% reduction in KV cache memory while maintaining strong performance on challenging long-context tasks, including perfect accuracy on needle-in-a-haystack retrieval at 128K tokens.
- S2-Attention: a hardware-aware sparse attention kernel that significantly accelerates both training and inference. By implementing context sharding across attention heads with careful consideration of GPU memory access patterns, S2-Attention achieves up to a 15.9× speedup in attention computation and a 4.5× end-to-end inference acceleration compared to dense attention baselines.
- Our comprehensive evaluations across different model scales (1.3B to 70B parameters) and benchmarks demonstrate that these approaches not only improve computational efficiency but also maintain or enhance model performance on challenging tasks. Together, these contributions advance the state of the art in efficient LLM deployment, enabling better utilization of computational resources and expanding the practical applications of these powerful models in resource-constrained environments and long-context scenarios.
- Graduation Semester
- 2025-05
- Type of Resource
- Thesis
- Handle URL
- https://hdl.handle.net/2142/129584
- Copyright and License Information
- Copyright 2025 Suyu Ge
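The abstract describes FastGen's head-adaptive KV cache eviction only at a high level. The sketch below is a minimal, hypothetical illustration of that idea in PyTorch, not code from the thesis: it classifies one attention head's prompt attention as "special-token", "local", or "full" and drops cache entries accordingly. The function names (`profile_head`, `compress_kv`), the 0.95 coverage threshold, and the three-way classification are assumptions made purely for this example.

```python
# Illustrative sketch of head-adaptive KV-cache compression in the spirit of
# FastGen, as summarized in the abstract. Hypothetical names and thresholds.
import torch


def profile_head(attn_weights: torch.Tensor, local_window: int,
                 special_ids: torch.Tensor, token_ids: torch.Tensor,
                 threshold: float = 0.95) -> str:
    """Classify one head's attention pattern over a prompt.

    attn_weights: (seq_len, seq_len) attention probabilities for one head.
    Returns "special", "local", or "full", depending on which compact
    structure already captures `threshold` of the attention mass.
    """
    seq_len = attn_weights.size(0)
    total = attn_weights.sum()

    # Mass landing on special tokens (e.g. BOS) used as key positions.
    special_mask = torch.isin(token_ids, special_ids)
    special_mass = attn_weights[:, special_mask].sum()
    if special_mass / total >= threshold:
        return "special"

    # Mass inside a sliding local window around the diagonal, plus specials.
    idx = torch.arange(seq_len)
    local_mask = (idx.unsqueeze(0) - idx.unsqueeze(1)).abs() <= local_window
    combined = local_mask | special_mask.unsqueeze(0)
    if (attn_weights * combined).sum() / total >= threshold:
        return "local"

    return "full"  # fall back to keeping the whole cache for this head


def compress_kv(keys: torch.Tensor, values: torch.Tensor, policy: str,
                token_ids: torch.Tensor, special_ids: torch.Tensor,
                local_window: int):
    """Drop KV entries a head's policy says it will not need.

    keys/values: (seq_len, head_dim) for one head. Returns retained slices.
    """
    seq_len = keys.size(0)
    keep = torch.zeros(seq_len, dtype=torch.bool)
    if policy == "special":
        keep |= torch.isin(token_ids, special_ids)
    elif policy == "local":
        keep |= torch.isin(token_ids, special_ids)
        keep[-local_window:] = True          # recent context always stays
    else:                                    # "full": keep everything
        keep[:] = True
    return keys[keep], values[keep]


# Example usage on random data (illustration only):
if __name__ == "__main__":
    seq_len, head_dim = 32, 64
    token_ids = torch.randint(5, 1000, (seq_len,))
    token_ids[0] = 1                         # pretend token id 1 is BOS
    special_ids = torch.tensor([1])
    scores = torch.randn(seq_len, seq_len)
    causal = torch.full_like(scores, float("-inf")).triu(1)
    attn = torch.softmax(scores + causal, dim=-1)
    policy = profile_head(attn, local_window=8,
                          special_ids=special_ids, token_ids=token_ids)
    k, v = torch.randn(seq_len, head_dim), torch.randn(seq_len, head_dim)
    k_kept, v_kept = compress_kv(k, v, policy, token_ids, special_ids, 8)
    print(policy, k_kept.shape)
```

In a FastGen-style pipeline, such profiling would typically run once on the prompt, with the resulting per-head policy then applied to KV entries produced during generation; the kernel-level and architectural techniques of S2-Attention and LongGen are complementary and not shown here.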
Owning Collections
Graduate Dissertations and Theses at Illinois (Primary)