Efficient LLM training and inference with contextual sparsity
Ge, Suyu
Permalink
https://hdl.handle.net/2142/129584
Description
- Title
- Efficient LLM training and inference with contextual sparsity
- Author(s)
- Ge, Suyu
- Issue Date
- 2025-04-25
- Director of Research (if dissertation) or Advisor (if thesis)
- Han, Jiawei
- Peng, Hao
- Doctoral Committee Chair(s)
- Han, Jiawei
- Peng, Hao
- Committee Member(s)
- Hakkani-Tür, Dilek
- Gao, Jianfeng
- Department of Study
- Siebel School of Computing and Data Science
- Discipline
- Computer Science
- Degree Granting Institution
- University of Illinois Urbana-Champaign
- Degree Name
- Ph.D.
- Degree Level
- Dissertation
- Keyword(s)
- Large Language Model Efficiency
- Abstract
- Large Language Models (LLMs) have demonstrated impressive capabilities across a wide range of natural language tasks. However, their computational demands present significant challenges, particularly regarding memory consumption and processing speed for long contexts. This thesis addresses these challenges through three interconnected research directions that collectively form a comprehensive framework for efficient transformer-based models:
- FastGen: an adaptive KV cache compression technique that analyzes intrinsic attention structures and dynamically retains only essential key-value pairs during inference. By recognizing that different attention heads exhibit distinct patterns (e.g., focusing on local context, on special tokens, or broadly attending to all tokens), FastGen reduces KV cache memory consumption by up to 40% with negligible impact on generation quality (a minimal illustrative sketch of this style of head-adaptive eviction follows the metadata below).
- LongGen: a hybrid architecture that enables efficient long-context processing by attending to partial contexts. This approach finetunes pretrained LLMs into an efficient architecture during context-length extension, integrating sparse attention in strategic layers while maintaining full attention where needed. LongGen achieves a 36% reduction in training time and a 62% reduction in KV cache memory while maintaining strong performance on challenging long-context tasks, including perfect accuracy on needle-in-a-haystack retrieval at 128K tokens.
- S2-Attention: a hardware-aware sparse attention kernel that significantly accelerates both training and inference. By implementing context sharding across attention heads with careful consideration of GPU memory access patterns, S2-Attention achieves up to a 15.9× speedup in attention computation and a 4.5× end-to-end inference acceleration compared to dense attention baselines.
- Our comprehensive evaluations across different model scales (1.3B to 70B parameters) and benchmarks demonstrate that these approaches not only improve computational efficiency but also maintain or enhance model performance on challenging tasks. Together, these contributions advance the state of the art in efficient LLM deployment, enabling better utilization of computational resources and expanding the practical applications of these powerful models in resource-constrained environments and long-context scenarios.
- Graduation Semester
- 2025-05
- Type of Resource
- Thesis
- Handle URL
- https://hdl.handle.net/2142/129584
- Copyright and License Information
- Copyright 2025 Suyu Ge
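The abstract describes FastGen's head-adaptive KV cache eviction only at a high level. The sketch below is a minimal, hypothetical illustration of that idea in PyTorch, not code from the thesis: it classifies one attention head's prompt attention as "special-token", "local", or "full" and drops cache entries accordingly. The function names (`profile_head`, `compress_kv`), the 0.95 coverage threshold, and the three-way classification are assumptions made purely for this example.

```python
# Illustrative sketch of head-adaptive KV-cache compression in the spirit of
# FastGen, as summarized in the abstract. Hypothetical names and thresholds.
import torch


def profile_head(attn_weights: torch.Tensor, local_window: int,
                 special_ids: torch.Tensor, token_ids: torch.Tensor,
                 threshold: float = 0.95) -> str:
    """Classify one head's attention pattern over a prompt.

    attn_weights: (seq_len, seq_len) attention probabilities for one head.
    Returns "special", "local", or "full", depending on which compact
    structure already captures `threshold` of the attention mass.
    """
    seq_len = attn_weights.size(0)
    total = attn_weights.sum()

    # Mass landing on special tokens (e.g. BOS) used as key positions.
    special_mask = torch.isin(token_ids, special_ids)
    special_mass = attn_weights[:, special_mask].sum()
    if special_mass / total >= threshold:
        return "special"

    # Mass inside a sliding local window around the diagonal, plus specials.
    idx = torch.arange(seq_len)
    local_mask = (idx.unsqueeze(0) - idx.unsqueeze(1)).abs() <= local_window
    combined = local_mask | special_mask.unsqueeze(0)
    if (attn_weights * combined).sum() / total >= threshold:
        return "local"

    return "full"  # fall back to keeping the whole cache for this head


def compress_kv(keys: torch.Tensor, values: torch.Tensor, policy: str,
                token_ids: torch.Tensor, special_ids: torch.Tensor,
                local_window: int):
    """Drop KV entries a head's policy says it will not need.

    keys/values: (seq_len, head_dim) for one head. Returns retained slices.
    """
    seq_len = keys.size(0)
    keep = torch.zeros(seq_len, dtype=torch.bool)
    if policy == "special":
        keep |= torch.isin(token_ids, special_ids)
    elif policy == "local":
        keep |= torch.isin(token_ids, special_ids)
        keep[-local_window:] = True          # recent context always stays
    else:                                    # "full": keep everything
        keep[:] = True
    return keys[keep], values[keep]


# Example usage on random data (illustration only):
if __name__ == "__main__":
    seq_len, head_dim = 32, 64
    token_ids = torch.randint(5, 1000, (seq_len,))
    token_ids[0] = 1                         # pretend token id 1 is BOS
    special_ids = torch.tensor([1])
    scores = torch.randn(seq_len, seq_len)
    causal = torch.full_like(scores, float("-inf")).triu(1)
    attn = torch.softmax(scores + causal, dim=-1)
    policy = profile_head(attn, local_window=8,
                          special_ids=special_ids, token_ids=token_ids)
    k, v = torch.randn(seq_len, head_dim), torch.randn(seq_len, head_dim)
    k_kept, v_kept = compress_kv(k, v, policy, token_ids, special_ids, 8)
    print(policy, k_kept.shape)
```

In a FastGen-style pipeline, such profiling would typically run once on the prompt, with the resulting per-head policy then applied to KV entries produced during generation; the kernel-level and architectural techniques of S2-Attention and LongGen are complementary and not shown here.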
Owning Collections
Graduate Dissertations and Theses at Illinois (Primary)