Dynamic sparsity: enabling efficient, interpretable and generalizable sequence models
Ren, Liliang
Permalink
https://hdl.handle.net/2142/125587
Description
- Title
- Dynamic sparsity: enabling efficient, interpretable and generalizable sequence models
- Author(s)
- Ren, Liliang
- Issue Date
- 2024-07-12
- Director of Research (if dissertation) or Advisor (if thesis)
- Zhai, ChengXiang
- Doctoral Committee Chair(s)
- Zhai, ChengXiang
- Committee Member(s)
- Peng, Hao
- Zhao, Han
- Chang, Kevin Chen-Chuan
- Liu, Yang
- Department of Study
- Siebel Computing & Data Science
- Discipline
- Computer Science
- Degree Granting Institution
- University of Illinois at Urbana-Champaign
- Degree Name
- Ph.D.
- Degree Level
- Dissertation
- Keyword(s)
- Dynamic Sparsity
- Neural Network
- Deep Learning Architecture
- Natural Language Processing
- Machine Learning
- Abstract
- As humans, we consume natural signals and generate structured information to understand our world and propagate knowledge. Equipped with dynamically and sparsely activated neural networks, we learn efficiently, make interpretable decisions, and quickly generalize learned knowledge to unseen scenarios. Recent advances in large language models (LLMs) have shown impressive results from artificial neural networks in understanding the real world through sequential data. Inspired by the dynamic sparsity present in biological neural networks, this thesis aims to improve the fundamental architecture of artificial sequence models so that they can complete tasks with improved efficiency, interpretability, and generalizability.
We begin by investigating how to develop efficient neural architectures that extract structured information from natural text sequences under human supervision. In particular, we focus on extracting a knowledge graph that provides a sparse representation of the relationships among the important entities in a sentence. We show that a neural network that generates an alternating sequence of nodes and edges with hybrid span decoding can achieve state-of-the-art performance with linear time and space complexity.
Supervised learning of knowledge graph extraction requires expensive human effort for tagging relations and entity types in text sequences. This problem becomes more severe in the scientific discovery domain because of the cost of finding capable human experts for data annotation. We propose a pre-training objective, together with a neural sequence model, to sparsely and dynamically extract sentence-level keyword representations with diverse latent types. We show that our model learns interpretable tagging of sentence elements at the latent level without using any human annotations, and that its learned knowledge can be quickly adapted to new domains with few labels.
We further explore how to introduce dynamic sparsity into general sequence models for efficient long-sequence modeling. We first propose a general mechanism that enables neural networks to sparsely and dynamically activate submodules for sequence elements with end-to-end differentiability. We then design a neural architecture that employs this mechanism to sparsely activate an attention module based on the representations learned from a linear state-space model. The input is sparsely selected into a First-In-First-Out memory that is dynamically updated throughout the generation process. We show that our model achieves significantly better training and inference quality-efficiency trade-offs than state-of-the-art models on various sequence modeling tasks, while revealing the amount of attention needed for each data sample through the learned sparse activation patterns.
Lastly, to establish a large-scale baseline for future research on dynamic sparsity, we propose and scale a simple hybrid model that layer-wise combines sliding window attention and Mamba, a selective state-space model, up to 3.8 billion parameters trained on 3.2 trillion tokens. The selective gate in Mamba can be viewed as an input selection mechanism implemented with soft gating. The proposed model substantially outperforms state-of-the-art Transformer architectures on major downstream benchmarks while offering several properties desirable for efficient long-context processing: it extrapolates to unbounded sequence lengths in a zero-shot manner and achieves perfect memory recall on instruction-tuned tasks, while maintaining linear time complexity and constant memory complexity for sequence generation. (A minimal illustrative sketch of such a layer-wise hybrid block follows the record metadata below.)
- Graduation Semester
- 2024-08
- Type of Resource
- Thesis
- Handle URL
- https://hdl.handle.net/2142/125587
- Copyright and License Information
- Copyright 2024 Liliang Ren
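The final contribution in the abstract describes a layer-wise hybrid of a selective state-space sublayer and sliding window attention. The sketch below is a minimal illustration of that general design, not the thesis implementation: the simplified input-gated recurrence standing in for Mamba's selective update, the window size, the head count, and all module names are assumptions made for illustration only.

```python
# Minimal sketch (assumed, illustrative) of a layer-wise hybrid block that
# interleaves a simplified selective (input-gated) recurrence with
# sliding-window causal attention. Not the thesis implementation.
import torch
import torch.nn as nn
import torch.nn.functional as F


class SimpleSelectiveSSM(nn.Module):
    """Diagonal recurrence h_t = a_t * h_{t-1} + b_t * x_t with input-dependent
    gates, standing in for a selective state-space update (soft input selection)."""

    def __init__(self, dim: int):
        super().__init__()
        self.gate_a = nn.Linear(dim, dim)  # decay gate, computed from the input
        self.gate_b = nn.Linear(dim, dim)  # input-selection gate
        self.out = nn.Linear(dim, dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:  # x: (B, L, D)
        B, L, D = x.shape
        h = x.new_zeros(B, D)
        ys = []
        for t in range(L):  # linear time, constant state size per step
            a = torch.sigmoid(self.gate_a(x[:, t]))
            b = torch.sigmoid(self.gate_b(x[:, t]))
            h = a * h + b * x[:, t]
            ys.append(h)
        return self.out(torch.stack(ys, dim=1))


class SlidingWindowAttention(nn.Module):
    """Causal self-attention restricted to the most recent `window` positions."""

    def __init__(self, dim: int, num_heads: int = 4, window: int = 64):
        super().__init__()
        assert dim % num_heads == 0
        self.num_heads, self.window = num_heads, window
        self.qkv = nn.Linear(dim, 3 * dim)
        self.proj = nn.Linear(dim, dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:  # x: (B, L, D)
        B, L, D = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=-1)

        def split(t):
            return t.view(B, L, self.num_heads, D // self.num_heads).transpose(1, 2)

        q, k, v = split(q), split(k), split(v)
        idx = torch.arange(L, device=x.device)
        # Token i may attend to tokens j with i - window < j <= i.
        keep = (idx[None, :] <= idx[:, None]) & (idx[:, None] - idx[None, :] < self.window)
        y = F.scaled_dot_product_attention(q, k, v, attn_mask=keep)
        return self.proj(y.transpose(1, 2).reshape(B, L, D))


class HybridBlock(nn.Module):
    """One hybrid layer: recurrent sublayer, then sliding-window attention,
    each with pre-normalization and a residual connection."""

    def __init__(self, dim: int):
        super().__init__()
        self.norm1, self.norm2 = nn.LayerNorm(dim), nn.LayerNorm(dim)
        self.ssm = SimpleSelectiveSSM(dim)
        self.swa = SlidingWindowAttention(dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = x + self.ssm(self.norm1(x))
        x = x + self.swa(self.norm2(x))
        return x


if __name__ == "__main__":
    x = torch.randn(2, 128, 256)        # (batch, length, width)
    print(HybridBlock(256)(x).shape)    # torch.Size([2, 128, 256])
```

Under these assumptions, the recurrent sublayer carries a fixed-size state across time and the attention sublayer only looks at the most recent window of tokens, which is why a stack of such blocks can keep linear time and constant memory per generated token.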
Owning Collections
Graduate Dissertations and Theses at Illinois (primary)