Dynamic sparsity: enabling efficient, interpretable and generalizable sequence models
Ren, Liliang
Permalink
https://hdl.handle.net/2142/125587
Description
- Title
- Dynamic sparsity: enabling efficient, interpretable and generalizable sequence models
- Author(s)
- Ren, Liliang
- Issue Date
- 2024-07-12
- Director of Research (if dissertation) or Advisor (if thesis)
- Zhai, ChengXiang
- Doctoral Committee Chair(s)
- Zhai, ChengXiang
- Committee Member(s)
- Peng, Hao
- Zhao, Han
- Chang, Kevin Chen-Chuan
- Liu, Yang
- Department of Study
- Siebel Computing & Data Science
- Discipline
- Computer Science
- Degree Granting Institution
- University of Illinois at Urbana-Champaign
- Degree Name
- Ph.D.
- Degree Level
- Dissertation
- Keyword(s)
- Dynamic Sparsity
- Neural Network
- Deep Learning Architecture
- Natural Language Processing
- Machine Learning
- Abstract
- As humans, we consume natural signals and generate structured information to understand our world and propagate knowledge. Equipped with dynamically and sparsely activated neural networks, we learn efficiently, make interpretable decisions, and quickly generalize learned knowledge to unseen scenarios. Recent advances in large language models (LLMs) have shown impressive results from artificial neural networks in understanding the real world through sequential data. Inspired by the dynamic sparsity present in biological neural networks, this thesis aims to improve the fundamental architecture of artificial sequence models so that they can complete tasks with improved efficiency, interpretability, and generalizability.
We begin by investigating how to develop efficient neural architectures that extract structured information from natural text sequences under human supervision. In particular, we focus on extracting a knowledge graph that provides a sparse representation of the relationships among the important entities in a sentence. We show that a neural network that generates an alternating sequence of nodes and edges with hybrid span decoding can achieve state-of-the-art performance with linear time and space complexity.
Supervised learning of knowledge graph extraction requires expensive human effort for tagging relations and entity types in text sequences. This problem becomes more severe in the scientific discovery domain because of the cost of finding capable human experts for data annotation. We propose a pre-training objective, together with a neural sequence model, to sparsely and dynamically extract sentence-level keyword representations with diverse latent types. We show that our model learns interpretable tagging of sentence elements at the latent level without using any human annotations, and that its learned knowledge can be quickly adapted to new domains with few labels.
We further explore how to introduce dynamic sparsity into general sequence models for efficient long-sequence modeling. We first propose a general mechanism that enables neural networks to sparsely and dynamically activate submodules for sequence elements with end-to-end differentiability. We then design a neural architecture that employs this mechanism to sparsely activate an attention module based on the representations learned from a linear state-space model. The input is sparsely selected into a First-In-First-Out memory that is dynamically updated throughout the generation process. We show that our model achieves significantly better training and inference quality-efficiency trade-offs than state-of-the-art models on various sequence modeling tasks, while revealing the amount of attention needed for each data sample through the learned sparse activation patterns.
Lastly, to establish a large-scale baseline for future research on dynamic sparsity, we propose and scale a simple hybrid model that layer-wise combines sliding window attention and Mamba, a selective state-space model, up to 3.8 billion parameters trained on 3.2 trillion tokens. The selective gate in Mamba can be viewed as an input selection mechanism implemented with soft gating. The proposed model substantially outperforms state-of-the-art Transformer architectures on major downstream benchmarks while offering several properties desirable for efficient long-context processing: it extrapolates to unbounded sequence lengths in a zero-shot manner and achieves perfect memory recall on instruction-tuned tasks, while maintaining linear time complexity and constant memory complexity for sequence generation. (A minimal illustrative sketch of such a layer-wise hybrid block follows the record metadata below.)
- Graduation Semester
- 2024-08
- Type of Resource
- Thesis
- Handle URL
- https://hdl.handle.net/2142/125587
- Copyright and License Information
- Copyright 2024 Liliang Ren
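The final contribution in the abstract describes a layer-wise hybrid of a selective state-space sublayer and sliding window attention. The sketch below is a minimal illustration of that general design, not the thesis implementation: the simplified input-gated recurrence standing in for Mamba's selective update, the window size, the head count, and all module names are assumptions made for illustration only.

```python
# Minimal sketch (assumed, illustrative) of a layer-wise hybrid block that
# interleaves a simplified selective (input-gated) recurrence with
# sliding-window causal attention. Not the thesis implementation.
import torch
import torch.nn as nn
import torch.nn.functional as F


class SimpleSelectiveSSM(nn.Module):
    """Diagonal recurrence h_t = a_t * h_{t-1} + b_t * x_t with input-dependent
    gates, standing in for a selective state-space update (soft input selection)."""

    def __init__(self, dim: int):
        super().__init__()
        self.gate_a = nn.Linear(dim, dim)  # decay gate, computed from the input
        self.gate_b = nn.Linear(dim, dim)  # input-selection gate
        self.out = nn.Linear(dim, dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:  # x: (B, L, D)
        B, L, D = x.shape
        h = x.new_zeros(B, D)
        ys = []
        for t in range(L):  # linear time, constant state size per step
            a = torch.sigmoid(self.gate_a(x[:, t]))
            b = torch.sigmoid(self.gate_b(x[:, t]))
            h = a * h + b * x[:, t]
            ys.append(h)
        return self.out(torch.stack(ys, dim=1))


class SlidingWindowAttention(nn.Module):
    """Causal self-attention restricted to the most recent `window` positions."""

    def __init__(self, dim: int, num_heads: int = 4, window: int = 64):
        super().__init__()
        assert dim % num_heads == 0
        self.num_heads, self.window = num_heads, window
        self.qkv = nn.Linear(dim, 3 * dim)
        self.proj = nn.Linear(dim, dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:  # x: (B, L, D)
        B, L, D = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=-1)

        def split(t):
            return t.view(B, L, self.num_heads, D // self.num_heads).transpose(1, 2)

        q, k, v = split(q), split(k), split(v)
        idx = torch.arange(L, device=x.device)
        # Token i may attend to tokens j with i - window < j <= i.
        keep = (idx[None, :] <= idx[:, None]) & (idx[:, None] - idx[None, :] < self.window)
        y = F.scaled_dot_product_attention(q, k, v, attn_mask=keep)
        return self.proj(y.transpose(1, 2).reshape(B, L, D))


class HybridBlock(nn.Module):
    """One hybrid layer: recurrent sublayer, then sliding-window attention,
    each with pre-normalization and a residual connection."""

    def __init__(self, dim: int):
        super().__init__()
        self.norm1, self.norm2 = nn.LayerNorm(dim), nn.LayerNorm(dim)
        self.ssm = SimpleSelectiveSSM(dim)
        self.swa = SlidingWindowAttention(dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = x + self.ssm(self.norm1(x))
        x = x + self.swa(self.norm2(x))
        return x


if __name__ == "__main__":
    x = torch.randn(2, 128, 256)        # (batch, length, width)
    print(HybridBlock(256)(x).shape)    # torch.Size([2, 128, 256])
```

Under these assumptions, the recurrent sublayer carries a fixed-size state across time and the attention sublayer only looks at the most recent window of tokens, which is why a stack of such blocks can keep linear time and constant memory per generated token.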
Owning Collections
Graduate Dissertations and Theses at Illinois (primary)