Improving neuron-level interpretability with white-box language models
Bai, Hao
This item's files can only be accessed by the System Administrators group.
Permalink
https://hdl.handle.net/2142/129717
Description
- Title
- Improving neuron-level interpretability with white-box language models
- Author(s)
- Bai, Hao
- Issue Date
- 2025-04-24
- Director of Research (if dissertation) or Advisor (if thesis)
- Jiang, Nan
- Department of Study
- Siebel School Comp & Data Sci
- Discipline
- Computer Science
- Degree Granting Institution
- University of Illinois Urbana-Champaign
- Degree Name
- M.S.
- Degree Level
- Thesis
- Keyword(s)
- Neuron-level interpretability, language models, representation learning
- Abstract
- Neurons in auto-regressive language models such as GPT-2 can be interpreted by analyzing their activation patterns. Recent studies have shown that techniques such as dictionary learning, a form of post-hoc sparse coding, enhance this neuron-level interpretability. We aim to improve neural network interpretability more fundamentally by embedding sparse coding directly in the model architecture rather than applying it after the fact. We introduce a white-box transformer-like architecture named Coding RAte TransformEr (CRATE), explicitly engineered to capture sparse, low-dimensional structures within data distributions. Our experiments show significant improvements (up to 103% relative improvement) in neuron-level interpretability across a variety of evaluation metrics. Detailed investigations confirm that this enhanced interpretability holds across layers regardless of model size, underscoring the robustness of CRATE’s gains. Further analysis shows that CRATE’s increased interpretability comes from its enhanced ability to activate consistently and distinctively on relevant tokens. These findings point toward a promising direction for creating white-box foundation models that excel in neuron-level interpretation.
- Graduation Semester
- 2025-05
- Type of Resource
- Thesis
- Handle URL
- https://hdl.handle.net/2142/129717
- Copyright and License Information
- Copyright 2025 Hao Bai
Owning Collections
Graduate Dissertations and Theses at Illinois PRIMARY
Graduate Theses and Dissertations at Illinois