Improving neuron-level interpretability with white-box language models
Bai, Hao
Permalink
https://hdl.handle.net/2142/129717
Description
Title
Improving neuron-level interpretability with white-box language models
Author(s)
Bai, Hao
Issue Date
2025-04-24
Director of Research (if dissertation) or Advisor (if thesis)
Jiang, Nan
Department of Study
Siebel School Comp & Data Sci
Discipline
Computer Science
Degree Granting Institution
University of Illinois Urbana-Champaign
Degree Name
M.S.
Degree Level
Thesis
Keyword(s)
Neuron-level interpretability
language models
representation learning
Language
eng
Abstract
Neurons in auto-regressive language models such as GPT-2 can be interpreted by analyzing their activation patterns. Recent studies have shown that techniques such as dictionary learning, a form of post-hoc sparse coding, enhance this neuron-level interpretability. We aim to improve neural network interpretability more fundamentally by embedding sparse coding directly within the model architecture rather than applying it as an afterthought. To this end, we introduce a white-box transformer-like architecture named Coding RAte TransformEr (CRATE), explicitly engineered to capture sparse, low-dimensional structures within data distributions. Our experiments show significant improvements (up to 103% relative improvement) in neuron-level interpretability across a variety of evaluation metrics. Detailed investigations confirm that this enhanced interpretability is consistent across layers regardless of model size, underlining CRATE’s robustness. Further analysis shows that CRATE’s increased interpretability comes from its enhanced ability to activate consistently and distinctively on relevant tokens. These findings point towards a promising direction for creating white-box foundation models that excel in neuron-level interpretation.