Improving neuron-level interpretability with white-box language models

Bai, Hao

This item's files can only be accessed by the System Administrators group.

Permalink

https://hdl.handle.net/2142/129717

Description

Title

Improving neuron-level interpretability with white-box language models

Author(s)

Bai, Hao

Issue Date

2025-04-24

Director of Research (if dissertation) or Advisor (if thesis)

Jiang, Nan

Department of Study

Siebel School Comp & Data Sci

Discipline

Computer Science

Degree Granting Institution

University of Illinois Urbana-Champaign

Degree Name

M.S.

Degree Level

Thesis

Keyword(s)

Neuron-level interpretability
language models
representation learning

Language

eng

Abstract

Neurons in auto-regressive language models like GPT-2 can be interpreted by analyzing their activation patterns. Recent studies have shown that techniques such as dictionary learning, a form of post-hoc sparse coding, enhance this neuron-level interpretability. In our research, we are driven by the goal to fundamentally improve neural network interpretability by embedding sparse coding directly within the model architecture, rather than applying it as an afterthought. In our study, we introduce a white-box transformer-like architecture named Coding RAte TransformEr (CRATE), explicitly engineered to capture sparse, low-dimensional structures within data distributions. Our comprehensive experiments showcase significant improvements (up to 103% relative improvement) in neuron-level interpretability across a variety of evaluation metrics. Detailed investigations confirm that this enhanced interpretability is steady across different layers irrespective of the model size, underlining CRATE’s robust performance in enhancing neural network interpretability. Further analysis shows that CRATE’s increased interpretability comes from its enhanced ability to consistently and distinctively activate on relevant tokens. These findings point towards a promising direction for creating white-box foundation models that excel in neuron-level interpretation.

Graduation Semester

2025-05

Type of Resource

Thesis

Handle URL

https://hdl.handle.net/2142/129717

Owning Collections

Graduate Dissertations and Theses at Illinois PRIMARY

Graduate Theses and Dissertations at Illinois

Dissertations and Theses - Computer Science

Dissertations and Theses from the Siebel School of Computer Science
