Enhancing knowledge distillation in large language models via domain adaptation
Zhang, Xitong (Jacqueline)
Permalink
https://hdl.handle.net/2142/132689
Description
Title
Enhancing knowledge distillation in large language models via domain adaptation
Author(s)
Zhang, Xitong (Jacqueline)
Issue Date
2025-12-08
Director of Research (if dissertation) or Advisor (if thesis)
He, Jingrui
Committee Member(s)
Ma, Jiaqi
Department of Study
Information Sciences
Discipline
Bioinformatics
Degree Granting Institution
University of Illinois Urbana-Champaign
Degree Name
M.S.
Degree Level
Thesis
Keyword(s)
Knowledge Distillation
Large Language Models
Deep Learning
Machine Learning
Artificial Intelligence
Abstract
Domain-Adaptive Pre-Training (DAPT) is widely used to improve Large Language Models on specialized domains, yet its interaction with knowledge distillation (KD) remains poorly understood. In particular, intermediate DAPT checkpoints are rarely analyzed, and the evolution of teacher uncertainty across such checkpoints has not been systematically studied. This thesis develops a unified framework to examine how DAPT reshapes teacher confidence and how these shifts influence KD effectiveness, downstream performance, and calibration. Using LLaMA-2-7B teachers adapted for 2,000, 5,000, 7,500, and 10,000 DAPT steps, together with Sheared-LLaMA-1.3B students distilled under four KD variants, we evaluate on two biomedical QA benchmarks: PubMedQA and BioASQ. We analyze teacher entropy, entropy–performance correlations, and student Expected Calibration Error (ECE) across checkpoints.
Our findings reveal three key insights: (1) teacher entropy shifts moderately with deeper DAPT but redistributes most strongly over semantically informative tokens; (2) moderate entropy reduction yields the strongest KD gains for abstractive reasoning tasks such as PubMedQA, whereas extractive QA tasks benefit more from heavily domain-adapted teachers whose predictions are sharper and more concentrated; and (3) student calibration closely tracks teacher entropy, with sharper teachers generally producing better-calibrated students, though excessively low entropy can introduce calibration trade-offs.
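The two quantities the abstract analyzes — token-level teacher entropy and student Expected Calibration Error — have standard definitions that can be sketched as follows. This is an illustrative sketch only, not code from the thesis; the function names and the choice of 10 equal-width confidence bins are assumptions.

```python
import numpy as np

def token_entropy(probs):
    """Shannon entropy (in nats) of each token's predictive distribution.

    probs: array of shape (num_tokens, vocab_size); each row sums to 1.
    Returns an array of shape (num_tokens,). Lower values indicate a
    sharper (more confident) teacher distribution at that position.
    """
    eps = 1e-12  # guard against log(0)
    return -(probs * np.log(probs + eps)).sum(axis=-1)

def expected_calibration_error(confidences, correct, n_bins=10):
    """Binned ECE: confidence-vs-accuracy gap, weighted by bin occupancy.

    confidences: model's top-class probability per prediction, in [0, 1].
    correct: 1 if the prediction was right, else 0.
    """
    confidences = np.asarray(confidences, dtype=float)
    correct = np.asarray(correct, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        mask = (confidences > lo) & (confidences <= hi)
        if mask.any():
            # |mean accuracy - mean confidence| in this bin, weighted
            # by the fraction of predictions falling into the bin.
            ece += mask.mean() * abs(correct[mask].mean()
                                     - confidences[mask].mean())
    return ece
```

For example, a uniform two-way distribution gives entropy ln 2 ≈ 0.693, while a one-hot distribution gives (numerically) zero; an ECE of 0 means the model's stated confidence matches its empirical accuracy in every bin.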