Enhancing knowledge distillation in large language models via domain adaptation
Zhang, Xitong (Jacqueline)
Permalink
https://hdl.handle.net/2142/132689
Description
Title
Enhancing knowledge distillation in large language models via domain adaptation
Author(s)
Zhang, Xitong (Jacqueline)
Issue Date
2025-12-08
Director of Research (if dissertation) or Advisor (if thesis)
He, Jingrui
Committee Member(s)
Ma, Jiaqi
Department of Study
Information Sciences
Discipline
Bioinformatics
Degree Granting Institution
University of Illinois Urbana-Champaign
Degree Name
M.S.
Degree Level
Thesis
Keyword(s)
Knowledge Distillation
Large Language Models
Deep Learning
Machine Learning
Artificial Intelligence
Abstract
Domain-Adaptive Pre-Training (DAPT) is widely used to improve Large Language Models on specialized domains, yet its interaction with knowledge distillation (KD) remains poorly understood. In particular, intermediate DAPT checkpoints are rarely analyzed, and the evolution of teacher uncertainty across such checkpoints has not been systematically studied. This thesis develops a unified framework to examine how DAPT reshapes teacher confidence and how these shifts influence KD effectiveness, downstream performance, and calibration. Using LLaMA-2-7B teachers adapted for 2,000, 5,000, 7,500, and 10,000 DAPT steps, together with Sheared-LLaMA-1.3B students distilled under four KD variants, we evaluate on two biomedical QA benchmarks: PubMedQA and BioASQ. We analyze teacher entropy, entropy–performance correlations, and student Expected Calibration Error (ECE) across checkpoints.
Our findings reveal three key insights: (1) teacher entropy shifts moderately with deeper DAPT but redistributes most strongly over semantically informative tokens; (2) moderate entropy reduction yields the strongest KD gains for abstractive reasoning tasks such as PubMedQA, whereas extractive QA tasks benefit more from heavily domain-adapted teachers whose predictions are sharper and more concentrated; and (3) student calibration closely tracks teacher entropy, with sharper teachers generally producing better-calibrated students, though excessively low entropy can introduce calibration trade-offs.
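The two quantities the abstract analyzes — token-level teacher entropy and student Expected Calibration Error — have standard definitions that can be sketched as follows. This is an illustrative sketch only, not code from the thesis; the function names and the choice of 10 equal-width confidence bins are assumptions.

```python
import numpy as np

def token_entropy(probs):
    """Shannon entropy (in nats) of each token's predictive distribution.

    probs: array of shape (num_tokens, vocab_size); each row sums to 1.
    Returns an array of shape (num_tokens,). Lower values indicate a
    sharper (more confident) teacher distribution at that position.
    """
    eps = 1e-12  # guard against log(0)
    return -(probs * np.log(probs + eps)).sum(axis=-1)

def expected_calibration_error(confidences, correct, n_bins=10):
    """Binned ECE: confidence-vs-accuracy gap, weighted by bin occupancy.

    confidences: model's top-class probability per prediction, in [0, 1].
    correct: 1 if the prediction was right, else 0.
    """
    confidences = np.asarray(confidences, dtype=float)
    correct = np.asarray(correct, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        mask = (confidences > lo) & (confidences <= hi)
        if mask.any():
            # |mean accuracy - mean confidence| in this bin, weighted
            # by the fraction of predictions falling into the bin.
            ece += mask.mean() * abs(correct[mask].mean()
                                     - confidences[mask].mean())
    return ece
```

For example, a uniform two-way distribution gives entropy ln 2 ≈ 0.693, while a one-hot distribution gives (numerically) zero; an ECE of 0 means the model's stated confidence matches its empirical accuracy in every bin.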