Temperature-centric investigation of speculative decoding with knowledge distillation
Ouyang, Siru
Permalink
https://hdl.handle.net/2142/132610
Description
Title
Temperature-centric investigation of speculative decoding with knowledge distillation
Author(s)
Ouyang, Siru
Issue Date
2025-04-10
Director of Research (if dissertation) or Advisor (if thesis)
Han, Jiawei
Department of Study
Siebel School of Computing and Data Science
Discipline
Computer Science
Degree Granting Institution
University of Illinois Urbana-Champaign
Degree Name
M.S.
Degree Level
Thesis
Keyword(s)
Speculative Decoding
Knowledge Distillation
Large Language Models
Abstract
Speculative decoding stands as a pivotal technique to expedite inference in autoregressive (large) language models. This method employs a smaller draft model to speculate a block of tokens, which the target model then evaluates for acceptance. Despite a wealth of studies aimed at increasing the efficiency of speculative decoding, the influence of generation configurations on the decoding process remains poorly understood, especially concerning decoding temperatures.
This paper delves into the effects of decoding temperatures on the efficacy of speculative decoding. Beginning with knowledge distillation (KD), we first highlight the challenge of decoding at higher temperatures, and demonstrate that performing KD at a consistent temperature setting can serve as a remedy. We also investigate the effects of out-of-domain test sets with out-of-range temperatures. Building upon these findings, we take an initial step toward further improving the speedup of speculative decoding, particularly in high-temperature generation settings. Our work offers new insights into how generation configurations drastically affect the performance of speculative decoding, and underscores the need for methods that account for diverse decoding configurations. Code is publicly available at https://github.com/ozyyshr/TempSpec.
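For illustration only, below is a minimal sketch of one draft-then-verify step of standard speculative decoding with temperature-scaled sampling, as described in the abstract. The callables draft_model and target_model, the block size gamma, and the helper _probs are generic assumptions standing in for the small and large language models' next-token logits; this is not the specific method or code released in the repository above.

```python
import torch

def _probs(logits, temperature):
    # Temperature-scaled softmax over a 1-D next-token logits vector.
    return torch.softmax(logits / max(temperature, 1e-6), dim=-1)

def speculative_step(draft_model, target_model, prefix, gamma=4, temperature=1.0):
    """One draft-then-verify step: propose gamma tokens with the draft model,
    then accept or reject them against the target model's distribution."""
    seq = list(prefix)
    draft_tokens, draft_dists = [], []
    for _ in range(gamma):
        q = _probs(draft_model(seq), temperature)          # draft distribution
        tok = torch.multinomial(q, num_samples=1).item()   # sample a drafted token
        draft_tokens.append(tok)
        draft_dists.append(q)
        seq.append(tok)

    accepted = []
    for i, tok in enumerate(draft_tokens):
        p = _probs(target_model(list(prefix) + accepted), temperature)  # target distribution
        q = draft_dists[i]
        # Accept the drafted token with probability min(1, p(tok) / q(tok)).
        if torch.rand(()) < (p[tok] / q[tok].clamp_min(1e-10)):
            accepted.append(tok)
        else:
            # Rejected: resample from the residual distribution and stop this block.
            residual = (p - q).clamp_min(0.0)
            residual = residual / residual.sum()
            accepted.append(torch.multinomial(residual, num_samples=1).item())
            return accepted

    # All gamma drafted tokens accepted: sample one bonus token from the target model.
    p = _probs(target_model(list(prefix) + accepted), temperature)
    accepted.append(torch.multinomial(p, num_samples=1).item())
    return accepted
```

In this sketch, draft_model and target_model can be any functions mapping a token-id sequence to a length-V logits tensor; in practice they would wrap the forward passes of the small and large language models. The accept/reject rule shown preserves the target model's sampling distribution at the given temperature.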