Improving speculative retrieval-augmented generation via verifier scoring
Chen, Shixin
Permalink
https://hdl.handle.net/2142/129272
Description
- Title
- Improving speculative retrieval-augmented generation via verifier scoring
- Author(s)
- Chen, Shixin
- Issue Date
- 2025-04-30
- Director of Research (if dissertation) or Advisor (if thesis)
- Kindratenko, Volodymyr
- Department of Study
- Electrical & Computer Engineering
- Discipline
- Electrical & Computer Engineering
- Degree Granting Institution
- University of Illinois Urbana-Champaign
- Degree Name
- M.S.
- Degree Level
- Thesis
- Keyword(s)
- Speculative Decoding
- RAG
- LLMs
- Abstract
- The rapid progress of large language models (LLMs) has revolutionized various natural language processing (NLP) tasks, including text generation, question answering, and summarization. However, the computational cost and latency associated with LLMs remain significant challenges, particularly in real-time applications. Moreover, the heavy cost of fine-tuning for each downstream task is burdensome in the preparation phase. To address these issues, researchers have explored various techniques to accelerate both inference and fine-tuning without compromising the quality of the generated text. One recent breakthrough is speculative decoding, which leverages a smaller, faster "draft" model to propose multiple candidate drafts that are then verified in parallel by a larger, more capable, and more accurate "verification" model. This method significantly reduces the number of calls to the larger model, thereby speeding up inference. The drafter model is also much easier to fine-tune than larger models. In this thesis, we are inspired by an innovative extension of the speculative decoding framework that integrates Retrieval-Augmented Generation (RAG). RAG enhances the capabilities of language models by retrieving relevant documents from a knowledge base to inform the generation process. Our research presents a comprehensive analysis of Speculative RAG, which further optimizes the inference pipeline by categorizing the retrieved documents and feeding them to a smaller drafter model that generates multiple draft continuations in parallel. These drafts are then evaluated by a larger, general-purpose verifier model, which selects the most accurate and contextually appropriate continuation.
- Graduation Semester
- 2025-05
- Type of Resource
- Thesis
- Handle URL
- https://hdl.handle.net/2142/129272
- Copyright and License Information
- Copyright 2025 Shixin Chen
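The pipeline the abstract describes — a small drafter producing one draft per cluster of retrieved documents, and a large verifier scoring those drafts to pick the best — can be illustrated with a minimal sketch. The functions `drafter_generate` and `verifier_logprob` below are hypothetical stand-ins for calls to the drafter and verifier models, not part of the thesis; the toy scoring is only there to make the selection loop runnable.

```python
def drafter_generate(question: str, doc_subset: list[str]) -> str:
    # Placeholder: a small drafter model would produce a draft answer
    # conditioned on one cluster (subset) of the retrieved documents.
    return f"draft conditioned on {len(doc_subset)} docs"

def verifier_logprob(question: str, draft: str) -> float:
    # Placeholder: the larger verifier model would return the
    # log-probability it assigns to the draft given the question.
    # Toy stand-in score: shorter drafts score higher.
    return -float(len(draft))

def speculative_rag(question: str, doc_clusters: list[list[str]]) -> str:
    """Generate one draft per document cluster in parallel (conceptually),
    score each draft with the verifier, and return the highest-scoring one."""
    drafts = [drafter_generate(question, cluster) for cluster in doc_clusters]
    scores = [verifier_logprob(question, draft) for draft in drafts]
    best = max(range(len(drafts)), key=lambda i: scores[i])
    return drafts[best]

clusters = [["doc_a", "doc_b"], ["doc_c"], ["doc_d", "doc_e", "doc_f"]]
print(speculative_rag("What is speculative decoding?", clusters))
```

In a real system the drafts would be generated concurrently and the verifier would score them in a single batched forward pass, which is where the latency savings over calling the large model autoregressively come from.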
Owning Collections
Graduate Dissertations and Theses at Illinois (Primary)