The Illinois Retrieval Benchmark: A scalable framework for characterizing retrieval-augmented generation via automated fact-checking
Taleka, Bhagyashree
Permalink
https://hdl.handle.net/2142/129278
Description
- Title
- The Illinois Retrieval Benchmark: A scalable framework for characterizing retrieval-augmented generation via automated fact-checking
- Author(s)
- Taleka, Bhagyashree
- Issue Date
- 2025-05-07
- Director of Research (if dissertation) or Advisor (if thesis)
- Hwu, Wen-mei
- Department of Study
- Electrical & Computer Engineering
- Discipline
- Electrical & Computer Engineering
- Degree Granting Institution
- University of Illinois Urbana-Champaign
- Degree Name
- M.S.
- Degree Level
- Thesis
- Keyword(s)
- Retrieval-Augmented Generation, Large Language Models, Automated Fact Checking, Evaluation Benchmark, Scalable Dataset
- Abstract
- Retrieval-Augmented Generation (RAG) combines a retrieval module with a generation module to enhance response accuracy by incorporating external knowledge. To evaluate the robustness, faithfulness, and generation capabilities of RAG systems, numerous evaluation benchmarks have been proposed. However, these benchmarks either focus on evaluating specific modules using large-scale data that provides ground-truth documents and query embeddings without access to raw text, or they evaluate RAG outputs using large language models (LLMs) on small datasets, without assessing the behavior or interactions of individual components within the pipeline. Crucially, there exists no benchmark that connects all components of the RAG pipeline and enables systematic end-to-end analysis across retrieval, reranking, and generation. To address these gaps, we propose the Illinois Retrieval Benchmark (IRB), an end-to-end RAG characterization framework designed to enable scalable and customizable dataset generation across diverse domains and retrieval contexts. Our methodology can be used to generate scalable datasets for developing RAG pipelines and evaluating them using pertinent metrics such as context relevance, answer relevance, and answer faithfulness. We also offer a novel approach to generating ground-truth data for user queries. As part of this thesis, we developed a small dataset consisting of 57,671 unique facts, 117,234 unique questions based on these facts, and 2,462,190 document chunks in the vector database to demonstrate the usability and efficacy of our benchmark in evaluating RAG systems across diverse domains and retrieval contexts. Our experiments show that while rerankers improve answer quality while reducing context size, the effectiveness of reranking depends on the quality of base retrieval. Our ablation studies show that existing retrieval methods exhibit significant deficiencies. We also highlight the impact of document-level noise on RAG performance and emphasize the need for accurate and semantically aligned context, particularly when dealing with emerging knowledge domains. Ultimately, we argue that improving LLM performance requires enhancing the retrieval process or continuously fine-tuning models with updated data.
- Graduation Semester
- 2025-05
- Type of Resource
- Thesis
- Handle URL
- https://hdl.handle.net/2142/129278
- Copyright and License Information
- Copyright 2025 Bhagyashree Taleka
Owning Collections
Graduate Dissertations and Theses at Illinois (Primary)
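The abstract above describes scoring a RAG pipeline along three axes: context relevance, answer relevance, and answer faithfulness, over a retrieve-rerank-generate flow. The sketch below is a purely illustrative, hypothetical Python example of how such an evaluation loop can be wired together; it is not the thesis's or the IRB framework's implementation, and the toy bag-of-words similarity stands in for the dense retriever, reranker, LLM generator, and metric models an actual pipeline would use.

```python
# Hypothetical sketch of a retrieve -> rerank -> generate -> evaluate loop.
# All components are toy stand-ins, not the ones used in the thesis.
from collections import Counter
import math


def embed(text: str) -> Counter:
    # Toy bag-of-words "embedding"; a real pipeline would use a dense encoder.
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    num = sum(a[t] * b[t] for t in a)
    den = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return num / den if den else 0.0


def retrieve(question: str, chunks: list[str], k: int = 3) -> list[str]:
    # Base retrieval: rank all chunks against the question, keep top-k.
    q = embed(question)
    return sorted(chunks, key=lambda c: cosine(q, embed(c)), reverse=True)[:k]


def rerank(question: str, retrieved: list[str], k: int = 2) -> list[str]:
    # Stand-in reranker: rescores the retrieved chunks and keeps fewer of them,
    # shrinking the context passed to the generator.
    q = embed(question)
    return sorted(retrieved, key=lambda c: cosine(q, embed(c)), reverse=True)[:k]


def generate(question: str, context: list[str]) -> str:
    # Placeholder generator; in practice this would be an LLM call over the context.
    return context[0] if context else ""


def evaluate(question: str, answer: str, context: list[str], gold_fact: str) -> dict:
    q, a, ctx = embed(question), embed(answer), embed(" ".join(context))
    return {
        "context_relevance": cosine(q, ctx),   # does the retrieved context address the question?
        "answer_relevance": cosine(q, a),      # does the answer address the question?
        "answer_faithfulness": cosine(a, ctx), # is the answer grounded in the context?
        "matches_gold": cosine(a, embed(gold_fact)),
    }


if __name__ == "__main__":
    chunks = [
        "The Illinois Retrieval Benchmark characterizes RAG pipelines end to end.",
        "Rerankers can improve answer quality while reducing context size.",
        "Document-level noise degrades RAG performance.",
    ]
    question = "What do rerankers do for RAG answer quality?"
    context = rerank(question, retrieve(question, chunks))
    answer = generate(question, context)
    print(evaluate(question, answer, context, gold_fact=chunks[1]))
```

In an actual deployment, the stand-ins above would correspond to dense retrieval over the vector database of document chunks, a learned reranker, and LLM-based generation and grading; the sketch only shows where each of the three metrics sits in the pipeline.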