The Illinois Retrieval Benchmark: A scalable framework for characterizing retrieval-augmented generation via automated fact-checking
Taleka, Bhagyashree
Permalink
https://hdl.handle.net/2142/129278
Description
- Title
- The Illinois Retrieval Benchmark: A scalable framework for characterizing retrieval-augmented generation via automated fact-checking
- Author(s)
- Taleka, Bhagyashree
- Issue Date
- 2025-05-07
- Director of Research (if dissertation) or Advisor (if thesis)
- Hwu, Wen-mei
- Department of Study
- Electrical & Computer Engineering
- Discipline
- Electrical & Computer Engineering
- Degree Granting Institution
- University of Illinois Urbana-Champaign
- Degree Name
- M.S.
- Degree Level
- Thesis
- Keyword(s)
- Retrieval-Augmented Generation, Large Language Models, Automated Fact Checking, Evaluation Benchmark, Scalable Dataset
- Abstract
- Retrieval-Augmented Generation (RAG) combines a retrieval module with a generation module to enhance response accuracy by incorporating external knowledge. To evaluate the robustness, faithfulness, and generation capabilities of RAG systems, numerous evaluation benchmarks have been proposed. However, these benchmarks either focus on evaluating specific modules using large-scale data that provides ground-truth documents and query embeddings without access to raw text, or they evaluate RAG outputs using large language models (LLMs) on small datasets, without assessing the behavior or interactions of individual components within the pipeline. Crucially, there exists no benchmark that connects all components of the RAG pipeline and enables systematic end-to-end analysis across retrieval, reranking, and generation. To address these gaps, we propose the Illinois Retrieval Benchmark (IRB), an end-to-end RAG characterization framework designed to enable scalable and customizable dataset generation across diverse domains and retrieval contexts. Our methodology can be used to generate scalable datasets for developing RAG pipelines and evaluating them using pertinent metrics such as context relevance, answer relevance, and answer faithfulness. We also offer a novel approach to generating ground-truth data for user queries. As part of this thesis, we developed a small dataset consisting of 57,671 unique facts, 117,234 unique questions based on these facts, and 2,462,190 document chunks in the vector database to demonstrate the usability and efficacy of our benchmark in evaluating RAG systems across diverse domains and retrieval contexts. Our experiments show that while rerankers improve answer quality while reducing context size, the effectiveness of reranking depends on the quality of base retrieval. Our ablation studies show that existing retrieval methods exhibit significant deficiencies. We also highlight the impact of document-level noise on RAG performance and emphasize the need for accurate and semantically aligned context, particularly when dealing with emerging knowledge domains. Ultimately, we argue that improving LLM performance requires enhancing the retrieval process or continuously fine-tuning models with updated data.
- Graduation Semester
- 2025-05
- Type of Resource
- Thesis
- Handle URL
- https://hdl.handle.net/2142/129278
- Copyright and License Information
- Copyright 2025 Bhagyashree Taleka
Owning Collections
Graduate Dissertations and Theses at Illinois (Primary)
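The abstract above describes scoring a RAG pipeline along three axes: context relevance, answer relevance, and answer faithfulness, over a retrieve-rerank-generate flow. The sketch below is a purely illustrative, hypothetical Python example of how such an evaluation loop can be wired together; it is not the thesis's or the IRB framework's implementation, and the toy bag-of-words similarity stands in for the dense retriever, reranker, LLM generator, and metric models an actual pipeline would use.

```python
# Hypothetical sketch of a retrieve -> rerank -> generate -> evaluate loop.
# All components are toy stand-ins, not the ones used in the thesis.
from collections import Counter
import math


def embed(text: str) -> Counter:
    # Toy bag-of-words "embedding"; a real pipeline would use a dense encoder.
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    num = sum(a[t] * b[t] for t in a)
    den = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return num / den if den else 0.0


def retrieve(question: str, chunks: list[str], k: int = 3) -> list[str]:
    # Base retrieval: rank all chunks against the question, keep top-k.
    q = embed(question)
    return sorted(chunks, key=lambda c: cosine(q, embed(c)), reverse=True)[:k]


def rerank(question: str, retrieved: list[str], k: int = 2) -> list[str]:
    # Stand-in reranker: rescores the retrieved chunks and keeps fewer of them,
    # shrinking the context passed to the generator.
    q = embed(question)
    return sorted(retrieved, key=lambda c: cosine(q, embed(c)), reverse=True)[:k]


def generate(question: str, context: list[str]) -> str:
    # Placeholder generator; in practice this would be an LLM call over the context.
    return context[0] if context else ""


def evaluate(question: str, answer: str, context: list[str], gold_fact: str) -> dict:
    q, a, ctx = embed(question), embed(answer), embed(" ".join(context))
    return {
        "context_relevance": cosine(q, ctx),   # does the retrieved context address the question?
        "answer_relevance": cosine(q, a),      # does the answer address the question?
        "answer_faithfulness": cosine(a, ctx), # is the answer grounded in the context?
        "matches_gold": cosine(a, embed(gold_fact)),
    }


if __name__ == "__main__":
    chunks = [
        "The Illinois Retrieval Benchmark characterizes RAG pipelines end to end.",
        "Rerankers can improve answer quality while reducing context size.",
        "Document-level noise degrades RAG performance.",
    ]
    question = "What do rerankers do for RAG answer quality?"
    context = rerank(question, retrieve(question, chunks))
    answer = generate(question, context)
    print(evaluate(question, answer, context, gold_fact=chunks[1]))
```

In an actual deployment, the stand-ins above would correspond to dense retrieval over the vector database of document chunks, a learned reranker, and LLM-based generation and grading; the sketch only shows where each of the three metrics sits in the pipeline.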