Automatic short answer grading at scale using Large Language Models
Zhao, Chenyan
Permalink
https://hdl.handle.net/2142/129919
Description
- Title
- Automatic short answer grading at scale using Large Language Models
- Author(s)
- Zhao, Chenyan
- Issue Date
- 2025-07-18
- Director of Research (if dissertation) or Advisor (if thesis)
- Silva, Mariana
- Department of Study
- Siebel School of Computing and Data Science
- Discipline
- Computer Science
- Degree Granting Institution
- University of Illinois Urbana-Champaign
- Degree Name
- M.S.
- Degree Level
- Thesis
- Keyword(s)
- Automatic Short Answer Grading
- Language
- eng
- Abstract
- Providing evaluations of student work is a critical component of effective learning, and automating this process can significantly reduce the workload on human graders. Although natural language processing models long struggled with tasks involving mathematical text, recent advances have made it possible to give students instant feedback on their mathematical proofs. We first present a set of training methods and models capable of autograding freeform mathematical proofs by leveraging existing Large Language Models (LLMs) and other machine learning techniques. The models are trained on proof data collected from four different proof-by-induction problems. We compare the performance of four robust LLMs, all of which achieve satisfactory results to varying degrees. We also recruit human graders to grade the same proofs used for training, and find that the best grading model is more accurate than most human graders. Building on these grading models, we create and deploy an autograder for proof-by-induction problems and conduct a user study with students. Results from the study show that students make significant improvements to their proofs using the autograder's feedback, but that they still do not trust AI autograders as much as they trust human graders. We then present an Automatic Short Answer Grading (ASAG) pipeline leveraging state-of-the-art generative LLMs. Our LLM-based ASAG pipeline outperforms existing custom-built models on the same datasets. We also compare the grading performance of three OpenAI models: GPT-4, GPT-4o, and o1-preview. GPT-4o achieves the best balance between accuracy and cost-effectiveness, while o1-preview, despite higher accuracy, exhibits larger error variance that makes it less practical for classroom use. We investigate the effect of incorporating instructor-graded examples into prompts using three strategies: no examples, random selection, and Retrieval-Augmented Generation (RAG)-based selection. Our findings indicate that providing graded examples improves grading accuracy, with RAG-based selection outperforming random selection. Additionally, integrating grading rubrics further improves accuracy by offering a structured standard for evaluation.
- Graduation Semester
- 2025-08
- Type of Resource
- Text
- Handle URL
- https://hdl.handle.net/2142/129919
- Copyright and License Information
- Copyright 2025 Chenyan Zhao
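The abstract describes a RAG-based strategy for choosing which instructor-graded examples to place in a grading prompt alongside the rubric. The thesis's own pipeline code is not part of this record, so the following Python sketch is only an illustration of that idea under stated assumptions: a bag-of-words cosine similarity stands in for the embedding-based retrieval a real deployment would use, and all function names, data fields, and the prompt layout are hypothetical.

```python
from collections import Counter
import math


def cosine_sim(a: str, b: str) -> float:
    """Cosine similarity over bag-of-words counts.

    A crude stand-in for the embedding model a real RAG pipeline would use.
    """
    va, vb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(va[w] * vb[w] for w in va)
    na = math.sqrt(sum(c * c for c in va.values()))
    nb = math.sqrt(sum(c * c for c in vb.values()))
    return dot / (na * nb) if na and nb else 0.0


def select_examples(new_answer: str, graded_examples: list[dict], k: int = 2) -> list[dict]:
    """RAG-style selection: return the k graded examples most similar to the new answer."""
    ranked = sorted(
        graded_examples,
        key=lambda ex: cosine_sim(new_answer, ex["answer"]),
        reverse=True,
    )
    return ranked[:k]


def build_prompt(question: str, rubric: str, examples: list[dict], new_answer: str) -> str:
    """Assemble a grading prompt: question, rubric, retrieved graded examples, new answer."""
    parts = [f"Question: {question}", f"Rubric: {rubric}"]
    for ex in examples:
        parts.append(f"Example answer: {ex['answer']}\nScore: {ex['score']}")
    parts.append(f"Student answer: {new_answer}\nScore:")
    return "\n\n".join(parts)
```

The resulting prompt string would then be sent to the grading LLM; the point of the similarity-ranked selection, per the abstract's findings, is that examples resembling the new answer make better few-shot demonstrations than randomly chosen ones.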
Owning Collections
- Graduate Dissertations and Theses at Illinois (PRIMARY)
- Dissertations and Theses - Computer Science