Harnessing large language models for software engineering
Wei, Yuxiang
This item is only available for download by members of the University of Illinois community.
Permalink
https://hdl.handle.net/2142/127392
Description
Title
Harnessing large language models for software engineering
Author(s)
Wei, Yuxiang
Issue Date
2024-12-06
Director of Research (if dissertation) or Advisor (if thesis)
Zhang, Lingming
Department of Study
Siebel School of Computing and Data Science
Discipline
Computer Science
Degree Granting Institution
University of Illinois at Urbana-Champaign
Degree Name
M.S.
Degree Level
Thesis
Keyword(s)
Large Language Models
Software Engineering
Program Synthesis
Automated Program Repair
Abstract
Automatic code generation, often referred to as program synthesis, has seen remarkable progress with the emergence of large language models (LLMs) trained on extensive corpora of text and programming-related data. This thesis examines the role LLMs play in contemporary software engineering, focusing on three tasks: solving programming exercises, infilling code within existing projects, and automated program repair (APR). We begin by examining how effectively LLMs address self-contained programming challenges, showing that solution quality improves significantly when iterative techniques such as step-by-step code synthesis and feedback-driven refinement are applied. We then present PairCoder, a novel framework that takes a collaborative approach to code generation: two LLMs work together in distinct roles, one acting as the driver that generates candidate solutions while the other serves as the navigator, providing guidance, critique, and suggestions. This iterative interplay brings multiple perspectives to the problem-solving process and yields more accurate and robust code. For code infilling, we critically analyze existing methods for generating synthetic training data and highlight their limitations in complex, real-world scenarios; to address these shortcomings, we propose adaptations that allow LLMs to better understand and operate within intricate codebases. Lastly, we evaluate the capabilities of LLMs for automated program repair, including a comparative analysis of models specialized for code infilling versus those designed for sequence-to-sequence generation. We also perform data contamination analyses on popular benchmark datasets to understand how overlap between training and testing data affects reported model performance.
Through these investigations, this thesis highlights the promising potential of LLMs in addressing a wide range of software engineering challenges and offers insights into how these models can be further refined and optimized for real-world applications.
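The driver/navigator collaboration described in the abstract can be sketched as a simple feedback loop: the driver proposes code, candidate solutions are executed against tests, and the navigator turns failures into critique for the next round. The sketch below is a minimal illustration under stated assumptions, not the thesis's actual implementation — `driver` and `navigator` are stand-in stubs for real LLM calls, and the example task (`absdiff`) is hypothetical.

```python
# Minimal sketch of a PairCoder-style driver/navigator loop.
# `driver` and `navigator` are illustrative stubs standing in for LLM calls;
# a real system would prompt two model instances with distinct role prompts.

def run_tests(code: str, tests) -> list[str]:
    """Execute candidate code against unit tests; return failure messages."""
    failures = []
    namespace = {}
    try:
        exec(code, namespace)
        for name, args, expected in tests:
            got = namespace[name](*args)
            if got != expected:
                failures.append(f"{name}{args} -> {got}, expected {expected}")
    except Exception as e:
        failures.append(str(e))
    return failures

def driver(problem: str, feedback: str) -> str:
    # Stub driver: proposes a fixed candidate, revised once feedback arrives.
    if "negative" in feedback:
        return "def absdiff(a, b):\n    return abs(a - b)\n"
    return "def absdiff(a, b):\n    return a - b\n"

def navigator(failures: list[str]) -> str:
    # Stub navigator: turns test failures into a short critique.
    return "handle negative differences" if failures else ""

def pair_code(problem: str, tests, max_rounds: int = 3) -> str:
    feedback, code = "", ""
    for _ in range(max_rounds):
        code = driver(problem, feedback)    # driver proposes a solution
        failures = run_tests(code, tests)   # execute candidate against tests
        if not failures:
            return code                     # accepted: all tests pass
        feedback = navigator(failures)      # navigator critiques; iterate
    return code

tests = [("absdiff", (2, 5), 3), ("absdiff", (5, 2), 3)]
solution = pair_code("absolute difference of two ints", tests)
```

In this toy run the first candidate fails one test, the navigator's critique steers the driver to a corrected version, and the loop terminates once all tests pass — the same accept/revise dynamic the abstract attributes to the two-model setup.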