Harnessing large language models for software engineering
Wei, Yuxiang
This item is only available for download by members of the University of Illinois community.
Permalink
https://hdl.handle.net/2142/127392
Description
Title
Harnessing large language models for software engineering
Author(s)
Wei, Yuxiang
Issue Date
2024-12-06
Director of Research (if dissertation) or Advisor (if thesis)
Zhang, Lingming
Department of Study
Siebel School of Computing and Data Science
Discipline
Computer Science
Degree Granting Institution
University of Illinois at Urbana-Champaign
Degree Name
M.S.
Degree Level
Thesis
Keyword(s)
Large Language Models
Software Engineering
Program Synthesis
Automated Program Repair
Abstract
Automatic code generation, often referred to as program synthesis, has seen remarkable progress with the emergence of large language models (LLMs) trained on extensive corpora of text and programming-related data. This thesis examines the role LLMs play in contemporary software engineering, focusing on three tasks: solving programming exercises, infilling code within existing projects, and automated program repair (APR). We begin by examining how effectively LLMs address self-contained programming challenges, showing that solution quality improves significantly when iterative techniques such as step-by-step code synthesis and feedback-driven refinement are applied. We then present PairCoder, a novel framework that takes a collaborative approach to code generation: two LLMs work together in distinct roles, one acting as the driver that generates candidate solutions while the other serves as the navigator, providing guidance, critique, and suggestions. This iterative interplay brings multiple perspectives to the problem-solving process and yields more accurate and robust code. For code infilling, we critically analyze existing methods for generating synthetic training data and highlight their limitations in complex, real-world scenarios; to address these shortcomings, we propose adaptations that allow LLMs to better understand and operate within intricate codebases. Lastly, we evaluate the capabilities of LLMs for automated program repair, including a comparative analysis of models specialized for code infilling versus those designed for sequence-to-sequence generation. We also perform data contamination analyses on popular benchmark datasets to understand how overlap between training and testing data affects reported model performance.
Through these investigations, this thesis highlights the promising potential of LLMs in addressing a wide range of software engineering challenges and offers insights into how these models can be further refined and optimized for real-world applications.
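The driver/navigator collaboration described in the abstract can be sketched as a simple feedback loop: the driver proposes code, candidate solutions are executed against tests, and the navigator turns failures into critique for the next round. The sketch below is a minimal illustration under stated assumptions, not the thesis's actual implementation — `driver` and `navigator` are stand-in stubs for real LLM calls, and the example task (`absdiff`) is hypothetical.

```python
# Minimal sketch of a PairCoder-style driver/navigator loop.
# `driver` and `navigator` are illustrative stubs standing in for LLM calls;
# a real system would prompt two model instances with distinct role prompts.

def run_tests(code: str, tests) -> list[str]:
    """Execute candidate code against unit tests; return failure messages."""
    failures = []
    namespace = {}
    try:
        exec(code, namespace)
        for name, args, expected in tests:
            got = namespace[name](*args)
            if got != expected:
                failures.append(f"{name}{args} -> {got}, expected {expected}")
    except Exception as e:
        failures.append(str(e))
    return failures

def driver(problem: str, feedback: str) -> str:
    # Stub driver: proposes a fixed candidate, revised once feedback arrives.
    if "negative" in feedback:
        return "def absdiff(a, b):\n    return abs(a - b)\n"
    return "def absdiff(a, b):\n    return a - b\n"

def navigator(failures: list[str]) -> str:
    # Stub navigator: turns test failures into a short critique.
    return "handle negative differences" if failures else ""

def pair_code(problem: str, tests, max_rounds: int = 3) -> str:
    feedback, code = "", ""
    for _ in range(max_rounds):
        code = driver(problem, feedback)    # driver proposes a solution
        failures = run_tests(code, tests)   # execute candidate against tests
        if not failures:
            return code                     # accepted: all tests pass
        feedback = navigator(failures)      # navigator critiques; iterate
    return code

tests = [("absdiff", (2, 5), 3), ("absdiff", (5, 2), 3)]
solution = pair_code("absolute difference of two ints", tests)
```

In this toy run the first candidate fails one test, the navigator's critique steers the driver to a corrected version, and the loop terminates once all tests pass — the same accept/revise dynamic the abstract attributes to the two-model setup.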