Evaluating long context code understanding of large language models
Tian, Jia Le
Permalink
https://hdl.handle.net/2142/127252
Description
- Title: Evaluating long context code understanding of large language models
- Author(s): Tian, Jia Le
- Issue Date: 2024-12-04
- Director of Research (if dissertation) or Advisor (if thesis): Zhang, Lingming
- Department of Study: Siebel School Comp & Data Sci
- Discipline: Computer Science
- Degree Granting Institution: University of Illinois at Urbana-Champaign
- Degree Name: M.S.
- Degree Level: Thesis
- Keyword(s): Large Language Models; Language and Computation; Software Engineering; Machine Learning
- Abstract: The rise and adoption of Large Language Models (LLMs) has ushered in a new era of Software Engineering, in which LLMs sit at the center of the development cycle. Amid this rapid growth, the need to quantify the code understanding of models is ever more critical. Code understanding is a foundational skill underlying many downstream coding tasks and a necessary ability for LLMs to possess if they are to reach their full coding potential. To benchmark the code understanding capabilities of LLMs, this work introduces the Searching Needle Function (SNF) benchmark, a set of 500 code retrieval problems spanning five programming languages (Python, Java, TypeScript, Rust, and C++). Inspired by the well-known Needle in a Haystack (NIH) task, the Searching Needle Function task emphasizes the long context code understanding capabilities of models through retrieval within a long input. Unlike the NIH task, however, the SNF task is realistic: it is composed entirely of code from popular real-world open-source repositories in each language. The SNF task involves retrieving a needle function in its entirety given only (i) a natural language description and (ii) a long context containing the needle function (a minimal sketch of this retrieval check appears after the record fields below). This work details the construction process and evaluation criteria for the Searching Needle Function task, and it evaluates 33 state-of-the-art open- and closed-source models on the benchmark. This extensive evaluation yields a comprehensive ranking of current state-of-the-art models. More specifically, the evaluation found that (i) closed-source models vastly outperform open-source models at long context code understanding, (ii) TypeScript and Java are the languages current LLMs understand best, and (iii) removing comments from the input code context can improve the overall performance of models.
- Graduation Semester: 2024-12
- Type of Resource: Thesis
- Handle URL: https://hdl.handle.net/2142/127252
- Copyright and License Information: Copyright 2024 Jia Le Tian
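As the abstract describes, each SNF evaluation step reduces to prompting a model with a natural language description plus a long code context, then checking whether the needle function is reproduced in its entirety. The Python sketch below illustrates one plausible reading of that step. All names here (SNFProblem, build_prompt, is_correct) and the whitespace-normalized containment check are illustrative assumptions, not the thesis's actual harness or scoring criteria.

from dataclasses import dataclass


@dataclass
class SNFProblem:
    description: str       # natural language description of the needle
    needle_function: str   # ground-truth function source, verbatim
    haystack: str          # long code context that contains the needle


def build_prompt(problem: SNFProblem) -> str:
    # The model sees only (i) the description and (ii) the long context;
    # it must return the needle function in its entirety.
    return (
        "Find and output, verbatim, the single function in the code below "
        "that matches this description.\n\n"
        f"Description: {problem.description}\n\n"
        f"Code context:\n{problem.haystack}\n"
    )


def normalize(code: str) -> str:
    # Lenient whitespace normalization before comparison (an assumption;
    # the thesis defines its own evaluation criteria).
    return "\n".join(line.rstrip() for line in code.strip().splitlines())


def is_correct(model_output: str, problem: SNFProblem) -> bool:
    # Count the retrieval as correct only if the needle function is
    # reproduced in full somewhere in the model's output.
    return normalize(problem.needle_function) in normalize(model_output)

Verbatim containment after whitespace normalization is just one way to operationalize "retrieving a needle function in its entirety"; the thesis specifies its own construction process and evaluation criteria.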
Owning Collections
Graduate Dissertations and Theses at Illinois (primary)