Evaluating long context code understanding of large language models
Tian, Jia Le
Permalink
https://hdl.handle.net/2142/127252
Description
- Title: Evaluating long context code understanding of large language models
- Author(s): Tian, Jia Le
- Issue Date: 2024-12-04
- Director of Research (if dissertation) or Advisor (if thesis): Zhang, Lingming
- Department of Study: Siebel School Comp & Data Sci
- Discipline: Computer Science
- Degree Granting Institution: University of Illinois at Urbana-Champaign
- Degree Name: M.S.
- Degree Level: Thesis
- Keyword(s): Large Language Models; Language and Computation; Software Engineering; Machine Learning
- Abstract: The rise and adoption of Large Language Models (LLMs) has ushered in a new era of Software Engineering, in which LLMs sit at the center of the development cycle. Amid this rapid growth, the need to quantify the code understanding of models is ever more critical. Code understanding is a foundational skill underlying many downstream coding tasks and a necessary ability for LLMs to possess if they are to reach their full coding potential. To benchmark the code understanding capabilities of LLMs, this work introduces the Searching Needle Function (SNF) benchmark, a set of 500 code retrieval problems spanning five programming languages (Python, Java, TypeScript, Rust, and C++). Inspired by the well-known Needle in a Haystack (NIH) task, the Searching Needle Function task emphasizes the long context code understanding capabilities of models through retrieval within a long input. Unlike the NIH task, however, the SNF task is realistic: it is composed entirely of code from popular real-world open-source repositories in each language. The SNF task involves retrieving a needle function in its entirety given only (i) a natural language description and (ii) a long context containing the needle function (a minimal sketch of this retrieval check appears after the record fields below). This work details the construction process and evaluation criteria for the Searching Needle Function task, and it evaluates 33 state-of-the-art open- and closed-source models on the benchmark. This extensive evaluation yields a comprehensive ranking of current state-of-the-art models. More specifically, the evaluation found that (i) closed-source models vastly outperform open-source models at long context code understanding, (ii) TypeScript and Java are the languages current LLMs understand best, and (iii) removing comments from the input code context can improve the overall performance of models.
- Graduation Semester: 2024-12
- Type of Resource: Thesis
- Handle URL: https://hdl.handle.net/2142/127252
- Copyright and License Information: Copyright 2024 Jia Le Tian
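As the abstract describes, each SNF evaluation step reduces to prompting a model with a natural language description plus a long code context, then checking whether the needle function is reproduced in its entirety. The Python sketch below illustrates one plausible reading of that step. All names here (SNFProblem, build_prompt, is_correct) and the whitespace-normalized containment check are illustrative assumptions, not the thesis's actual harness or scoring criteria.

from dataclasses import dataclass


@dataclass
class SNFProblem:
    description: str       # natural language description of the needle
    needle_function: str   # ground-truth function source, verbatim
    haystack: str          # long code context that contains the needle


def build_prompt(problem: SNFProblem) -> str:
    # The model sees only (i) the description and (ii) the long context;
    # it must return the needle function in its entirety.
    return (
        "Find and output, verbatim, the single function in the code below "
        "that matches this description.\n\n"
        f"Description: {problem.description}\n\n"
        f"Code context:\n{problem.haystack}\n"
    )


def normalize(code: str) -> str:
    # Lenient whitespace normalization before comparison (an assumption;
    # the thesis defines its own evaluation criteria).
    return "\n".join(line.rstrip() for line in code.strip().splitlines())


def is_correct(model_output: str, problem: SNFProblem) -> bool:
    # Count the retrieval as correct only if the needle function is
    # reproduced in full somewhere in the model's output.
    return normalize(problem.needle_function) in normalize(model_output)

Verbatim containment after whitespace normalization is just one way to operationalize "retrieving a needle function in its entirety"; the thesis specifies its own construction process and evaluation criteria.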
Owning Collections
Graduate Dissertations and Theses at Illinois (primary)