Methods for generating visual programs with optimizable vision models

Levine, Joshua

Permalink

https://hdl.handle.net/2142/124357

Description

Title

Methods for generating visual programs with optimizable vision models

Author(s)

Levine, Joshua

Issue Date

2024-04-29

Director of Research (if dissertation) or Advisor (if thesis)

Hoiem, Derek

Department of Study

Computer Science

Discipline

Computer Science

Degree Granting Institution

University of Illinois at Urbana-Champaign

Degree Name

M.S.

Degree Level

Thesis

Keyword(s)

Visual Programming
Visual Question Answering
Large Language Models
Computer Vision
Program Generation

Language

eng

Abstract

End-to-end vision-language models often fail on compositional tasks, which motivates alternative approaches to more complex problem-solving. Building on the visual programming paradigm, we propose a method that composes foundational vision models through program generation. We investigate prompting and execution strategies that let trainable large language models (LLMs) synthesize fine-tunable code, with the aim of producing programs that are more effective at solving vision-language tasks. Drawing on the compositional reasoning capabilities of LLMs, we use pre-trained LLMs to write programs from a catalog of pre-defined atomic functions. These atomic functions, implemented with pre-trained vision models, serve as the building blocks of the generated visual programs. Our methodology supports programs in various formats while always retaining the ability to fine-tune both the constituent vision models and the LLM code generator. This study concentrates on image-based question answering, a task that requires compositional reasoning to interpret and respond to complex visual queries. We evaluate both the executability and the correctness of the generated programs. This work lays the groundwork for a subsequent investigation into the joint training of the LLMs and atomic functions, with the aim of advancing program generation and compositional reasoning in computer vision.
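
To make the idea in the abstract concrete, the following is a minimal sketch of a program built from a catalog of atomic functions and checked for executability and correctness. It is an illustration under stated assumptions, not the thesis implementation: the function names (find, count, query_attribute), the toy image representation, and the run_program helper are all hypothetical, and the atomic functions are stubs standing in for pre-trained vision models.

```python
from typing import Dict, List, Tuple

# Toy stand-in for an image: a list of labelled objects (an assumption for
# illustration; a real system would pass pixel data to vision models).
Image = List[Dict[str, str]]

# Hypothetical atomic functions. The abstract says these are implemented
# with pre-trained vision models; here they are plain stubs.
def find(image: Image, name: str) -> Image:
    """Stub detector: return the objects whose label matches `name`."""
    return [obj for obj in image if obj["label"] == name]

def count(objects: Image) -> int:
    """Return how many objects were found."""
    return len(objects)

def query_attribute(objects: Image, attribute: str) -> str:
    """Return an attribute of the first object, or 'unknown' if none."""
    return objects[0][attribute] if objects else "unknown"

CATALOG = {"find": find, "count": count, "query_attribute": query_attribute}

def run_program(source: str, image: Image) -> Tuple[bool, object]:
    """Run generated program text; return (executable, answer).

    Only the catalog functions and `image` are exposed to the program.
    Note: emptying builtins is not a security sandbox.
    """
    scope = {"__builtins__": {}, "image": image, **CATALOG}
    try:
        exec(source, scope)
        return True, scope.get("answer")
    except Exception:
        return False, None

image = [
    {"label": "dog", "color": "brown"},
    {"label": "dog", "color": "black"},
    {"label": "cat", "color": "white"},
]

# Two programs an LLM might produce for "How many dogs are there?".
good = 'answer = count(find(image, "dog"))'
bad = 'answer = count(locate(image, "dog"))'  # `locate` is not in the catalog

for program in (good, bad):
    executable, answer = run_program(program, image)
    correct = executable and answer == 2
    print(program, "| executable:", executable, "| correct:", correct)
```

Separating the executability check (did the program run) from the correctness check (did it return the expected answer) mirrors the two evaluation criteria named in the abstract.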

Graduation Semester

2024-05

Type of Resource

Text

Handle URL

https://hdl.handle.net/2142/124357

Owning Collections

Graduate Dissertations and Theses at Illinois (primary)

Graduate Theses and Dissertations at Illinois

Dissertations and Theses - Computer Science

Dissertations and Theses from the Siebel School of Computer Science
