Learning controllable visual representations: advancing spatial and 3D primitive guidance for image synthesis
Vavilala, Vaibhav
Permalink
https://hdl.handle.net/2142/129884
Description
- Title
- Learning controllable visual representations: advancing spatial and 3D primitive guidance for image synthesis
- Author(s)
- Vavilala, Vaibhav
- Issue Date
- 2025-07-16
- Director of Research (if dissertation) or Advisor (if thesis)
- Forsyth, David
- Doctoral Committee Chair(s)
- Forsyth, David
- Committee Member(s)
- Hoiem, Derek
- Gupta, Saurabh
- Fouhey, David
- Department of Study
- Computer Science
- Discipline
- Computer Science
- Degree Granting Institution
- University of Illinois Urbana-Champaign
- Degree Name
- Ph.D.
- Degree Level
- Dissertation
- Keyword(s)
- Diffusion Models, 3D Primitives, multimodal image synthesis, scene parsing
- Abstract
- Primitives have been a longstanding interest in computer vision because they can simplify reasoning about images and 3D data. Our work advances this area by providing an efficient method to obtain 3D primitive representations from any RGB image. In the chapter "Convex Decomposition of Indoor Scenes," we show how to fit 3D primitives to complex, cluttered indoor scenes, focusing on the benchmark NYUv2 dataset. We depart from classic primitive fitting, which decomposes 3D meshes, and show that a single depth map is sufficient during training and inference. Further, we demonstrate that a two-stage method at test time is effective: regression (running a neural net to obtain initial primitive predictions), followed by optimization (refining the primitives with respect to the original training losses). From there, in "Improved Convex Decomposition with Ensembling and Boolean Primitives," we scale the dataset to over a million images, demonstrating which assumptions matter for in-the-wild scenes. We then show how to fit CSG (Constructive Solid Geometry) representations, enriching the shapes we can encode. The primitive fitting problem is unusual in that the optimal number of primitives for a given test image is not known in advance. Thus, we show how ensembling at test time can help us choose. Finally, in "Generative Blocks World," we establish why primitives are useful. While text-to-image models abound, users want precise, user-friendly 3D control over their outputs. We demonstrate that our primitives are well suited to this multimodal task, combining the latest image diffusion models with our 3D primitive representations. Our method allows us to move the camera or individual objects in a 3D-aware way, using our primitives as an intermediate abstraction.
- Graduation Semester
- 2025-08
- Type of Resource
- Thesis
- Handle URL
- https://hdl.handle.net/2142/129884
- Copyright and License Information
- Copyright 2025 Vaibhav Vavilala
Owning Collections
Graduate Dissertations and Theses at Illinois PRIMARY
Graduate Theses and Dissertations at Illinois