|Abstract:||The term 3D parsing refers to the process of segmenting and labeling the 3D space into expressive categories of voxels, point clouds or surfaces. Humans can effortlessly perceive the 3D scene and the unseen part of an object from a single image with a limited field of view. In the same sense, a robot that is designed to execute a few human-like actions should be able to infer the 3D visual world, from a single snapshot of a 2D sensor such as a camera, or a 2.5D sensor such as a Kinect depth equipment. In this thesis, we focus on 3D scene and object parsing from a single image, aiming to produce a 3D parse that is able to support applications like robotics and navigation.
Our goal is to produce an expressive 3D parse: e.g., what is it, where is it, how can humans move and interact with it. Inferring such a 3D parse from a single image is not trivial. The main challenges are: the unknown separation of layout surfaces and objects; the high degree of occlusions and the diverse classes of objects in the cluttered scene; how to represent 3D object geometry in a way that can be predicted from noisy or partial observations, and can help assist reasoning like contact, support and extent. In this thesis, we put forward the hypothesis and prove in experiments, that a data-driven approach is able to directly produce a complete 3D recovery from 2D partial observations. Moreover, we show that by imposing constraints of 3D patterns and priors into the learned model (e.g., layout surfaces are flat and orthogonal to adjacent surfaces, support height can reveal the full extent of an occluded object, 2D complete silhouettes can guide reconstructions beyond partial foreground occlusions, and a shape can be decomposed into a set of simple parts), we are able to obtain a more accurate reconstruction of the scene and a structural representation of the object.
We present our approaches at different levels of detail, from a rough layout level to a more complex scene level and finally to the most detailed object level. We start by estimating the 3D room layout from a single RGB image, proposing an approach that generalizes across panoramas and perspective images, cuboid layouts and more general layouts (e.g., “L”-shape room). We then make use of an additional depth image, explore at the scene level to recover the complete 3D scene with layouts and all objects jointly. At the object level, we propose to recover each 3D object with robustness to possible partial foreground occlusions. Finally, we represent each 3D object as a 3D composite of sets of primitives, recurrently parsing each shape into primitives given a single depth view. We demonstrate the efficacy of each proposed approach with extensive experiments both quantitatively and qualitatively on public datasets.