Files in this item

HUNG-THESIS-2020.pdf (application/pdf, 2 MB)

Title: Visual relationship understanding
Author(s): Hung, Zih-Siou
Advisor(s): Lazebnik, Svetlana
Department / Program: Computer Science
Discipline: Computer Science
Degree Granting Institution: University of Illinois at Urbana-Champaign
Subject(s): Visual Relationship Detection; Action Recognition
Abstract: This thesis addresses two visual understanding tasks: visual relationship detection (VRD) and video action recognition. The majority of the thesis focuses on VRD, which is our main contribution. Relations among entities play a central role in image and video understanding. In the first three chapters, we discuss visual relationship detection, whose goal is to recognize all (subject, predicate, object) tuples in a given image. Because modeling (subject, predicate, object) relation triplets is complex, it is crucial to develop a method that can not only recognize seen relations but also generalize to unseen cases. Inspired by the previously proposed visual translation embedding model, VTransE [1], we propose a context-augmented translation embedding model that can capture both common and rare relations. VTransE maps entities and predicates into a low-dimensional embedding vector space in which the predicate is interpreted as a translation vector between the embedded features of the subject and object bounding box regions. Our model additionally incorporates the contextual information captured by the bounding box of the union of the subject and the object, and learns the embeddings under the constraint predicate = union(subject, object) - subject - object. In a comprehensive evaluation on multiple challenging benchmarks, our approach outperforms previous translation-based models and comes close to or exceeds the state of the art across a range of settings, from small-scale to large-scale datasets and from common to previously unseen relations. It also achieves promising results on the recently introduced task of scene graph generation. In the final part of the thesis, we consider action understanding in videos. In many scenarios, we observe moving objects rather than still images, so it is also important to capture motion information and recognize the action being performed.
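The context-augmented translation constraint described above can be sketched as follows. This is a minimal illustration only: the embedding dimension, toy feature vectors, and projection matrices are all hypothetical placeholders, not the thesis's actual architecture or learned weights.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes: raw box-feature dimension and shared embedding dimension.
feat_dim, emb_dim = 16, 8

# Toy appearance features for the subject box, object box, and their union box.
f_subj = rng.normal(size=feat_dim)
f_obj = rng.normal(size=feat_dim)
f_union = rng.normal(size=feat_dim)

# Hypothetical learned projections into the shared embedding space
# (in the real model these would be trained, not random).
W_s = rng.normal(size=(emb_dim, feat_dim))
W_o = rng.normal(size=(emb_dim, feat_dim))
W_u = rng.normal(size=(emb_dim, feat_dim))

subj = W_s @ f_subj
obj = W_o @ f_obj
union = W_u @ f_union

# The constraint predicate = union(subject, object) - subject - object:
# the predicate embedding is what remains of the union-box context after
# removing the subject and object embeddings.
pred = union - subj - obj
print(pred.shape)  # (8,)
```

Training would then encourage `pred` to match the embedding of the ground-truth predicate, so that rare relations can be scored even when their exact triplet was never seen.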
Recent work either applies 3D convolution operators to extract motion implicitly or adds an additional optical-flow path to leverage temporal features. In our work, we propose a novel correlation operator to establish a matching between consecutive frames; this matching encodes the movement of objects through time. Combined with a classical appearance stream, the proposed method thus learns appearance and motion representations in parallel. On the challenging Something-Something dataset [2], we empirically demonstrate that our network achieves performance comparable to the state-of-the-art method.
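The idea of a correlation operator matching consecutive frames can be sketched as below. This is a simplified, assumed formulation (feature-map sizes, the neighbourhood radius `k`, and the function name are illustrative, not the thesis's actual operator): each spatial position in frame t is dot-producted against a local neighbourhood of positions in frame t+1, producing one correlation channel per displacement.

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy feature maps for two consecutive frames: (channels, height, width).
C, H, W = 4, 6, 6
feat_t = rng.normal(size=(C, H, W))
feat_t1 = rng.normal(size=(C, H, W))


def correlation(a, b, k):
    """For each position in feature map `a` (frame t), compute dot products
    with features of `b` (frame t+1) over a (2k+1) x (2k+1) neighbourhood.
    Returns one channel per displacement: shape ((2k+1)**2, H, W)."""
    C, H, W = a.shape
    out = np.zeros(((2 * k + 1) ** 2, H, W))
    # Zero-pad frame t+1 spatially so every displacement stays in bounds.
    b_pad = np.pad(b, ((0, 0), (k, k), (k, k)))
    idx = 0
    for dy in range(-k, k + 1):
        for dx in range(-k, k + 1):
            shifted = b_pad[:, k + dy:k + dy + H, k + dx:k + dx + W]
            out[idx] = (a * shifted).sum(axis=0)  # per-position dot product
            idx += 1
    return out


corr = correlation(feat_t, feat_t1, k=1)
print(corr.shape)  # (9, 6, 6)
```

A high response in the channel for displacement (dy, dx) indicates that the content at a position moved by roughly that offset between frames, which is the motion cue the appearance stream alone cannot provide.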
Issue Date: 2020-05-12
Rights Information: Copyright 2020 Zih-Siou Hung
Date Available in IDEALS: 2020-08-26
Date Deposited: 2020-05
