Files in this item

File: WANG-DISSERTATION-2018.pdf (25MB)
Description: (no description provided)
Format: PDF (application/pdf)

Description

Title: Learning joint latent representations for images and language
Author(s): Wang, Liwei
Director of Research: Lazebnik, Svetlana
Doctoral Committee Chair(s): Lazebnik, Svetlana
Doctoral Committee Member(s): Forsyth, David; Hockenmaier, Julia; Schwing, Alexander; Tu, Zhuowen
Department / Program: Computer Science
Discipline: Computer Science
Degree Granting Institution: University of Illinois at Urbana-Champaign
Degree: Ph.D.
Genre: Dissertation
Subject(s): deep learning; computer vision
Abstract: Computer vision is moving from predicting discrete, categorical labels to generating rich descriptions of visual data, in particular in the form of natural language. Learning joint latent representations for images and language is vital to solving many image-text tasks, including image-sentence retrieval, visual grounding, and image captioning. In this thesis, we first propose two-branch neural networks for learning the similarity between these two data modalities. Two network structures are proposed to produce different output representations. The first, referred to as an embedding network, learns an explicit shared latent embedding space with a maximum-margin ranking loss and novel neighborhood constraints. The second, referred to as a similarity network, fuses the two branches via element-wise product and is trained with a regression loss to directly predict a similarity score. Extensive experiments show that our networks achieve high accuracy for phrase localization on the Flickr30K Entities dataset and for bi-directional image-sentence retrieval on the Flickr30K and COCO datasets. We then explore the image captioning problem using conditional variational auto-encoders (CVAEs). Standard CVAEs with a fixed Gaussian prior yield descriptions with too little variability. Instead, we propose two models that explicitly structure the latent space with K components corresponding to different types of image content, and combine components to create priors for images that contain multiple types of content simultaneously (e.g., several kinds of objects). The first model uses a Gaussian mixture model (GMM) prior, while the second defines a novel Additive Gaussian (AG) prior that linearly combines component means. Experiments show that both models produce captions that are more diverse and more accurate than a strong LSTM baseline or a “vanilla” CVAE with a fixed Gaussian prior, with AG-CVAE showing particular promise.
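As a rough illustration of the maximum-margin ranking loss mentioned for the embedding network, the sketch below computes a bi-directional hinge loss over a small batch of matched image and sentence embeddings. It is a minimal sketch, not the dissertation's implementation: the function names, the dot-product similarity, the margin value, and the assumption that embeddings are already L2-normalized are all illustrative choices, and the neighborhood constraints are omitted.

```python
def dot(u, v):
    """Dot product of two equal-length vectors (used here as the similarity)."""
    return sum(a * b for a, b in zip(u, v))


def bidirectional_ranking_loss(image_embs, sentence_embs, margin=0.1):
    """Hypothetical bi-directional max-margin ranking loss.

    Assumes image_embs[i] and sentence_embs[i] form a matching pair and that
    every other pairing in the batch is a negative. The margin value is an
    illustrative assumption.
    """
    n = len(image_embs)
    loss = 0.0
    for i in range(n):
        pos = dot(image_embs[i], sentence_embs[i])
        for j in range(n):
            if j == i:
                continue
            # Image i should score higher with its own sentence than with sentence j.
            loss += max(0.0, margin - pos + dot(image_embs[i], sentence_embs[j]))
            # Sentence i should score higher with its own image than with image j.
            loss += max(0.0, margin - pos + dot(image_embs[j], sentence_embs[i]))
    return loss


# Toy batch: two images and two sentences in a 2-D embedding space.
images = [[1.0, 0.0], [0.0, 1.0]]
aligned = [[1.0, 0.0], [0.0, 1.0]]      # each sentence matches its image
misaligned = [[0.0, 1.0], [1.0, 0.0]]   # sentences swapped

loss_aligned = bidirectional_ranking_loss(images, aligned)
loss_misaligned = bidirectional_ranking_loss(images, misaligned)
```

In this toy batch the aligned pairs incur no loss because every positive score exceeds every negative score by more than the margin, while the swapped pairs are penalized in both retrieval directions.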
To further improve the caption decoder inherited from the AG-CVAE model, we attempt to train it by optimizing caption evaluation metrics (e.g., BLEU scores) using policy gradient methods from reinforcement learning. The loss function contains two terms: a maximum likelihood estimation (MLE) loss and a reinforcement term based on a sum of non-differentiable rewards. Experiments show that training the decoder with this combined loss helps generate more accurate captions. We also study the problem of ranking generated sentences conditioned on the input image and explore several variants of deep rankers built on top of the two-branch networks proposed earlier.
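The two-term loss described in the abstract can be sketched as follows. This is only an illustrative sketch under stated assumptions: the abstract does not say how the terms are weighted, so `rl_weight`, the optional `baseline`, and all function names are hypothetical, and the reinforcement term uses the standard REINFORCE surrogate (negative reward times the log-probability of a sampled caption). The code computes scalar loss values from token probabilities; in actual training the reward would be treated as a constant and gradients would flow through the log-probabilities.

```python
import math


def mle_loss(reference_token_probs):
    """Negative log-likelihood of the reference caption's tokens under the decoder."""
    return -sum(math.log(p) for p in reference_token_probs)


def reinforce_loss(sampled_token_probs, reward, baseline=0.0):
    """REINFORCE surrogate: -(reward - baseline) * log p(sampled caption).

    `reward` stands in for a non-differentiable caption metric such as BLEU;
    `baseline` is an assumed variance-reduction term, not taken from the source.
    """
    log_prob = sum(math.log(p) for p in sampled_token_probs)
    return -(reward - baseline) * log_prob


def combined_loss(reference_token_probs, sampled_token_probs, reward,
                  rl_weight=1.0, baseline=0.0):
    """MLE term plus a weighted reinforcement term (the weighting is an assumption)."""
    return (mle_loss(reference_token_probs)
            + rl_weight * reinforce_loss(sampled_token_probs, reward, baseline))


# Toy example: one-token reference and one-token sampled caption.
toy_loss = combined_loss([0.5], [0.5], reward=1.0)
mle_only = combined_loss([0.5], [0.5], reward=1.0, rl_weight=0.0)
```

Setting `rl_weight` to zero recovers plain maximum likelihood training, which makes it easy to compare the two regimes on the same decoder.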
Issue Date: 2018-07-10
Type: Text
URI: http://hdl.handle.net/2142/101544
Rights Information: Copyright 2018 Liwei Wang
Date Available in IDEALS: 2018-09-27
Date Deposited: 2018-08

