|Abstract:||Advancements in general-purpose computing on GPUs [1, 2, 3, 4, 5, 6, 7] have led to a resurgence of deep learning methods in computer vision, and these methods have since produced tremendous successes in the field. Prominent examples are the progress made on image classification [8, 9, 10], image segmentation [11, 12, 13, 14, 15], object detection [16, 17, 12, 18] and vision-language tasks such as image captioning [19, 20, 21] and visual question answering [22, 23, 24, 25, 26]. Convolutional neural networks [27, 28] and/or recurrent neural networks, trained to regress to a single value or to classify into a single class label, are the workhorses of most of these methods. However, many computer vision problems are ambiguous, i.e. they have more than one plausible solution. Therefore, we need methods that (a) can estimate the multi-modal (i.e. with multiple peaks) probability distribution in the output space, and (b) produce diverse and meaningful solutions from the estimated multi-modal probability distribution.
In this thesis, we tackle ambiguous problems (which have multiple solutions) such as image colorization, image captioning and scene-graph prediction. Our strategy to generate multiple solutions has two steps: (i) We first generate multiple proposals given the input image. These proposals are not to be confused with the object or region proposals of detection networks. Each of our proposals encodes the properties/characteristics of the corresponding solution/output, before the output is generated. (ii) Given a proposal, we then generate the solution/output which (approximately) adheres to the constraints specified in that proposal. Multiple proposals thus allow us to generate multiple solutions.
More than one colorization is feasible for a grey-level image. For image colorization, we develop a regression-based method to generate a single colorization. Then, we use histograms as proposals with this regression method. Our histogram proposals encode the different color schemes desired in the output colorizations. Thus, using different histograms allows our method to generate multiple realistic colorizations. Because histogram proposals are difficult for a user to input, we also develop a method that uses automatically generated (or learned) latent proposals. Our latent proposal method uses a combination of variational auto-encoders and mixture density networks to perform multiple colorization. To the best of our knowledge, this is the first method that demonstrates learned multiple colorizations.
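To illustrate why a mixture density can yield several distinct proposals, the following is a minimal sketch of sampling from a one-dimensional Gaussian mixture. It is not the method of this thesis: the function name `sample_mixture` and the mixture parameters are hypothetical and fixed by hand, whereas a mixture density network would predict them from the grey-level image.

```python
import random

def sample_mixture(weights, means, stds, rng):
    """Draw one sample from a 1-D Gaussian mixture.

    Returns the index of the chosen component and the sampled value.
    """
    # Pick a component in proportion to its mixture weight.
    k = rng.choices(range(len(weights)), weights=weights, k=1)[0]
    # Sample from the chosen Gaussian component.
    return k, rng.gauss(means[k], stds[k])

# Hypothetical mixture parameters (assumed for illustration only).
weights = [0.5, 0.3, 0.2]
means = [-2.0, 0.0, 3.0]
stds = [0.1, 0.1, 0.1]

rng = random.Random(0)
proposals = [sample_mixture(weights, means, stds, rng) for _ in range(200)]

# Each sample lands near one of the modes, so repeated sampling
# yields proposals from several distinct peaks.
components_seen = sorted({k for k, _ in proposals})
```

Each draw falls close to one of the mixture means, so repeated sampling gives proposals from different modes, whereas a single regressed value could represent at most one of them.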
For image captioning, the goal is to produce a sentence (i.e. caption) that describes the input image. Any given image can be described in many ways, so multiple captions are correct. We show that convolutional networks can be used to model the language (or output caption), which was previously done using recurrent networks only. We find that convolutional networks produce more entropy in their posteriors for output words. Therefore, more unique words and n-grams get sampled in the output captions. This demonstrates that the posterior probability modeled by convolutional neural networks encodes more diversity and is multi-modal. Then, we use part-of-speech tags as proposals with our convolutional captioning method to sample multiple captions. Part-of-speech encodes the different language/syntactic structure of the output caption. Therefore, our method generates captions that have meaningfully different language or sentence structure. We show that our sampling of captions is fast, accurate and diverse. Finally, we propose a new method to build scene graphs that uses object-object (or paired object) proposals. Our part-of-speech technique helped add syntactic diversity to the multiple captions. In future work, scene graphs can be used to add semantic diversity to the captioning methods and help obtain diverse captions.
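The link between posterior entropy and the diversity of sampled words can be made concrete with a small worked example. This is a sketch under assumed values: the two word distributions below are hypothetical and not taken from the thesis; only the Shannon entropy formula is standard.

```python
import math

def entropy(probs):
    """Shannon entropy, in bits, of a discrete probability distribution."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

# Hypothetical posteriors over a four-word vocabulary at one caption position.
peaked = [0.97, 0.01, 0.01, 0.01]   # almost all mass on a single word
flat = [0.25, 0.25, 0.25, 0.25]     # mass spread evenly over all words

h_peaked = entropy(peaked)
h_flat = entropy(flat)
```

The evenly spread posterior has the higher entropy, and sampling from it picks each of the four words equally often, while sampling from the peaked posterior returns the same word about 97% of the time.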