Files in this item



RAMNATH-THESIS-2021.pdf (PDF, 16 MB)


Title: Fact-based visual question answering using knowledge graph embeddings
Author(s): Ramnath, Kiran
Advisor(s): Hasegawa-Johnson, Mark
Department / Program: Electrical & Computer Eng
Discipline: Electrical & Computer Engr
Degree Granting Institution: University of Illinois at Urbana-Champaign
Subject(s): Visual Question Answering; Knowledge Graphs
Abstract: Humans have a remarkable capability to learn new concepts, process them in relation to their existing mental models of the world, and seamlessly leverage their knowledge and experiences while reasoning about the outside world perceived through vision and language. Fact-based Visual Question Answering (FVQA), a challenging variant of VQA, requires a QA system to mimic this human ability: it must incorporate facts from a diverse knowledge graph (KG) in its reasoning process to produce an answer. Large KGs, especially common-sense KGs, are known to be incomplete, i.e., the absence of a fact from the KG does not imply that the fact is false. Therefore, the ability to reason over incomplete KGs is a critical requirement for QA in real-world applications, one that has not been addressed extensively in the literature. We develop a novel QA architecture that can reason over incomplete KGs, something current FVQA state-of-the-art (SOTA) approaches cannot do because of their critical reliance on fact retrieval. We use KG embeddings, a technique widely used for KG completion, for the downstream task of FVQA. We also present a new image representation technique, which we call image-as-knowledge, positing that an image is a collection of knowledge concepts describing each entity present in it. We further show that KG embeddings hold information complementary to word embeddings. A combination of both embeddings permits performance comparable to SOTA methods on the standard answer-retrieval task, and significantly better performance (26% absolute) on the proposed missing-edge reasoning task. The next research problem pursued is extending the accessibility of such systems through a speech interface and support for multiple languages, neither of which has been addressed in prior studies. We present a new task, along with a synthetically generated dataset, for Fact-based Visual Spoken-Question Answering (FVSQA). FVSQA is based on the FVQA dataset, with the difference that the question is spoken rather than typed.
Three sub-tasks are proposed: (1) speech-to-text based, (2) end-to-end, without speech-to-text as an intermediate component, and (3) cross-lingual, in which the question is spoken in a language different from that in which the KG is recorded. The end-to-end and cross-lingual tasks are the first to require world knowledge from a multi-relational KG as a differentiable layer in an end-to-end spoken language understanding task; hence the proposed reference implementation is called Worldly-Wise (WoW). WoW is shown to perform end-to-end cross-lingual FVSQA at the same levels of accuracy across three languages: English, Hindi, and Turkish.
Issue Date: 2021-04-27
Rights Information: Copyright 2021 Kiran Ramnath
Date Available in IDEALS: 2021-09-17
Date Deposited: 2021-05
