Files in this item



application/pdf: Discovering Aud ... os of Human Activities.pdf (2MB)


Title:Discovering Audio-Visual Associations in Narrated Videos of Human Activities
Author(s):Oezer, Tuna
Subject(s):natural language processing
Abstract:This research presents a novel method for learning the lexical semantics of action verbs. The primary focus is on actions that are directed towards objects, such as kicking a ball or pushing a chair. Specifically, this dissertation presents a robust and scalable method for acquiring grounded lexical semantics by discovering audio-visual associations in narrated videos. The narration associated with the video contains many words, including other verbs that are unrelated to the action. The actual name of the depicted action is only occasionally mentioned by the narrator. More generally, this research presents an algorithm that can reliably and autonomously discover an association between two events, such as the utterance of a verb and the depiction of an action, even if the two events are only loosely correlated with each other. Semantics is represented in a grounded way by association sets, where each set is a collection of sensory inputs associated with a high-level concept. Each association set associates video sequences that depict a given action with utterances of the name of the action. The association sets are discovered in an unsupervised way. This dissertation also shows how to extract features from the video and audio for this purpose. Extensive experimental results are presented. The experiments make use of several hours of video depicting a human performing 13 actions with 6 objects. In addition, the performance of the algorithm was tested with data provided by an external research group. The unsupervised learning algorithm presented in this dissertation has been compared to standard supervised learning algorithms. This dissertation introduces a number of relevant experimental parameters and various new analysis techniques.
The experimental results show that the algorithm presented in this dissertation successfully discovers the correct associations between video scenes and audio utterances in an unsupervised way despite the imperfect correlation between the video and audio. The algorithm outperforms standard supervised learning algorithms. Among other things, this research shows that the performance of the algorithm depends mainly on the strength of the correlation between video and audio, the length of the narration associated with each video scene and the total number of words in the language.
Issue Date:2008-02
Genre:Technical Report
Other Identifier(s):UIUCDCS-R-2008-2920
Rights Information:You are granted permission for the non-commercial reproduction, distribution, display, and performance of this technical report in any format, BUT this permission is only for a period of 45 (forty-five) days from the most recent time that you verified that this technical report is still available from the University of Illinois at Urbana-Champaign Computer Science Department under terms that include this permission. All other rights are reserved by the author(s).
Date Available in IDEALS:2009-04-22
