Files in this item

Tu_Yuancheng.pdf (application/pdf, 2MB)


Title: English complex verb constructions: identification and inference
Author(s): Tu, Yuancheng
Director of Research: Roth, Dan
Doctoral Committee Chair(s): Shih, Chilin
Doctoral Committee Member(s): Roth, Dan; Girju, Roxana; Hockenmaier, Julia C.
Department / Program: Linguistics
Degree Granting Institution: University of Illinois at Urbana-Champaign
Subject(s): computational lexical semantics; Multiword Expressions; Complex Verb Predicates; Light Verb Constructions; Phrasal Verb Constructions; textual entailment; factive and implicative verbs; natural language processing; supervised machine learning
Abstract: The fundamental problem faced by automatic text understanding in Natural Language Processing (NLP) is to identify semantically related pieces of text and integrate them to compute the meaning of the whole text. However, the principle of compositionality runs into trouble very quickly when real language is examined, with its frequent Multiword Expressions (MWEs) whose meaning is not based on the meaning of their parts. MWEs occur in all text genres, are far more frequent and productive than is generally recognized, and pose serious difficulties for every kind of NLP application. Among these diverse kinds of MWEs, this dissertation focuses on English verb-related MWEs, constructs stochastic models to identify these complex verb predicates within a given context, and discusses empirically the significance of this MWE recognition component in the context of Textual Entailment (TE), an intricate semantic inference task that involves various levels of linguistic knowledge and logical reasoning. This dissertation develops high-quality computational models for three of the most frequent kinds of English complex verb constructions: Light Verb Construction (LVC), Phrasal Verb Construction (PVC) and Embedded Verb Construction (EVC), and demonstrates empirically their usage in textual entailment. The discriminative model for LVC identification achieves 86.3% accuracy when trained with groups of either contextual or statistical features. For PVC identification, the learning model reaches 79.4% accuracy, a 41.1% error reduction compared to the baseline. In addition, adding the LVC classifier helps the simple but robust lexical TE system achieve a 39.5% error reduction in accuracy and a 21.6% absolute improvement in F1. Similar improvements are achieved by adding the PVC and EVC classifiers to this entailment system, with absolute accuracy improvements of 30.6% and 39.4%, respectively.
In this dissertation, two types of automation are achieved with respect to English complex verb predicates: learning to recognize these MWEs within a given context, and discovering the significance of this identification within an empirical semantic NLP application, i.e., textual entailment. The lack of benchmark datasets for these linguistic phenomena is the main bottleneck to advancing computational research on them. The study presented in this dissertation provides two benchmark datasets for the identification of LVCs and PVCs, respectively, and three phenomenon-specific TE datasets to automate the investigation of the significance of these linguistic phenomena within a TE system. These datasets enable us to make a direct evaluation and comparison of lexically based models, reveal insightful differences between them, and create a simple but robust improved model combination. In the long run, we believe that the availability of these datasets will facilitate improved models that consider the various multiword-related phenomena within complex semantic systems, as well as the application of supervised machine learning models to optimize model combination and performance.
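The abstract reports some gains as error reductions and others as absolute improvements, and the two are easy to conflate. The sketch below illustrates the difference, assuming the common definition of relative error reduction (the share of the baseline's errors that the improved model removes); the abstract does not state which formula was used. The function names and the input numbers are hypothetical and are not results from the dissertation.

```python
def relative_error_reduction(baseline_acc: float, model_acc: float) -> float:
    """Fraction of the baseline's errors removed by the model.

    Assumed definition (not confirmed by the abstract):
    (baseline_error - model_error) / baseline_error
    """
    baseline_err = 1.0 - baseline_acc
    model_err = 1.0 - model_acc
    if baseline_err <= 0.0:
        raise ValueError("baseline has no errors to reduce")
    return (baseline_err - model_err) / baseline_err


def absolute_improvement(baseline_acc: float, model_acc: float) -> float:
    """Plain difference between the two accuracies, in accuracy points."""
    return model_acc - baseline_acc


# Hypothetical accuracies chosen only for illustration.
baseline, model = 0.50, 0.75
print(f"relative error reduction: {relative_error_reduction(baseline, model):.1%}")
print(f"absolute improvement:     {absolute_improvement(baseline, model):.1%}")
```

With these illustrative inputs, the same pair of accuracies yields a larger figure as an error reduction than as an absolute improvement, so the two kinds of number should not be compared directly.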
Issue Date: 2012-09-18
Rights Information: Copyright 2012 Yuancheng Tu
Date Available in IDEALS: 2012-09-18
Date Deposited: 2012-08
