Files in this item

Ratinov_Lev.pdf (application/pdf, 2 MB)

Title: Exploiting knowledge in NLP
Author(s): Ratinov, Lev
Director of Research: Roth, Dan
Doctoral Committee Chair(s): Roth, Dan
Doctoral Committee Member(s): Han, Jiawei; Zhai, ChengXiang; Mihalcea, Rada
Department / Program: Computer Science
Discipline: Computer Science
Degree Granting Institution: University of Illinois at Urbana-Champaign
Subject(s): Machine Learning
Natural language processing (NLP)
Text Classification
Co-reference Resolution
Concept Disambiguation
Information Extraction
Named Entity Recognition
Semi-Supervised Learning
Abstract: In recent decades, society has come to depend on computers for a growing number of tasks. The first steps in NLP applications involve identifying topics, entities, concepts, and relations in text. Traditionally, statistical models have been deployed successfully for these problems. However, the major trend so far has been "scaling up by dumbing down": applying sophisticated statistical algorithms that operate on very simple or low-level features of the text. This trend is exemplified by expressions such as "we present a knowledge-lean approach", which have traditionally been viewed as a positive statement, one that helps papers get into top conferences. This thesis argues that it is essential to use knowledge in NLP, proposes several ways of doing so, and provides case studies on several fundamental NLP problems.

It is clear that humans use a great deal of knowledge when understanding text. Consider the sentence "Carnahan campaigned with Al Gore whenever the vice president was in Missouri." and two questions: (1) who is the vice president? (2) is this sentence about politics or sports? A knowledge-lean NLP approach will have great difficulty answering the first question and will require a lot of training data to answer the second, whereas people answer both effortlessly.

We are not the first to suggest that NLP requires knowledge. One of the first large-scale efforts, CYC, started in 1984 and by 1995 had consumed a person-century of effort collecting 100,000 concepts and 1,000,000 commonsense axioms, including "You can usually see people's noses, but not their hearts". Unfortunately, such an effort has several problems. (a) The set of facts we can deduce is significantly larger than one million. For example, in the axiom above, "heart" can be replaced by any internal organ or tissue, as well as by a bank account, thoughts, etc., leading to thousands of axioms. (b) The axioms often do not hold: if a person is standing with their back to you, you cannot see their nose, and during open-heart surgery you can see someone's heart. (c) Matching the concepts to natural-language expressions is challenging. For example, "Al Gore" can be referred to as "Democrat", "environmentalist", "vice president", or "Nobel prize laureate", among other things, and the idea of "buying a used car" can also be expressed as "purchasing a pre-owned automobile". Lexical variability in text makes using knowledge challenging.

Instead of focusing on obtaining a large set of logic axioms, we focus on using knowledge-rich features in NLP solutions. We use three sources of knowledge: a large corpus of unlabeled text, encyclopedic knowledge derived from Wikipedia, and first-order-logic-like constraints within a machine learning framework. Specifically, we have developed a Named Entity Recognition system that uses word representations induced from unlabeled text and gazetteers extracted from Wikipedia to achieve new state-of-the-art performance. We have investigated the implications of augmenting text representation with a set of Wikipedia concepts; the concepts can either be directly mentioned in a document or not explicitly mentioned but closely related. We have shown that such a document representation allows more efficient search and categorization than traditional lexical representations. Our next step is using the knowledge injected from Wikipedia for coreference resolution. While the majority of the knowledge in this thesis is encyclopedic, we have also investigated how knowledge about the structure of the problem, in the form of constraints, can allow leveraging unlabeled data in semi-supervised settings.
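The idea of augmenting a lexical document representation with Wikipedia concept features can be sketched as follows. This is a minimal illustration under assumed data, not the thesis's implementation: the tiny `CONCEPT_INDEX` mapping from surface phrases to Wikipedia concepts is invented for the example, standing in for the large-scale resources derived from Wikipedia.

```python
# Minimal sketch: augment a bag-of-words vector with Wikipedia concept
# features. CONCEPT_INDEX is a toy, hand-made stand-in for a real
# phrase-to-concept resource mined from Wikipedia.
from collections import Counter

CONCEPT_INDEX = {
    "al gore": ["Al_Gore", "Vice_President_of_the_United_States"],
    "missouri": ["Missouri"],
    "campaigned": ["Political_campaign"],
}

def bag_of_words(text):
    """Plain lexical features: lowercased token counts."""
    return Counter(text.lower().replace(".", "").split())

def augment_with_concepts(text):
    """Lexical features plus a CONCEPT:* feature for every concept
    linked to a phrase that appears in the text."""
    features = bag_of_words(text)
    lowered = text.lower()
    for phrase, concepts in CONCEPT_INDEX.items():
        if phrase in lowered:
            for concept in concepts:
                features["CONCEPT:" + concept] += 1
    return features

doc = ("Carnahan campaigned with Al Gore whenever "
       "the vice president was in Missouri.")
feats = augment_with_concepts(doc)
```

With such features, a document mentioning "Al Gore" shares the concept feature `CONCEPT:Vice_President_of_the_United_States` with documents that use different surface wording, which is the kind of generalization across lexical variability the abstract describes.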
This thesis shows how to use knowledge to improve the state of the art on four fundamental NLP problems: text categorization, information extraction, concept disambiguation, and coreference resolution, tasks that have been considered the bedrock of NLP since its inception.
Issue Date: 2012-05-22
Rights Information: Copyright 2012 Lev Ratinov
Date Available in IDEALS: 2012-05-22
Date Deposited: 2012-05
