Files in this item



application/pdf: PAN-DISSERTATION-2020.pdf (3MB)


Title: Cross-lingual entity extraction and linking for 300 languages
Author(s): Pan, Xiaoman
Director of Research: Ji, Heng
Doctoral Committee Chair(s): Ji, Heng
Doctoral Committee Member(s): Han, Jiawei; Tong, Hanghang; Knight, Kevin
Department / Program: Computer Science
Discipline: Computer Science
Degree Granting Institution: University of Illinois at Urbana-Champaign
Subject(s): entity extraction; entity linking
Abstract: Information provided in languages that people can understand saves lives in crises. For example, the language barrier was one of the main difficulties faced by humanitarian workers responding to the Ebola crisis in 2014. We propose to break language barriers by extracting information (e.g., entities) from a massive variety of languages and grounding it in an existing Knowledge Base (KB) that is accessible to users in their own language (e.g., a reporter from the World Health Organization who speaks only English). The ambitious goal of this thesis is to develop a Cross-lingual Entity Extraction and Linking framework for 1,000 fine-grained entity types and the 300 languages that exist in Wikipedia. Given a document in any of these languages, our framework is able to identify entity name mentions, assign a fine-grained type to each mention, and link it to an English KB if it is linkable. Traditional entity linking methods rely on costly human-annotated data to train supervised learning-to-rank models that select the best candidate entity for each mention. In contrast, we propose a novel unsupervised represent-and-compare approach that can accurately capture the semantic meaning of each mention and directly compare its representation with that of each candidate entity in the target KB. First, we leverage a deep symbolic semantic representation, the Abstract Meaning Representation, to represent contextual properties of mentions. Then we enrich the representation of each contextual word and entity mention with a novel distributed semantic representation based on cross-lingual joint entity and word embedding. We develop a novel method to generate cross-lingual data that is a mix of entities and contextual words based on Wikipedia. This distributed semantics enables effective entity extraction and linking.
Because the joint entity and word embedding space is constructed across languages, we further extend it to all 300 Wikipedia languages and to fine-grained entity extraction and linking for the 1,000 entity types defined in YAGO. Finally, using knowledge-driven question answering as a case study, we demonstrate the effectiveness of acquiring external knowledge using entity extraction and linking to improve downstream applications.
Issue Date: 2020-12-03
Rights Information: Copyright 2020 Xiaoman Pan
Date Available in IDEALS: 2021-03-05
Date Deposited: 2020-12
