Title: MEDIATE: Learning to Match Entity Mentions across Text and Databases
Author(s): Doan, AnHai; Li, Xin; Roth, Dan
Subject(s): computer science
Abstract: Many real-world applications increasingly involve both structured data and text. A given real-world entity is often referred to in different ways, such as "Helen Hunt" and "Mrs. H. E. Hunt", both within and across the structured data and the text. Due to this semantic heterogeneity, it remains extremely difficult to glue together information about real-world entities from the available data sources and effectively utilize both types of information. This paper describes the MEDIATE system, which automatically matches entity mentions within and across both text and databases. The system can handle multiple types of entities (e.g., people, movies, locations), is easily extensible to new entity types, and operates with no need for annotated training data. Given a relational database and a set of text documents, MEDIATE learns from the data a generative model that provides a probabilistic view on how a data creator might have generated mentions, then applies it to matching the mentions. The model exploits the similarity of mention names, common transformations across mentions, and context information such as age, gender, and entity co-occurrence. To maximize matching accuracy, MEDIATE also propagates information across contexts. Experiments on real-world data show that MEDIATE significantly outperforms existing methods that address aspects of this problem, and that it can exploit text to improve record linkage, and vice versa.
Issue Date: 2006-02
Genre: Technical Report
Other Identifier(s): UIUCDCS-R-2006-2692
Rights Information: You are granted permission for the non-commercial reproduction, distribution, display, and performance of this technical report in any format, BUT this permission is only for a period of 45 (forty-five) days from the most recent time that you verified that this technical report is still available from the University of Illinois at Urbana-Champaign Computer Science Department under terms that include this permission. All other rights are reserved by the author(s).
Date Available in IDEALS: 2009-04-21
