Files in this item



application/pdf: A Holistic Para ... Scale Schema Matching.pdf (1MB)


Title: A Holistic Paradigm for Large Scale Schema Matching
Author(s): He, Bin
Subject(s): computer science
Abstract: Schema matching is a critical problem for integrating heterogeneous information sources. Traditionally, the problem of matching multiple schemas has essentially relied on finding pairwise attribute correspondences in isolation. In contrast, this thesis proposes a new matching paradigm, holistic schema matching, to match many schemas at the same time and find all matchings at once. By handling a set of schemas together, we can explore their context information, which reflects the semantic correspondences among attributes. Such information is not available when schemas are matched only in pairs. To realize holistic schema matching, we develop two approaches in sequence. To begin with, we develop the MGS framework, which finds simple 1:1 matchings by viewing schema matching as hidden model discovery. Then, to deal with complex matchings, we further develop the DCM framework by abstracting schema matching as correlation mining. Further, to automate the entire matching process, we integrate the DCM framework with automatically extracted interfaces and find that the inevitable errors in automatic interface extraction may significantly affect the matching result. To make the DCM framework robust against such "noisy" schemas, we propose to integrate it with an ensemble approach by randomizing the schema data into multiple DCM matchers and aggregating their ranked results by majority voting. Last, as our matching algorithms require large-scale schemas in the same domain (e.g., Books and Airfares) as input, we develop an object-focused crawler for effectively collecting query interfaces and a model-differentiation-based clustering approach for organizing schemas into their domain hierarchy.
Issue Date: 2006-06
Genre: Technical Report
Other Identifier(s): UIUCDCS-R-2006-2652
Rights Information: You are granted permission for the non-commercial reproduction, distribution, display, and performance of this technical report in any format, BUT this permission is only for a period of 45 (forty-five) days from the most recent time that you verified that this technical report is still available from the University of Illinois at Urbana-Champaign Computer Science Department under terms that include this permission. All other rights are reserved by the author(s).
Date Available in IDEALS: 2009-04-20
