Files in this item

File: Making Holistic ... th Sampling and Voting.pdf (304kB)
Description: (no description provided)
Format: PDF (application/pdf)

Description

Title: Making Holistic Schema Matching Robust: An Ensemble Framework with Sampling and Voting
Author(s): He, Bin; Chang, Kevin Chen-Chuan
Subject(s): Database; Web mining
Abstract: With the prevalence of databases on the Web, large-scale integration has become a pressing problem. As an essential task, holistic schema matching (i.e., discovering attribute correspondences among many schemas) has been actively studied recently. Being a "data mining" approach in nature, holistic schema matching benefits from the large scale of input schema data but also suffers from noise. Such noise is often unavoidable in the automatic extraction of schema data, which is mandatory in large-scale integration. For holistic matching to be viable, it is thus essential to make it robust against noisy schemas. Toward this goal, we propose a novel "ensemble" framework, which aggregates a multitude of base holistic matchers to achieve robustness by exploiting statistical sampling and majority voting. To begin with, we observe that Web query interfaces possess two interesting characteristics: 1) "redundancy of attributes": schemas tend to share attributes, and 2) "non-dominance of noise": noisy schemas are relatively few. These observations inspire us to develop a generic ensemble framework, which consists of multiple sampling, ranking aggregation, and matching selection. In essence, our approach creates an ensemble of base holistic matchers by randomizing the schema data into many trials and aggregating their ranked results by majority voting. We provide analytic justification of the robustness of the ensemble. Empirically, our experiments show that the "ensemblization" indeed significantly boosts the matching accuracy over automatically extracted schema data.
Issue Date: 2004-07
Genre: Technical Report
Type: Text
URI: http://hdl.handle.net/2142/10881
Other Identifier(s): UIUCDCS-R-2004-2451
Rights Information: You are granted permission for the non-commercial reproduction, distribution, display, and performance of this technical report in any format, BUT this permission is only for a period of 45 (forty-five) days from the most recent time that you verified that this technical report is still available from the University of Illinois at Urbana-Champaign Computer Science Department under terms that include this permission. All other rights are reserved by the author(s).
Date Available in IDEALS: 2009-04-16
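
The sampling-and-voting idea summarized in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation: `toy_base_matcher`, `ensemble_match`, and all parameters are hypothetical names introduced here, the base matcher is a deliberately naive stand-in (it proposes attribute pairs that never co-occur in a schema), and the report's ranking aggregation is simplified to a majority vote over each trial's top-ranked matchings.

```python
import random
from collections import Counter
from itertools import combinations


def toy_base_matcher(schemas):
    """Hypothetical stand-in for a base holistic matcher.

    Proposes attribute pairs that never appear together in one schema
    (a crude synonym signal), ranked by the frequency of the rarer attribute.
    """
    freq = Counter(a for s in schemas for a in set(s))
    cooccur = Counter()
    for s in schemas:
        for pair in combinations(sorted(set(s)), 2):
            cooccur[pair] += 1
    candidates = [
        (min(freq[a], freq[b]), (a, b))
        for a, b in combinations(sorted(freq), 2)
        if cooccur[(a, b)] == 0
    ]
    # Highest score first; ties broken alphabetically for determinism.
    candidates.sort(key=lambda c: (-c[0], c[1]))
    return [pair for _, pair in candidates]


def ensemble_match(schemas, base_matcher, trials=101, sample_size=5,
                   top_k=1, seed=0):
    """Assumed simplification of the ensemble: sample, match, vote."""
    rng = random.Random(seed)
    votes = Counter()
    for _ in range(trials):
        # Multiple sampling: each trial sees a random subset of the schemas.
        sample = rng.sample(schemas, sample_size)
        # Each trial votes for its top-k ranked matchings.
        for pair in base_matcher(sample)[:top_k]:
            votes[pair] += 1
    # Matching selection: keep matchings backed by a majority of trials.
    return [pair for pair, n in votes.most_common() if n > trials / 2]


# Twenty clean schemas plus one noisy extraction that lists both synonyms.
schemas = ([["author", "title"]] * 10
           + [["writer", "title"]] * 10
           + [["author", "writer", "title"]])

print("single matcher:", toy_base_matcher(schemas))
print("ensemble:", ensemble_match(schemas, toy_base_matcher, trials=201))
```

In this toy data the single noisy schema makes "author" and "writer" co-occur once, so the naive matcher run on all schemas proposes nothing, while most small samples exclude the noisy schema and therefore vote for the pair. The sample size and number of trials are illustrative choices, not values taken from the report.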

