Files in this item

application/pdf: Integrating Deep Web Data Sources.pdf (952kB)
(no description provided)

Title: Integrating Deep Web Data Sources
Author(s): Wu, Wensheng
Subject(s): computer science
Abstract: A large number of data sources on the Web are accessible only through their query interfaces. These sources are commonly known as Deep Web sources. For any domain of interest, there may be many such sources with varied query capabilities and content coverage. As a result, users frequently need to access multiple sources to find the desired information, which can be a very time-consuming and labor-intensive process. An effective solution to this problem is to build a virtual integration system over the sources. Such a system provides uniform access to the sources, thus freeing users from the details of individual sources. As an important step towards this goal, this dissertation studies the problem of integrating the query interfaces of Deep Web sources. Interface integration typically involves three very challenging tasks: (1) schema extraction, which infers the schema of each source query interface from its (HTML) representation; (2) schema matching, which accurately identifies semantic mappings among the attributes of different interfaces; and (3) schema merging, which merges the source interfaces into a well-formed global interface based on the identified attribute mappings. This dissertation presents IceQ, a novel and effective interface integration system. In developing IceQ, we address the limitations of existing solutions and make several key contributions. First, we propose a hierarchical modeling of interfaces and develop a novel spatial clustering algorithm to extract the hierarchical schema of a query interface. Second, we develop a novel interactive clustering-based matching algorithm to accurately match a large number of schemas and effectively resolve uncertain mappings via user interaction. Third, we develop a question-answering technique to learn attribute instances from the Web to assist in schema matching. Fourth, we propose a novel constraint-based optimization framework for merging schemas and develop an effective merging algorithm based on the idea of clustering aggregation. Extensive experiments have been conducted to evaluate IceQ, and the results show that it is highly effective.
Issue Date: 2006-07
Genre: Technical Report
Other Identifier(s): UIUCDCS-R-2006-2709
Rights Information: You are granted permission for the non-commercial reproduction, distribution, display, and performance of this technical report in any format, BUT this permission is only for a period of 45 (forty-five) days from the most recent time that you verified that this technical report is still available from the University of Illinois at Urbana-Champaign Computer Science Department under terms that include this permission. All other rights are reserved by the author(s).
Date Available in IDEALS: 2009-04-21
