|Abstract:||A large number of data sources on the Web (e.g., Amazon.com) are only accessible through their query interfaces. These sources are commonly known as Deep Web sources. For any domain of interest, there may be many such sources with varied query capabilities and content coverage. As a result, users frequently need to access multiple sources in order to find the desired information, which can be a very time-consuming and labor-intensive process. An effective solution to this problem is to build a virtual integration system over the sources. Such a system provides uniform access to the sources, thus freeing users from the details of individual sources.
As an important step towards this goal, this dissertation studies the problem of integrating query interfaces of Deep Web sources. Interface integration typically involves three very challenging tasks: (1) schema extraction, which infers the schema of each source query interface from its (HTML) representation; (2) schema matching, which accurately identifies semantic mappings among the attributes from different interfaces; and (3) schema merging, which properly merges the source interfaces into a well-formed global interface based on the identified attribute mappings.
This dissertation presents IceQ, a novel and effective interface integration system. In developing IceQ, we address the limitations of existing solutions and make several key contributions. First, we propose a hierarchical model of interfaces and develop a novel spatial clustering algorithm to extract the hierarchical schema of each query interface. Second, we develop a novel interactive clustering-based matching algorithm to accurately match a large number of schemas and effectively resolve uncertain mappings via user interaction. Third, we develop a question-answering technique to learn attribute instances from the Web to assist in schema matching. Fourth, we propose a novel constraint-based optimization framework for merging schemas and develop an effective merging algorithm based on the idea of clustering aggregation. Extensive experiments have been conducted to evaluate IceQ, and the results show that it is highly effective.