Abstract: The rapid growth of distributed data has fueled significant interest in building data integration systems. However, developing these systems today still requires an enormous amount of labor from system builders. Several nontrivial tasks must be performed, such as wrapper construction and mapping between schemas. Moreover, in dynamic environments such as the Web, sources often undergo changes that break the system, requiring the builder to continually invest maintenance effort. This has resulted in a very high cost of ownership for integration systems and has severely limited their deployment in practice.
In this thesis I investigate three approaches to reducing this cost. First, I follow the approach taken by previous work and develop a tool that automates a key bottleneck task. In particular, I develop MAVERIC, an automatic solution for detecting broken mappings. An extensive empirical evaluation shows that MAVERIC outperforms previous work, alleviating the need for the system builder to continually and exhaustively monitor the system for broken sources.
However, a commonality across previous work is that integration tools often require human intervention to correct mistakes and build functioning systems. As a result, there has been a persistent need for effort from system builders. Thus, in this thesis I investigate a conceptually new approach, mass collaboration, to the key integration task of schema matching. As far as I know, this is the first work to apply a Web 2.0-style collaborative approach to schema matching. Experiments show that MOBS, my implementation of this idea, enables non-expert users to improve the accuracy of matching tools, in turn significantly reducing builder workload.
While the previous two directions reduce integration costs by improving the performance of automatic tools (either by improving the tool itself or by leveraging users to boost tool accuracy), the last direction explored in this thesis attacks data integration costs at their foundation: rigidity. The current data integration system model imposes a very rigid structure on its components and on the data that is passed between them. For example, wrappers are responsible for extracting precise structured data, allowing traditional structured query processing techniques to compute the query result. My third direction explores how far these assumptions can be relaxed, thereby allowing us to answer queries without incurring costs that the traditional model requires but that a given query may not need (e.g., building full-fledged wrappers). In this thesis I investigate this idea within the context of supporting one-time, on-the-fly queries over distributed Web data. I develop and evaluate SLIC, a system that allows a user to quickly pose SQL queries over multiple sources (after only some minimal preprocessing), obtain initial results, then iterate with the system to get increasingly better results. The fundamental idea is to learn only as much structure as necessary to answer a given query. Extensive experiments on real-world domains show that for many practical queries SLIC is significantly faster than current methods. SLIC thus provides a promising first step toward a principled solution for lazy, on-the-fly integration of Web data, and will hopefully spark interest in removing some of the fundamental costs inherent in the traditional integration system model.