|Abstract:||Recent research in peer-to-peer and grid computing has made it possible to build Internet-scale services such as content distribution, storage, naming, and publish/subscribe. By using a large number of service nodes that collaborate in a decentralized fashion, such services can potentially achieve high scalability, availability, reliability, and performance (QoS). Despite this potential, building large distributed services and testing them in a real-world, widely distributed environment remains difficult, for two reasons. First, a wide-area environment is prone to network and node failures, so services targeting it must have built-in mechanisms to deal with such failures, and because of the scale of the services, these mechanisms must not rely on centralized control. Second, running services in a wide-area environment requires system support for deploying, monitoring, and controlling them. However, current computing infrastructures generally lack powerful tools for managing widely distributed services, so service developers often have to resort to ad hoc methods for service management.
In this dissertation we present our research aimed at simplifying the development of large-scale distributed services. Because large-scale services are a special class of large distributed applications, we focus on the challenges involved in the design, implementation, and management of such applications. We first present OCMA, a layered architecture for designing large distributed applications. OCMA divides such applications into three layers: the membership layer, which keeps up-to-date information about other nodes in the system; the overlay layer, which builds and maintains the overlay structure; and the application layer, which carries out the application-specific processing. This functional decomposition not only simplifies the design of service applications, but also facilitates the reuse of components (layers) and innovation within each component. For example, we have designed two large distributed applications: the DagStream system for locality-aware P2P streaming and the Management Overlay Networks (MON) system for distributed management. Both are designed according to the OCMA architecture, and both explore novel techniques for some of their layers.
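The three-layer split can be pictured with a minimal C++ sketch. All class and method names below (`MembershipLayer`, `OverlayLayer`, `ApplicationLayer`, `neighbors`, `forward`) are hypothetical and chosen only for illustration; they are not taken from OCMA, and the neighbor-selection rule is a deliberately trivial assumption.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

using NodeId = int;  // hypothetical node identifier

// Membership layer: keeps an up-to-date view of the other nodes.
class MembershipLayer {
public:
    void node_joined(NodeId id) {
        if (std::find(known_.begin(), known_.end(), id) == known_.end())
            known_.push_back(id);
    }
    void node_left(NodeId id) {
        known_.erase(std::remove(known_.begin(), known_.end(), id),
                     known_.end());
    }
    const std::vector<NodeId>& known_nodes() const { return known_; }

private:
    std::vector<NodeId> known_;
};

// Overlay layer: derives the overlay structure from the membership view.
// Assumed policy: take the first k known nodes other than ourselves.
class OverlayLayer {
public:
    OverlayLayer(NodeId self, std::size_t k) : self_(self), k_(k) {
        assert(k > 0);
    }
    std::vector<NodeId> neighbors(const MembershipLayer& m) const {
        std::vector<NodeId> out;
        for (NodeId id : m.known_nodes()) {
            if (id != self_ && out.size() < k_) out.push_back(id);
        }
        return out;
    }

private:
    NodeId self_;
    std::size_t k_;
};

// Application layer: application-specific processing on top of the overlay.
// Here it only formats one outgoing message per overlay neighbor.
class ApplicationLayer {
public:
    std::vector<std::string> forward(const std::vector<NodeId>& nbrs,
                                     const std::string& payload) const {
        std::vector<std::string> out;
        for (NodeId id : nbrs)
            out.push_back("to " + std::to_string(id) + ": " + payload);
        return out;
    }
};
```

Because each layer depends only on the interface of the layer beneath it, the neighbor-selection policy in this sketch could be replaced without touching the membership or application code, which is the kind of reuse the decomposition is meant to allow.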
Through the implementation of multiple large scale applications, we have extracted a C++ framework called PPF (Protocol Plugin Framework) that can be reused to implement large distributed applications. Using PPF, application developers only need to implement the high level protocol between different application nodes. When the protocol is plugged into PPF, the same code can run in both simulation and real world mode. This minimizes the possibility of introducing bugs when porting simulation code to real world deployment.
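One way to picture how a single protocol implementation can run in both modes is to write the protocol against an abstract transport interface. The sketch below is not PPF's actual API: `Transport`, `Protocol`, `PingPong`, and `SimTransport` are hypothetical names, and only an in-memory simulation transport is shown. A real-world mode would, under the same assumption, supply a socket-backed `Transport` while leaving the protocol class unchanged.

```cpp
#include <cassert>
#include <deque>
#include <map>
#include <string>

using NodeId = int;  // hypothetical node identifier

struct Message {
    NodeId from;
    NodeId to;
    std::string body;
};

// The only view of the environment that protocol code is given.
class Transport {
public:
    virtual ~Transport() = default;
    virtual void send(const Message& m) = 0;
};

// A protocol plugin reacts to incoming messages and may send replies.
class Protocol {
public:
    virtual ~Protocol() = default;
    virtual void on_message(const Message& m, Transport& t) = 0;
};

// Example plugin: answers "ping" with "pong" and counts pongs received.
class PingPong : public Protocol {
public:
    explicit PingPong(NodeId self) : self_(self) {}
    void on_message(const Message& m, Transport& t) override {
        if (m.body == "ping") {
            t.send({self_, m.from, "pong"});
        } else if (m.body == "pong") {
            ++pongs_;
        }
    }
    int pongs() const { return pongs_; }

private:
    NodeId self_;
    int pongs_ = 0;
};

// Simulation-mode transport: messages go through an in-memory FIFO queue.
class SimTransport : public Transport {
public:
    void attach(NodeId id, Protocol* p) {
        assert(p != nullptr);
        nodes_[id] = p;
    }
    void send(const Message& m) override { queue_.push_back(m); }

    // Delivers queued messages until none remain; returns the number of
    // messages handed to a protocol instance.
    int run() {
        int delivered = 0;
        while (!queue_.empty()) {
            Message m = queue_.front();
            queue_.pop_front();
            auto it = nodes_.find(m.to);
            if (it == nodes_.end()) continue;  // unknown destination: drop
            it->second->on_message(m, *this);
            ++delivered;
        }
        return delivered;
    }

private:
    std::map<NodeId, Protocol*> nodes_;
    std::deque<Message> queue_;
};
```

In this sketch `PingPong` never refers to `SimTransport` directly, so moving from simulation to deployment would mean swapping the transport object rather than editing protocol code.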
MON is not just an example application designed according to OCMA, but also a simple, scalable and lightweight tool we have built for managing service applications running in a wide area environment. MON facilitates the management of such applications by building short-term, on-demand overlay networks that can be used to instantly query and control the distributed application status. Such distributed status query and control allows application developers to quickly detect, diagnose and correct potential application problems. In addition to MON, we have also explored algorithms that can adaptively combine continuous monitoring and dynamic query in order to minimize information management overhead.
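The idea of an on-demand overlay used for a one-shot status query can be sketched as follows. This is an illustration under stated assumptions rather than MON's algorithm: the overlay is assumed to be a spanning tree built by breadth-first search from the querying node, the status is assumed to be a single number per node, and the aggregate is assumed to be a sum. The names `QueryTree`, `build_tree`, and `aggregate` are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <queue>
#include <vector>

// A short-lived tree rooted at the querying node.
struct QueryTree {
    std::vector<int> parent;  // parent[i], or -1 for the root/unreachable
    std::vector<int> order;   // reachable nodes in BFS order, root first
};

// Builds the tree on demand over the current neighbor lists `adj`.
QueryTree build_tree(const std::vector<std::vector<int>>& adj, int root) {
    assert(root >= 0 && static_cast<std::size_t>(root) < adj.size());
    QueryTree t;
    t.parent.assign(adj.size(), -1);
    std::vector<bool> seen(adj.size(), false);
    std::queue<int> q;
    seen[root] = true;
    q.push(root);
    while (!q.empty()) {
        int u = q.front();
        q.pop();
        t.order.push_back(u);
        for (int v : adj[u]) {
            if (!seen[v]) {
                seen[v] = true;
                t.parent[v] = u;
                q.push(v);
            }
        }
    }
    return t;
}

// Sums per-node status values up the tree, children before parents.
// Nodes that were not reached by build_tree do not contribute.
long aggregate(const QueryTree& t, const std::vector<long>& status) {
    assert(status.size() == t.parent.size());
    if (t.order.empty()) return 0;
    std::vector<long> sum(status);
    for (auto it = t.order.rbegin(); it != t.order.rend(); ++it) {
        int p = t.parent[*it];
        if (p >= 0) sum[p] += sum[*it];
    }
    return sum[t.order.front()];
}
```

Since the tree is rebuilt for each query in this sketch, a node that has failed or become unreachable simply drops out of the next result, and no long-lived structure has to be repaired.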
The major contributions of this dissertation are as follows. First, we present a layered architecture called OCMA, together with several design techniques, such as on-demand overlay construction and control plane services, for designing large-scale distributed applications. Second, we provide PPF, a C++ framework that can be reused to implement such applications. Third, we build MON, a powerful tool for dynamically querying and controlling the status of distributed service applications. Such dynamic query and control can facilitate the detection and diagnosis of potential application problems.