|Abstract:||Providing contractual performance assurances in distributed systems is an important and challenging problem. From the user's perspective, stringent performance requirements are becoming more critical, which increases the need for predictability. Meanwhile, from the system engineer's perspective, distributed systems are driven toward ever larger scale, more integration, and higher complexity, making predictable system performance more difficult to achieve.
In this dissertation, we reconcile the above conflicting trends in the context of achieving temporal predictability in distributed server systems. Predictable timing behavior is becoming increasingly important in wide-area networks. It is expected that the majority of Internet clients in the next decade will be embedded devices. As web-based systems are used in progressively more interactive or critical applications that operate in real time under the constraints of the physical world, renewed interest in timing properties is warranted. In fact, projects such as GENI already promise nation-scale testbeds for a clean-slate Internet redesign. One of the major thrusts fueling the need for such a new design lies in the vision of Internet use in safety-critical applications with real-time constraints. Since GENI and the clean-slate Internet are not yet available, we implement our framework for temporal predictability within the constraints of the current network. To separate out application-independent components, we divide the predictability problem into two parts: generic real-time content distribution services and application-specific time-sensitive end-systems.
Our approach to providing timing predictability is two-fold. First, we investigate proactive mechanisms for ensuring end-to-end timing guarantees in wide-area content distribution systems executing across the current Internet. Second, we investigate reactive mechanisms that diagnose timing violations. Much prior research on QoS addressed the problem of restoring proper timing behavior on end-systems while assuming that their software implementation is correct. In contrast, the increased scale of modern distributed systems creates a need to address software errors or misconfigurations that cause timing violations. Hence, when it comes to the (software-intensive) end-system, our work is directed at self-diagnosis of timing violations rather than at traditional quality of service architectures, which abound in prior work.
Our solution to the first problem above (ensuring end-to-end timing guarantees in content distribution systems) involves a decentralized replication scheme that dynamically selects subsets of the content distribution servers for different classes of content so that per-class network latency bounds are met. The replication decisions are made autonomously by the servers based on dynamically measured network latencies and workload conditions. Content replication proceeds in a way that balances workload among servers, thereby fully utilizing system capacity and avoiding latency bound violations. The efficiency and decentralized nature of the replication scheme enable our solution to scale to very large content distribution networks.
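As a rough illustration only, and not the algorithm developed in the dissertation, the idea of servers deciding locally whether to host a replica can be sketched as follows. Everything in this sketch is a hypothetical assumption made for the example: the `Server` class, the `max_load` threshold, the notion of client "regions", and the `place_replicas` helper, which simply simulates the servers' independent decisions in order of increasing load.

```python
from dataclasses import dataclass


@dataclass
class Server:
    """Local view of one content server (all fields are illustrative assumptions)."""
    name: str
    load: float            # fraction of capacity in use, 0.0 to 1.0
    latency_ms: dict       # client region -> latency measured by this server
    max_load: float = 0.8  # hypothetical threshold above which no replica is accepted

    def regions_within(self, bound_ms):
        """Regions this server can reach within the class latency bound."""
        return {r for r, lat in self.latency_ms.items() if lat <= bound_ms}

    def wants_replica(self, bound_ms, uncovered):
        """Local decision: volunteer only if not overloaded and able to
        serve at least one region that no existing replica covers."""
        if self.load >= self.max_load:
            return False
        return bool(self.regions_within(bound_ms) & uncovered)


def place_replicas(servers, regions, bound_ms):
    """Simulate the servers' autonomous decisions for one content class.

    Lightly loaded servers are asked first, which spreads replicas away
    from busy servers. Returns the chosen server names and any regions
    whose latency bound could not be met.
    """
    uncovered = set(regions)
    chosen = []
    for server in sorted(servers, key=lambda s: s.load):
        if not uncovered:
            break
        if server.wants_replica(bound_ms, uncovered):
            chosen.append(server.name)
            uncovered -= server.regions_within(bound_ms)
    return chosen, uncovered


example_servers = [
    Server("a", 0.9, {"eu": 40, "us": 200}),
    Server("b", 0.3, {"eu": 60, "us": 300}),
    Server("c", 0.5, {"eu": 400, "us": 80}),
]
print(place_replicas(example_servers, ["eu", "us"], bound_ms=100))
```

In this toy setting, each decision uses only a server's own latency measurements and load plus the set of regions not yet covered, which mirrors the autonomy and load-balancing goals described above in a very simplified form.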
Solutions to the second problem (the self-diagnosing capability of end-systems) come from the scalable learning-based performance problem diagnosis techniques we propose. The increasing complexity of systems has motivated the design of machine learning approaches to automate some system management tasks. However, as scale increases, current approaches suffer from serious scalability issues. We present two scalable learning-based techniques that automatically identify probable causes of timing violations in large server systems with multiple tiers and replicated sites. By incorporating more diagnostic information sources using a temporal segmentation mechanism and applying transfer learning techniques, we achieve both scalability and improved diagnosis accuracy.
The service is implemented and deployed on PlanetLab, a realistic wide-area network platform. Our evaluation of the proactive delay guarantee mechanisms demonstrates that sub-second delay bounds can be guaranteed with very high probability, even with very limited or imperfect global knowledge. In addition, we evaluate our reactive mechanisms for automated performance problem diagnosis against three months of production traces from a real-world distributed application. Experimental results verify the efficacy of our approach: both diagnosis accuracy and scalability are significantly improved compared to baseline solutions. The combination of our proactive and reactive time management mechanisms provides an opportunity for building services that manipulate real-time content in wide-area networks.