|Abstract:||Dependability is becoming a requirement in an increasing number of domains, including those that were previously thought to be noncritical. Examples include large distributed systems deployed in domains such as e-commerce, information mining, messaging, and entertainment. Such systems pose a challenge to existing fault tolerance approaches because they require low-cost solutions that can be adapted to work with off-the-shelf components. At the same time, their scale makes it difficult to accurately diagnose faults and recover from them.
This dissertation proposes a model-based approach to building a theoretically well-founded recovery framework based on partially observable Markov decision processes (POMDPs). The framework is inexpensive to deploy, can cope with a variety of recovery mechanisms, and can tolerate system monitoring that may be imperfect, imprecise, or conflicting. At the same time, it generates recovery decisions that ensure that recovery will be stable, provide guarantees on the success of the recovery, and recover the system while incurring as low a cost as possible, thus approximating optimality.
We are unaware of any other framework for recovery in distributed systems that integrates monitoring and recovery in an iterative manner, is able to deal with imprecise system states and selectively choose actions that either gather information or make progress towards recovery, and generates recovery policies that minimize costs over entire sequences of recovery actions. We have implemented our approach in a tool called the "Adaptation and Recovery Management framework." We demonstrate that this tool can be used to provide diagnosis and recovery capabilities in practical information systems.
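As an illustration of how a POMDP-style controller can work with imperfect monitoring, the following minimal sketch shows a Bayesian belief update over fault hypotheses after one monitor reading. It is not taken from the dissertation or its tool: the state names, the monitor, and all probabilities are hypothetical, and the function `update_belief` is an illustrative helper only.

```python
from typing import Dict

Belief = Dict[str, float]


def update_belief(belief: Belief, likelihood: Dict[str, float]) -> Belief:
    """Apply Bayes' rule: posterior(s) is proportional to P(obs | s) * prior(s)."""
    unnormalized = {s: belief[s] * likelihood[s] for s in belief}
    total = sum(unnormalized.values())
    if total == 0.0:
        raise ValueError("observation has zero probability under this belief")
    return {s: p / total for s, p in unnormalized.items()}


# Hypothetical fault hypotheses and prior belief (assumed values).
prior = {"healthy": 0.5, "db_crashed": 0.25, "web_hung": 0.25}

# Hypothetical imperfect monitor: P(web ping times out | state).
ping_timeout = {"healthy": 0.1, "db_crashed": 0.1, "web_hung": 0.9}

# Belief after observing one ping timeout.
posterior = update_belief(prior, ping_timeout)
print(posterior)
```

In a controller of this kind, the updated belief would be the input to the next decision: either run another monitor to sharpen the diagnosis or apply a recovery action, depending on which has the lower expected cost over the remaining sequence of steps.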