Abstract: Computer systems today are managed by human administrators who must continuously observe the system, analyze its behavior, and activate corrective actions (generally referred to as the Observe-Analyze-Act, or OAA, loop). Automating the OAA loop within real-world systems is a non-trivial problem, but the growing economic incentive for making systems self-managing and the significant increase in available computation bandwidth have made OAA automation a promising area of research. The existing approaches to OAA automation can be characterized as policy-based, feedback-based, empirical or learning-based, or model-based. The available solutions suffer from complexity, brittleness, and slow convergence, and have been useful for automating only trivial management scenarios.
This thesis proposes Polus: a methodology for OAA automation using a model-based approach with integrated learning and feedback. Polus uses models of system behavior to decide which corrective action to invoke: it continuously refines the models using monitoring data, exhaustively searches for an optimal corrective action using constrained optimization, and executes the selected action using a variably aggressive feedback loop. The core architecture of Polus closely resembles that of an expert system: a knowledge base of models for components, workloads, and actions, and a reasoning engine that selects and executes a "feasible" action at run-time.
The Polus methodology covers four aspects: representation of domain-specific details as models; automated creation and evolution of these models; run-time decision-making for the corrective action(s) to be invoked; and handling of divergent system behavior during action execution. Polus is the first of its kind in using a model-based approach for OAA automation. By applying the following operational principles, Polus addresses the challenges of model inaccuracy in real-world systems and the computational complexity of decision-making: 1) Models do not need to be perfectly accurate; they only need to be accurate enough to maintain the relative ordering of actions during action selection. 2) The objective of action selection is not to find the optimal action, but rather to avoid the worst ones. 3) Creation of models is not a one-time activity; it is a continuous process over the lifetime of the system.
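Principles 1 and 2 can be illustrated with a minimal sketch. This is not the Polus implementation: the `Action` type, the `predicted_utility` model, the `select_action` helper, and all numeric values are hypothetical, chosen only to show that a systematically skewed model still ranks actions in the same order as the true utilities, and that selection can proceed by discarding the worst-ranked candidates.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Action:
    name: str
    true_utility: float  # hypothetical ground truth, unknown to the engine


def predicted_utility(action: Action, bias: float = 0.7, offset: float = -2.0) -> float:
    """Hypothetical inaccurate model: skewed in absolute terms, but monotonic."""
    return bias * action.true_utility + offset


def select_action(actions, model, drop_fraction=0.5):
    """Rank actions by the model's prediction and discard the worst-ranked fraction.

    Returns the top-ranked survivor and the full list of survivors.
    """
    ranked = sorted(actions, key=model, reverse=True)
    keep = max(1, int(len(ranked) * (1 - drop_fraction)))
    survivors = ranked[:keep]
    return survivors[0], survivors


# Hypothetical candidate actions with made-up utilities.
actions = [
    Action("throttle_A", 5.0),
    Action("throttle_B", 9.0),
    Action("no_op", 1.0),
    Action("throttle_C", 3.0),
]

chosen, survivors = select_action(actions, predicted_utility)

# The skewed model yields the same ordering as the true utilities.
by_model = [a.name for a in sorted(actions, key=predicted_utility, reverse=True)]
by_truth = [a.name for a in sorted(actions, key=lambda a: a.true_utility, reverse=True)]
```

In this sketch the predicted values are all wrong in absolute terms, yet the ranking is preserved because the model error is monotonic; any of the survivors would be an acceptable choice under principle 2.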
The Polus approach was built and evaluated as an OAA framework for a production storage system with a limited set of corrective actions. The prototype implementation (referred to as Chameleon) is a resource arbitrator that manages the assignment of available storage resources to the host workloads. This assignment must ensure that as few workloads as possible fail to meet their behavior goals (a failure being a QoS violation). Chameleon optimizes the overall system utility by automatically invoking the throttle and unthrottle corrective actions. In our experiments, Chameleon identified, analyzed, and corrected performance violations in 3-14 minutes, which compares very favorably with the time a human administrator would have needed. Further, the self-evolving aspect of Chameleon facilitated deployment for large-scale storage systems that service variable workloads on an ever-changing mix of device types.
This thesis is a starting point for applying model-based OAA automation to production systems. We have demonstrated the feasibility of our approach in the context of action sets that have a relatively low resource overhead for invocation, and whose effects can be easily reversed. Several research issues remain to be addressed before Polus can be applied to systems with a wider cost-benefit spectrum of corrective actions; we discuss these issues within the context of the existing design details of Polus.