Files in this item



Wang_Long.pdf (application/pdf, 2MB)


Title: Providing application-aware reliability through OS/hypervisor-level techniques
Author(s): Wang, Long
Director of Research: Iyer, Ravishankar K.
Doctoral Committee Chair(s): Iyer, Ravishankar K.
Doctoral Committee Member(s): Lumetta, Steven S.; Parthasarathy, Madhusudan; Vasudevan, Shobha
Department / Program: Electrical & Computer Eng
Discipline: Electrical & Computer Engr
Degree Granting Institution: University of Illinois at Urbana-Champaign
Subject(s): reliability; virtual machine; system hang; operating system; error detection; error injection
Abstract: Operating systems and hypervisors enable the collection and extraction of rich information on application and system execution characteristics. This thesis describes a Reliability MicroKernel (RMK) architecture, an infrastructure for designing and deploying software modules that provide application-aware error detection and recovery. The purpose of the RMK is to provide an automatic approach for low-latency crash/hang detection and rapid recovery via checkpointing. We first demonstrate how the RMK works in a native system and then enhance the RMK to work in VMs. In a native system, the RMK is installed as a device driver, while in a virtualized system, the RMK is both installed as a device driver in VMs and deployed as a hypercall (the hypervisor analogue of a system call) in the hypervisor. Our approach is transparent to applications and VMs, i.e., the kernel source code of a native system or a VM does not need to be modified or recompiled. The implemented RMK modules include OS/application crash detection, system hang detection, and transparent checkpointing. Traditionally, an external hardware watchdog is used to force a system reboot whenever the watchdog is not reset within a predefined timeout interval. The detection latency can be significant because the timeout interval for resetting the watchdog timer is usually on the order of seconds, to reduce false alarms. The approach in this thesis enables low-latency OS-hang detection (within hundreds of milliseconds or less) by measuring the number of instructions executed between two consecutive context switches and checking whether the count exceeds a predefined threshold. The RMK is enhanced to support virtualized environments. Specifically, we present the description, implementation, and experimental assessment of VM-μCheckpoint, a VM checkpointing framework that protects both the guest OS and applications against runtime errors.
Compared with existing VM checkpoint techniques, VM-μCheckpoint has small overhead and rapid recovery, handles non-fail-stop errors, and runs at high frequency (tens of checkpoints per second) to reduce the recomputation necessary when recovering a VM from a failure. The key point of VM-μCheckpoint is that we do an incremental checkpoint by considering the whole memory of the protected VM as part of the checkpoint. The RMK prototype has been implemented in both Linux and Windows systems on a Pentium 4 processor and is also implemented in the Xen VMM. (The Xen hypervisor is recompiled to install the RMK, but the OS of a native system or a VM is not recompiled.) Error injection experiments show that the RMK detects all the crashes and system hangs, and VM-μCheckpoint successfully recovers VMs from all the crashes. Moreover, the experimental evaluation of the RMK using real-world applications shows that we achieve high coverage and low false-positive rates for error detection (e.g., no false positives for system hang detection) as well as low overhead in providing checkpoint and recovery (e.g., an average of 6.3% overhead in VM-μCheckpoint for SPEC benchmark programs with 50 ms checkpoint intervals). We also apply a formal method and analytical/probabilistic models to verify the capability of our system hang detection and to study the availability enhancement provided by the RMK.
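The hang-detection rule stated in the abstract (count the instructions executed since the last context switch and compare the count against a predefined threshold) can be illustrated with a small user-space simulation. This is only a sketch under stated assumptions, not the RMK implementation: the class name `HangDetector`, its methods, and the threshold value are hypothetical, and the instruction counts are passed in as plain integers because the abstract does not say how the RMK obtains them.

```python
from dataclasses import dataclass


@dataclass
class HangDetector:
    """Hypothetical model of threshold-based hang detection."""

    threshold: int                 # assumed instruction budget between context switches
    count_at_last_switch: int = 0  # instruction count sampled at the last context switch

    def on_context_switch(self, instruction_count: int) -> None:
        # A context switch shows the system is still scheduling work,
        # so the measurement window restarts here.
        self.count_at_last_switch = instruction_count

    def is_hung(self, instruction_count: int) -> bool:
        # Instructions executed since the most recent context switch.
        executed = instruction_count - self.count_at_last_switch
        return executed > self.threshold


detector = HangDetector(threshold=1_000_000)   # illustrative value only
detector.on_context_switch(500_000)
within_budget = detector.is_hung(900_000)      # 400,000 instructions since the switch
over_budget = detector.is_hung(2_000_000)      # 1,500,000 instructions since the switch
```

In this model `within_budget` is `False` and `over_budget` is `True`. The choice of threshold trades detection latency against false alarms, which mirrors the watchdog-timeout trade-off described in the abstract.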
Issue Date: 2011-01-14
Rights Information: Copyright 2010 Long Wang
Date Available in IDEALS: 2011-01-14
Date Deposited: December 2