Files in this item



Tucek_Joseph.pdf (application/pdf, 916kB)
(no description provided)


Title: Addressing production run failures dynamically
Author(s): Tucek, Joseph
Director of Research: Zhou, Yuanyuan
Doctoral Committee Member(s): Zhou, Yuanyuan; Sanders, William H.; King, Samuel T.; Song, Dawn
Department / Program: Computer Science
Discipline: Computer Science
Degree Granting Institution: University of Illinois at Urbana-Champaign
Subject(s): software reliability; operating systems; delta execution; flash worm
Abstract: The high complexity of modern software, and our pervasive reliance on that software, has made the problems of software reliability increasingly important. Yet despite advances in software engineering practice, pre-release testing, and automated analysis, reports of high-profile production failures are still common. This dissertation proposes several run-time techniques to analyze and alleviate software failures dynamically, during production runs. The first technique is low overhead checkpoint, rollback, and re-execution. By allowing a window of time in which a period of execution can be relived, low overhead checkpointing allows expensive analytical steps to be saved for only when they are needed. The second technique is a collection of dynamically insertable run-time analysis tools, which can use information gleaned over multiple analytical runs of the same execution to incrementally build a picture of a production run failure more completely than any individual analysis could. Finally, based on my experience with the behavior of programs under failure, and the underlying causes of said failures, this dissertation introduces the concept of, and provides a run time which supports, delta execution. Delta execution is the process of running more than one instance or version of a program, while sharing the majority of issued instructions and state. This dissertation uses delta execution specifically to validate software patches at production run time. These three techniques have been demonstrated in three implemented systems supporting various end-level reliability goals. The first system, called Sweeper, is a run-time defensive system against security bugs. Low overhead checkpointing captures system state until an intrusion tripwire notices an anomaly.
The system can then roll back to perform a more thorough (and expensive) analysis of past execution to determine the nature of the exploit. Because of the low overhead of the initial checkpointing, the barriers to widespread deployment are low. Further, because Sweeper can still perform more complex analysis, there is the opportunity to generate strong protective measures, like vulnerability specific execution filters (or VSEFs), which can effectively stop a worm infestation. The implemented Sweeper system imposes only 1% overhead in ordinary operation, and can generate an effective protective measure in only 60 milliseconds. From an analytic model, this is sufficient to minimize the spread of a fast worm to only 5% of the susceptible hosts, even for a worm which spreads 10,000 times faster than any previously observed in the wild. The second system is called Triage. Rather than improving reliability by improving security, Triage attempts to enable the improvement of the underlying code by automating failure diagnosis of production run systems. Production run failures are difficult to address. Such failures commonly are irreproducible in a development environment due to workload or scale issues. As they occurred in a production run, these are clearly faults which were not caught and fixed by the developer’s standard pre-release testing. Finally, production runs have stringent restrictions on overhead and privacy. Hence giving the programmer enough insight into the failure to implement a patch is challenging. Triage addresses this by performing failure diagnosis post-hoc at the end-user’s site. Low overhead checkpointing allows the capture of a failing execution, so expensive analysis can be deferred until it is definitely needed. Repeated replay allows the incremental application of a variety of failure analysis techniques, similar to the process a human programmer may undertake.
For analysis which generally takes direction from a human, Triage substitutes the results of previous analytical steps. Overall, Triage imposes only 5% overhead in failure free execution, and, if a failure occurs, all of the analysis which requires re-execution is complete within about 5 minutes. In a study with human programmers, the output of Triage analysis reduced the time to patch real software faults by 45%. The third system presented in this dissertation deals with the problems introduced when programmers make changes. Despite testing before release, a large number of software patches are released buggy. Indeed, software patches are generally of such poor quality that to optimize uptime it is better to delay applying even security patches while others act as “volunteer” beta testers uncovering the faults which made it through the vendor’s quality control. However, as Triage’s novel delta analysis diagnostic tool shows, the difference between correct and buggy execution can be minimal. Indeed, a manual study of software patches described in this dissertation shows that many patches should not create large changes in the underlying execution. Hence this dissertation proposes delta execution. If the executions (in terms of instruction streams and data) of the patched and unpatched versions of a program are mostly identical, then it is possible to run both versions mostly in one instruction stream. Only rarely, when the executions do differ, is it necessary to run two sets of instructions. By only running the differing, or delta, segments separately, delta execution allows low overhead production run patch validation which is 12% faster than side-by-side patch validation. Further (and perhaps more important), many of the effects which make patch validation difficult (multithreading, timing sensitivity, and system level nondeterminism) are nullified as they affect the two logical executions inside the one physical execution identically.
This dissertation shows that, of ten applications tested, delta execution can validate all of the patches, while traditional side-by-side validation only manages to validate two.
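
As a rough illustration of the delta execution idea described in the abstract, the following toy sketch runs an unpatched and a patched "program" over shared state, executes identical steps only once, and splits into two states only where the versions differ. It is a simplified model under stated assumptions, not the dissertation's run time: programs are modeled as step-aligned lists of Python functions rather than instruction streams, and every name here (`delta_execute`, `read_input`, `compute_old`, `compute_new`, `report`) is hypothetical.

```python
import copy


def delta_execute(old_steps, new_steps, initial_state):
    """Toy model: run two step-aligned program versions, sharing identical steps.

    Returns (old_final_state, new_final_state, split_steps), where split_steps
    counts the steps that had to be executed separately for each version.
    """
    if len(old_steps) != len(new_steps):
        raise ValueError("this toy model assumes step-aligned versions")

    merged = copy.deepcopy(initial_state)  # single state while executions agree
    old_state = new_state = None           # separate states while they differ
    split_steps = 0

    for old_step, new_step in zip(old_steps, new_steps):
        if merged is not None and old_step is new_step:
            old_step(merged)               # shared work: executed once for both
            continue
        if merged is not None:
            # Versions are about to differ: fork the shared state.
            old_state, new_state = merged, copy.deepcopy(merged)
            merged = None
        old_step(old_state)
        new_step(new_state)
        split_steps += 1
        if old_state == new_state:
            # States reconverged: go back to a single shared execution.
            merged, old_state, new_state = old_state, None, None

    if merged is not None:
        return merged, copy.deepcopy(merged), split_steps
    return old_state, new_state, split_steps


# Hypothetical example program: only the middle step is patched.
def read_input(s):
    s["x"] = 7

def compute_old(s):
    s["y"] = s["x"] * 2

def compute_new(s):
    s["y"] = s["x"] + s["x"]   # patched code path, same observable result

def report(s):
    s["out"] = f"y={s['y']}"

old_final, new_final, split_steps = delta_execute(
    [read_input, compute_old, report],
    [read_input, compute_new, report],
    {},
)
print(old_final == new_final, split_steps)
```

In this example only the patched step runs twice; the steps before and after it run once on the shared state. A final state that differed between the two versions would indicate that the patch changed observable behavior.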
Issue Date: 2012-02-06
Genre: Dissertation / Thesis
Rights Information: Copyright 2011 Joseph Tucek
Date Available in IDEALS: 2012-02-06
Date Deposited: 2011-12
