Files in this item



Seo_Eunsoo.pdf (application/pdf, 6MB)


Title:Failure diagnosis in distributed systems
Author(s):Seo, Eunsoo
Director of Research:Abdelzaher, Tarek F.
Doctoral Committee Chair(s):Abdelzaher, Tarek F.
Doctoral Committee Member(s):Han, Jiawei; Vaidya, Nitin H.; Ko, Steven
Department / Program:Computer Science
Discipline:Computer Science
Degree Granting Institution:University of Illinois at Urbana-Champaign
Subject(s):Bug Diagnosis; Concurrency Bugs; Error Propagation
Abstract:Failures in computing systems are unavoidable. Therefore, it is important to detect and diagnose failures early to improve system reliability. In this dissertation, new approaches to root-cause diagnosis are introduced for two notorious types of failures in distributed systems. This dissertation first focuses on failures caused by software bugs that are triggered by race conditions. Because they manifest non-deterministically, these bugs are much harder to diagnose, fix, and test than bugs in sequential logic. To understand concurrency bugs, we first study their characteristics using 105 bugs from four representative open-source programs. Motivated by the findings of this study, we propose an automatic bug diagnosis tool for distributed programs that finds the minimal causal orders of related events that trigger the bugs. Our tool is a significant extension of previous tools, which can find only a bug-triggering sequence of events. The second focus of this dissertation is on failures caused by propagating errors. An error that starts at a single network component propagates and contaminates other components. As a result, a large number of network components are infected by errors. To fix the problem, its root cause, the single component that started the error propagation, needs to be identified. It is assumed that only a limited view of the status of components -- whether they are infected or not -- is available through monitors, a set of pre-selected network components. For this problem, we propose two root-cause diagnosis tools. The first tool relies on the simple intuition that the root-cause component is likely to be close to the infected monitors and far from the uninfected monitors. We also compare six different monitor selection methods.
The second tool uses additional information -- failure propagation probability and time of infections -- to improve the accuracy of root-cause diagnosis. We propose approximation algorithms to calculate the likelihood that a node is the failure source. We also propose a new monitor selection algorithm that maximizes the number of infected monitors for the best accuracy of root-cause diagnosis.
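The intuition behind the first tool (the root cause is close to infected monitors and far from uninfected ones) can be illustrated with a minimal sketch. This is not the dissertation's algorithm: the scoring rule (sum of hop distances to uninfected monitors minus sum of hop distances to infected monitors), the undirected adjacency-list graph, and the names `bfs_distances` and `rank_root_cause_candidates` are all assumptions made for illustration only.

```python
from collections import deque


def bfs_distances(graph, source):
    """Return hop distances from `source` to every reachable node."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for neighbor in graph[node]:
            if neighbor not in dist:
                dist[neighbor] = dist[node] + 1
                queue.append(neighbor)
    return dist


def rank_root_cause_candidates(graph, infected_monitors, uninfected_monitors):
    """Rank nodes from most to least likely root cause (hypothetical score).

    Assumes an undirected graph, so the distance from a monitor to a node
    equals the distance from that node to the monitor.
    """
    unreachable = len(graph)  # larger than any real hop distance
    monitors = set(infected_monitors) | set(uninfected_monitors)
    from_monitor = {m: bfs_distances(graph, m) for m in monitors}

    scores = {}
    for node in graph:
        near = sum(from_monitor[m].get(node, unreachable) for m in infected_monitors)
        far = sum(from_monitor[m].get(node, unreachable) for m in uninfected_monitors)
        # High score: far from uninfected monitors, close to infected ones.
        scores[node] = far - near
    return sorted(graph, key=lambda n: scores[n], reverse=True)


# Toy path network a - b - c - d - e with three monitors.
network = {
    "a": ["b"],
    "b": ["a", "c"],
    "c": ["b", "d"],
    "d": ["c", "e"],
    "e": ["d"],
}
ranking = rank_root_cause_candidates(network, ["a", "b"], ["e"])
print(ranking[0])  # prints "a"
```

In this toy network the infected monitors sit at one end of the path and the uninfected monitor at the other, so the top-ranked candidate is the node at the infected end. The second tool described above would additionally weigh propagation probabilities and infection times, which this sketch ignores.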
Issue Date:2012-09-18
Rights Information:Copyright 2012 Eun Soo Seo
Date Available in IDEALS:2012-09-18
Date Deposited:2012-08