Troubleshooting interactive complexity bugs

Bookmark or cite this item: http://hdl.handle.net/2142/29503

Files in this item

File: khan_mohammad.pdf (PDF, 2MB; no description provided)
Title: Troubleshooting interactive complexity bugs
Author(s): Khan, Mohammad M.
Director of Research: Abdelzaher, Tarek F.
Doctoral Committee Chair(s): Abdelzaher, Tarek F.
Doctoral Committee Member(s): Han, Jiawei; Sha, Lui; Liu, Jie
Department / Program: Computer Science
Discipline: Computer Science
Degree Granting Institution: University of Illinois at Urbana-Champaign
Degree: Ph.D.
Genre: Dissertation
Subject(s): interactive complexity bugs; discriminative sequence mining; troubleshooting
Abstract: The term “interactive complexity” was introduced by Charles Perrow in his famous book Normal Accidents: Living with High-Risk Technologies [1]. He used the term to describe the tendency of components to interact in systems with a large number of components. He argued that, in such systems, multiple failures often interact in unexpected ways, leading to catastrophic failures in systems such as planes or nuclear power plants. He also suggested that with increasing interactive complexity and tight coupling, unexpected interactions of failures are bound to happen. Indeed, with the proliferation of cheap, Internet-enabled embedded devices with built-in sensors and actuators (e.g., smart phones, smart appliances), the physical world is increasingly becoming an integral part of the logical world of computation. As computing systems become much more interactive and responsive to their surrounding physical environments, it is becoming increasingly difficult to test such systems fully before deployment in the real world. Hence, due to increased interactive complexity and tight coupling between the physical and logical worlds, such systems often fail or perform poorly once deployed in real life. Unintended interactions among various system components, or between computing systems and physical environments, are often to blame. With this growing trend, the bugs that arise from interactions among distributed components across multiple nodes are likely to get worse and to affect system reliability significantly. This calls for new tools and techniques to troubleshoot future software systems. In this dissertation, we address this significant challenge of troubleshooting interactive complexity bugs in emerging cyber-physical systems using data mining techniques.
More specifically, we applied a discriminative sequence mining algorithm to system logs to isolate chains of events (not necessarily contiguous) that are causally correlated with failure. In the first part of our thesis, using our tool, we successfully identified multiple bugs in various real systems, including a multi-channel MAC (medium access control) layer protocol for wireless sensor networks [2], a kernel-level race condition bug in the LiteOS operating system, and a corner-case design flaw in the directed diffusion protocol [3]. Next, we extended our approach to identify “symbolic” patterns, where absolute values are replaced with abstract symbols whenever appropriate, to identify more subtle patterns across multiple system logs. We then examined the applicability of our approach to troubleshooting harmful interactive complexity that may arise from poor integration of adaptive components in server clusters. More specifically, we extended our approach to identify “cyclic” patterns in data center applications, which potentially highlight self-reinforcing loops. Finally, to complement our work on troubleshooting interactive complexity, we address the challenge of diagnosing occasional “lack of interaction” in deployed systems. Such “lack of interaction” is often caused by unresponsive nodes. We develop the tele-diagnostic powertracer, an in-situ troubleshooting tool that uses external power measurements to determine the internal health condition of an unresponsive host and the most likely cause of its failure. Using our tool, we successfully distinguish between several categories of failures that cause unresponsive behavior, including energy depletion, antenna damage, radio disconnection, system crashes, and anomalous reboots. To the best of our knowledge, we are the first to present a diagnostic tool that uses power measurements to diagnose sensor system failures remotely.
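The general idea of discriminative sequence mining described in the abstract can be illustrated with a minimal sketch. It is not the dissertation's algorithm: it enumerates short event chains by brute force and ranks them by how much more often they occur in failing logs than in passing logs. The function names, the `max_len` and `min_gap` parameters, and the event names in the sample logs are all hypothetical, and both log collections are assumed to be non-empty.

```python
from itertools import combinations


def subsequences(events, max_len):
    """Collect every ordered, possibly non-contiguous chain of up to max_len events."""
    found = set()
    for k in range(1, max_len + 1):
        # combinations() yields index tuples in increasing order, so event order is kept.
        for idx in combinations(range(len(events)), k):
            found.add(tuple(events[i] for i in idx))
    return found


def discriminative_patterns(failing_logs, passing_logs, max_len=3, min_gap=0.5):
    """Rank event chains by (support in failing logs) minus (support in passing logs)."""
    fail_sets = [subsequences(log, max_len) for log in failing_logs]
    pass_sets = [subsequences(log, max_len) for log in passing_logs]
    candidates = set().union(*fail_sets)

    scored = []
    for pattern in candidates:
        fail_support = sum(pattern in s for s in fail_sets) / len(fail_sets)
        pass_support = sum(pattern in s for s in pass_sets) / len(pass_sets)
        gap = fail_support - pass_support
        if gap >= min_gap:
            scored.append((gap, pattern))

    # Largest gap first; among ties, prefer longer (more specific) chains.
    scored.sort(key=lambda item: (-item[0], -len(item[1]), item[1]))
    return scored


# Hypothetical logs: each log is a list of event names in order of occurrence.
failing = [
    ["send", "ack_lost", "retry", "timeout"],
    ["send", "chan_switch", "ack_lost", "timeout"],
]
passing = [
    ["send", "ack", "done"],
    ["send", "retry", "ack", "done"],
]

for gap, pattern in discriminative_patterns(failing, passing):
    print(f"{gap:.2f}  {' -> '.join(pattern)}")
```

Brute-force enumeration grows combinatorially with log length, so a sketch like this only suits short logs. It also measures correlation with failure, not causation.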
Issue Date: 2012-02-01
URI: http://hdl.handle.net/2142/29503
Rights Information: Copyright 2011 Mohammad Maifi Hasan Khan.
Date Available in IDEALS: 2014-02-01
Date Deposited: 2011-12