Fault injections on mission-critical computer systems
Devnani, Lavin R.
- Issue Date
- Director of Research (if dissertation) or Advisor (if thesis)
- Iyer, Ravishankar K.
- Kalbarczyk, Zbigniew T.
- Department of Study
- Electrical & Computer Engineering
- Degree Granting Institution
- University of Illinois at Urbana-Champaign
- Degree Name
- Degree Level
- Resiliency, Fault Injections, Mission Critical, Power Grid, High-performance Computing, Software-defined Networking, Reliability
- This thesis presents two fault injection campaigns on mission-critical computer systems, with the goals of (1) understanding the impact of faults, errors, and failures, and (2) evaluating the fault tolerance and resilience of the targeted systems in the presence of failures. Our first fault injection campaign studies the effects of failures on high-performance computing (HPC) systems. We target the Cray XE Blue Waters JYC testbed at the National Center for Supercomputing Applications, with the goal of improving the understanding of failure causes and propagation observed in the field failure data analysis of Blue Waters. We use data collected from system logs and network performance counters to (1) characterize fault-error-failure sequences and recovery mechanisms in Gemini interconnection networks and in Cray compute elements, (2) understand the impact of failures on the system and user applications at different scales, and (3) identify and recreate fault scenarios that induce unrecoverable failures, in order to create new tests for system and application design. We use HPCArrow, a newly developed software-implemented fault injection tool that can disable and restore user-specified network links, directional connections, compute nodes, and blades. We observe failures that manifest as applications failing to make forward progress and as network quiescence operations that extend system recovery times. Our second fault injection campaign studies the effects of faults, attacks, and failures on a smart power grid that uses software-defined networking (SDN) to orchestrate its data acquisition network. We evaluate our fault models on a smart power grid simulation running Raincoat, an SDN application that reroutes and spoofs network traffic to thwart attackers. Additionally, we propose an application- and data-plane-based solution to proactively monitor system state and enforce user-defined policies. We show that under certain faults, (1) applications orchestrating the network become ineffective, and (2) periodically monitoring the state of the network can identify faults or attacks before they manifest as failures. The results obtained from this work can aid in enhancing the resiliency of future SDN applications.
- Graduation Semester
- Type of Resource
- Copyright and License Information
- Copyright 2018 Lavin R. Devnani