Title: Preserving application reliability on unreliable hardware
Author(s): Hari, Siva Kumar
Director of Research: Adve, Sarita V.
Doctoral Committee Chair(s): Adve, Sarita V.
Doctoral Committee Member(s): Adve, Vikram S.; Bertacco, Valeria; Naeimi, Helia; King, Samuel T.; Rutenbar, Robin A.
Department / Program: Computer Science
Discipline: Computer Science
Degree Granting Institution: University of Illinois at Urbana-Champaign
Subject(s): Computer Architecture; Hardware Reliability
Abstract: According to Moore’s law, technology scaling is continuously providing smaller and faster devices. These scaled devices are, however, becoming increasingly susceptible to in-field hardware failures from sources such as high-energy particle strikes (soft errors). This reliability threat is expected to affect a broad computing market, motivating the need for very low-cost resiliency solutions. Software-anomaly-based hardware error detection has emerged as an effective low-cost solution. A small fraction of hardware errors, however, escape these anomaly detectors and produce Silent Data Corruptions (SDCs). Eliminating or significantly lowering the user-visible SDC rate is crucial for software-driven reliability solutions to become practically successful. The goal of this thesis, therefore, is to provide programmers and system designers with tools and techniques to evaluate software-driven resiliency solutions, identify application locations that are susceptible to producing SDCs, and provide application-centric SDC mitigation techniques that achieve significantly lower SDC rates for a given performance (and/or power) budget with low effort.

The first part of this thesis presents an approach called Relyzer that addresses the challenges in identifying virtually all application locations that are susceptible to producing SDCs when subjected to soft errors. Instead of performing expensive error injections on all possible application-level error sites, which is impractical, Relyzer carefully picks a small set of representatives called pilots. It employs novel error site pruning techniques to reduce the number of detailed error injections. The key insight is to show equivalence between application-level error sites, as opposed to only predicting their outcomes. Relyzer uses program structure and dynamic information to show equivalence between error sites from different dynamic instances of the same static instruction or variable.
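The pilot-based pruning described above can be illustrated with a minimal sketch. It is not Relyzer's implementation: the `ErrorSite` fields, the `context` equivalence key, and the `inject` callback are all hypothetical names introduced here, and the sketch simply assumes that dynamic instances of the same static instruction with the same key behave equivalently under an error.

```python
from collections import defaultdict
from typing import Callable, Dict, Hashable, Iterable, List, NamedTuple


class ErrorSite(NamedTuple):
    static_pc: int      # hypothetical: address of the static instruction
    dynamic_index: int  # hypothetical: which dynamic instance of that instruction
    context: Hashable   # hypothetical equivalence key, e.g. a control-flow summary


def pick_pilots(sites: Iterable[ErrorSite]) -> Dict[ErrorSite, List[ErrorSite]]:
    """Group sites assumed equivalent and choose one pilot per group."""
    groups: Dict[tuple, List[ErrorSite]] = defaultdict(list)
    for site in sites:
        # Only instances of the same static instruction may share a group.
        groups[(site.static_pc, site.context)].append(site)
    # The first member is an arbitrary pilot choice made for this sketch.
    return {members[0]: members for members in groups.values()}


def estimate_outcomes(
    sites: Iterable[ErrorSite],
    inject: Callable[[ErrorSite], str],
) -> Dict[ErrorSite, str]:
    """Inject an error only at each pilot and reuse its outcome for the group."""
    outcomes: Dict[ErrorSite, str] = {}
    for pilot, members in pick_pilots(sites).items():
        result = inject(pilot)  # the expensive step runs once per group
        for member in members:
            outcomes[member] = result
    return outcomes
```

The number of expensive injections then equals the number of groups rather than the number of error sites, which is the source of the savings the abstract describes.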
Results show that 99.78% of error sites are pruned across twelve studied workloads, requiring only a few expensive error injections to determine the vulnerability of all application-level error sites. While performing error injection experiments on the remaining error sites (one at a time) is practical, it still requires significant simulation time. Therefore, Relyzer proposes and employs a gang error simulator called GangES as a performance enhancement technique. GangES bundles multiple error (pilot) simulations and periodically compares simulation states to allow early termination of equivalent ones, saving simulation time that is otherwise needed to run the application to completion and verify the output. GangES attempts to show equivalence between error sites from different static instructions and variables that were not considered by Relyzer earlier. Results show that GangES provides a total error simulation time savings of 51%.

The second part of the thesis employs Relyzer’s capability of identifying virtually all SDC-causing program locations for three different purposes.

• Developing SDC-targeted application-centric error detectors is a primary application of Relyzer. To achieve this goal, this thesis employs Relyzer to identify and analyze program properties that appear around most SDC-producing program locations. Exploiting this analysis, it then develops low-cost program-level error detectors that are shown to be effective in reducing the reliance on expensive (instruction-level) redundancy for full SDC coverage.

• Architects often over-provision systems for higher resiliency, trading off power or performance, due to the lack of efficient techniques that allow such tuning. Relyzer enables tuning for resiliency by identifying virtually all SDC-vulnerable program locations and selectively adding error detectors. Employing Relyzer and the SDC-targeted error detectors (developed as a part of the first application), this thesis obtains practical and flexible points on performance vs. resiliency trade-off curves. For example, for average SDC reductions of 90% and 99%, the average execution overheads of this approach versus selective redundancy alone are 12% vs. 30% and 19% vs. 43%, respectively.

• This thesis also studies previously proposed metrics based purely on program analysis, along with some derivatives, which need no error injection experiments, as faster alternatives for identifying SDC-causing program locations. Although the results are largely negative, they provide evidence that such models are not straightforward to determine and underscore the importance of Relyzer.
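The idea of tuning resiliency by selectively adding detectors can be sketched as a simple greedy selection. This is an illustrative assumption, not the procedure used in the thesis: the `Detector` type and `select_detectors` function are hypothetical, each detector is assumed to cover a disjoint share of SDCs so that coverage adds up, and every overhead is assumed to be greater than zero.

```python
from typing import List, NamedTuple, Tuple


class Detector(NamedTuple):
    name: str
    sdc_covered: float  # fraction of all SDCs caught (assumed disjoint across detectors)
    overhead: float     # fractional execution-time overhead, assumed > 0


def select_detectors(
    candidates: List[Detector],
    target_reduction: float,
) -> Tuple[List[Detector], float, float]:
    """Greedily add the most cost-effective detectors until the target is met."""
    # Rank by SDC coverage gained per unit of overhead, best first.
    ranked = sorted(
        candidates,
        key=lambda d: d.sdc_covered / d.overhead,
        reverse=True,
    )
    chosen: List[Detector] = []
    covered = 0.0
    cost = 0.0
    for det in ranked:
        if covered >= target_reduction:
            break
        chosen.append(det)
        covered += det.sdc_covered
        cost += det.overhead
    # If the target is unreachable, all candidates are returned and
    # `covered` stays below `target_reduction`.
    return chosen, covered, cost
```

Sweeping `target_reduction` over a range of values yields a set of (overhead, SDC reduction) points, which is one simple way to picture a performance vs. resiliency trade-off curve.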
Issue Date: 2014-01-16
Rights Information: Copyright 2013 Siva Kumar Hari
Date Available in IDEALS: 2014-01-16
Date Deposited: 2013-12
