Files in this item

File: HASSANNMAHMOUD-DISSERTATION-2020.pdf (application/pdf, 14MB)
Title: Towards scalable and specialized application error analysis
Author(s): Hassan N Mahmoud, Abdulrahman
Director of Research: Adve, Sarita V
Doctoral Committee Chair(s): Adve, Sarita V
Doctoral Committee Member(s): Marinov, Darko; Fletcher, Christopher W; Misailovic, Sasa; Hari, Siva Kumar Sastry; Ceze, Luis
Department / Program: Computer Science
Discipline: Computer Science
Degree Granting Institution: University of Illinois at Urbana-Champaign
Keyword(s): Computer Architecture; Approximate Computing; Hardware Resilience; Software-directed Reliability; Software Testing; Deep Neural Networks; Soft Errors; Domain-Specific Reliability
Abstract: At current technology sizes, modern systems at scale are increasingly susceptible to transient hardware errors caused by natural phenomena such as high-energy particle strikes (also called soft errors). Traditional solutions for soft errors, however, typically rely on indiscriminate redundancy in space and/or time for resilience. Such techniques can incur high system overheads, whether in manufacturing cost, runtime performance, energy consumption, or area requirements. Moreover, the all-or-nothing protection offered by full redundancy may result in over-protection and inefficient use of resources. While it is critical to protect against the effects of hardware errors, it is therefore important to do so in an efficient and low-cost manner. One way to reduce the cost of protecting applications from hardware errors is to understand how errors propagate at finer granularities, and to protect only the vulnerable components via selective duplication. This raises three important questions:
1. What granularity of analysis is reasonable to target?
2. Which components at this granularity should be selected for protection?
3. How should the selective protection be implemented in a low-cost manner?
This thesis addresses these three questions with the design of multiple tools and techniques geared towards identifying and understanding how single-bit flip errors propagate and affect an application's output. First, we present a general-purpose tool called Approxilyzer. Approxilyzer uses the novel error pruning and equalization techniques pioneered by a prior tool, Relyzer, to quantify the impact of virtually every error site in an application. Targeting the instruction-level granularity for analysis and protection, Approxilyzer shows that not all errors are equally important, and that trading off a small output quality degradation (for example, 1%) can yield a large reduction in resiliency overhead (up to 55%) at 99% resiliency coverage.
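To make the single-bit-flip error model concrete, the sketch below injects a bit flip into the IEEE-754 representation of one input value and buckets the outcome by output quality, in the spirit of an error-site analysis. This is a hypothetical, minimal illustration, not the Approxilyzer implementation; the `classify` buckets and the 1% tolerance threshold are assumptions for the example.

```python
import struct

def flip_bit(value: float, bit: int) -> float:
    """Flip one bit of a 64-bit float's IEEE-754 representation."""
    (bits,) = struct.unpack("<Q", struct.pack("<d", value))
    bits ^= 1 << bit
    (flipped,) = struct.unpack("<d", struct.pack("<Q", bits))
    return flipped

def classify(golden: float, faulty: float, tol: float = 0.01) -> str:
    """Bucket one injection outcome (illustrative categories)."""
    if faulty != faulty:              # NaN: corruption that is easy to detect
        return "detected"
    if golden == faulty:
        return "masked"               # error had no effect on the output
    rel = abs(faulty - golden) / max(abs(golden), 1e-12)
    # Within tolerance: acceptable quality loss; beyond it: silent data corruption.
    return "tolerable" if rel <= tol else "sdc"

# Tiny campaign: flip every bit of one operand of a small computation.
golden = 0.5 * 0.5 + 1.25 * 1.25 + 2.0 * 2.0
outcomes = {}
for bit in range(64):
    x = flip_bit(0.5, bit)
    faulty = x * x + 1.25 * 1.25 + 2.0 * 2.0
    outcomes[bit] = classify(golden, faulty)
```

A flipped sign bit is masked here (the value is squared), while a flipped high exponent bit produces a gross corruption; tallying such outcomes over all error sites is what lets an analysis rank which ones actually need protection.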
While Approxilyzer is a promising tool for resiliency analysis, it initially took a long time to run due to the large number of error sites requiring exploration in an application. To accelerate error analysis tools such as Approxilyzer, the second part of this thesis introduces a software-testing-inspired toolkit called Minotaur. Minotaur bridges the gap between software testing and hardware resiliency by adapting multiple techniques from the software engineering domain to make hardware error analysis faster and thus more scalable. We show that Minotaur significantly improves the runtime of Approxilyzer (10.3× on average) while simultaneously improving its accuracy in identifying vulnerable instructions that need protection. The third part of this thesis focuses on reducing the implementation overhead of instruction-level duplication by taking into consideration the hardware platform and the unique opportunities provided by the backend architecture. Specifically, we develop a tool called SInRG, or Software-managed Instruction Replication for GPUs. SInRG provides a family of instruction duplication techniques that exploit underutilized hardware resources for error detection. Inspired by CPU instruction-level duplication, SInRG establishes the first practical approach to software-directed instruction duplication for GPU-based systems, identifies GPU-specific opportunities for overhead reduction, and explores software and hardware performance optimizations to significantly lower the overheads of replication. The GPU-specific software optimizations trade off error containment for performance and reduce the average runtime overhead to 36%. We also propose new ISA extensions with limited hardware changes and area costs to further lower the average runtime overhead to just 30%. General-purpose error analysis and hardening techniques provide the benefit of being universally applicable to general-purpose code.
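The core idea behind instruction duplication for error detection can be sketched at a high level as duplicate-and-compare: execute the work twice and flag a mismatch as a detected transient error. The decorator below is an illustrative analogue in Python, not the SInRG mechanism (which duplicates at the instruction level on GPUs); the function names are hypothetical.

```python
def duplicated(fn):
    """Run fn twice and compare results; a mismatch between the redundant
    executions signals a transient hardware error (illustrative sketch)."""
    def wrapper(*args):
        shadow = fn(*args)        # redundant ("shadow") execution
        primary = fn(*args)       # primary execution
        if primary != shadow:
            raise RuntimeError("transient error detected: redundant results differ")
        return primary
    return wrapper

@duplicated
def saxpy(a, xs, ys):
    """A small GPU-style kernel body: y = a*x + y, elementwise."""
    return [a * x + y for x, y in zip(xs, ys)]

result = saxpy(2.0, [1.0, 2.0], [0.5, 0.5])
```

Running everything twice is exactly the ~2× cost that motivates the thesis: selective duplication and backend-specific optimizations aim to keep the detection while shedding most of this overhead.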
Given additional information about the application, however, one can further lower the cost of resiliency solutions by leveraging domain knowledge. The fourth part of this thesis uses this premise to perform a specialized resiliency analysis for convolutional neural networks (CNNs), due to their prevalence in many safety-critical applications such as self-driving cars. We develop and evaluate two selective protection techniques at different target granularities in CNNs (the feature map level and the inference level), and show that the combination of both techniques is better than the sum of its parts. Our results show that these specialized, domain-specific error analysis and hardening techniques can achieve very high error coverage of 99.78% on average for the CNNs explored, while incurring as little as 20% overhead, or 5× less overhead than full duplication. Overall, this thesis focuses on understanding how hardware errors propagate to corrupt an application's output. We develop multiple tools and techniques for error analysis, and advocate for specialized, selective protection solutions as a means to achieve low overheads while maintaining high error coverage in applications.
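Selective protection at the feature-map granularity implies a selection problem: given per-component vulnerability estimates and a duplication-overhead budget, choose which components to protect. The sketch below shows one simple greedy formulation of that idea; it is an assumption-laden illustration (the scores, costs, and greedy policy are hypothetical), not the thesis's actual selection algorithm.

```python
def select_for_protection(vulnerability, cost, budget):
    """Greedily protect the most vulnerable feature maps first, until the
    duplication-overhead budget is exhausted (illustrative sketch).

    vulnerability: {feature_map_name: estimated share of errors it can corrupt}
    cost:          {feature_map_name: runtime overhead of duplicating it}
    budget:        total overhead we are willing to pay
    """
    order = sorted(vulnerability, key=vulnerability.get, reverse=True)
    protected, spent = [], 0.0
    for fmap in order:
        if spent + cost[fmap] <= budget:
            protected.append(fmap)
            spent += cost[fmap]
    covered = sum(vulnerability[f] for f in protected)
    return protected, covered

# Hypothetical scores: protect the two most vulnerable maps under a 20% budget.
vul = {"conv1/f0": 0.50, "conv2/f3": 0.30, "conv1/f1": 0.10}
dup_cost = {"conv1/f0": 0.10, "conv2/f3": 0.10, "conv1/f1": 0.10}
protected, covered = select_for_protection(vul, dup_cost, budget=0.20)
```

The point of the sketch is the trade-off itself: a small, well-chosen subset of components can capture most of the vulnerability at a fraction of the cost of duplicating everything.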
Issue Date: 2020-12-02
Rights Information: Copyright 2020 Abdulrahman Hassan N Mahmoud
Date Available in IDEALS: 2021-03-05
Date Deposited: 2020-12