Towards scalable and specialized application error analysis
Hassan N Mahmoud, Abdulrahman
Permalink
https://hdl.handle.net/2142/109425
Description
- Title
- Towards scalable and specialized application error analysis
- Author(s)
- Hassan N Mahmoud, Abdulrahman
- Issue Date
- 2020-12-02
- Director of Research (if dissertation) or Advisor (if thesis)
- Adve, Sarita V
- Doctoral Committee Chair(s)
- Adve, Sarita V
- Committee Member(s)
- Marinov, Darko
- Fletcher, Christopher W
- Misailovic, Sasa
- Hari, Siva Kumar Sastry
- Ceze, Luis
- Department of Study
- Computer Science
- Discipline
- Computer Science
- Degree Granting Institution
- University of Illinois at Urbana-Champaign
- Degree Name
- Ph.D.
- Degree Level
- Dissertation
- Keyword(s)
- Reliability
- Computer Architecture
- Approximate Computing
- Hardware Resilience
- Software-directed Reliability
- Software Testing
- Deep Neural Networks
- Soft Errors
- Domain-Specific Reliability
- Abstract
- Modern systems at scale are increasingly susceptible to transient hardware errors at current technology sizes from natural phenomena such as high-energy particle strikes (also called soft errors). Traditional solutions aimed at dealing with soft errors, however, typically rely on indiscriminate redundancy in space and/or time for resilience. Such techniques can incur high system overheads, whether in manufacturing cost, runtime performance, energy consumption, and/or area requirements. Moreover, the all-or-nothing protection offered by full redundancy may result in over-protection and inefficient use of resources. To that end, while it is critical to be able to protect against the effect of hardware errors, it is important to do so in an efficient and low-cost manner. One way to reduce the cost of protecting applications from hardware errors is to understand how errors propagate at finer granularities, and only protect vulnerable components via selective duplication. This raises three important questions: 1. What granularity of analysis is reasonable to target? 2. Which components at this granularity should be selected for protection? 3. How should the selective protection be implemented in a low-cost manner? This thesis addresses these three questions with the design of multiple tools and techniques geared towards identifying and understanding how single-bit flip errors propagate and affect an application’s output. First, we present a general-purpose tool called Approxilyzer. Approxilyzer uses the novel error pruning and equalization techniques pioneered by a prior tool, Relyzer, to quantify the impact of virtually every error site in an application. Targeting the instruction-level granularity for analysis and protection, Approxilyzer shows that not all errors are equally important, and that trading off a small output quality degradation (for example, 1%) can yield large resiliency overhead reduction (up to 55%) for 99% resiliency coverage.
While Approxilyzer is a promising tool for resiliency analysis, it initially took a long time to run due to the large number of error sites requiring exploration in an application. To accelerate error analysis tools (such as Approxilyzer), the second part of this thesis introduces a software-testing inspired toolkit called Minotaur. Minotaur bridges the gap between software testing and hardware resiliency by adapting multiple techniques from the software engineering domain to make hardware error analysis faster and thus more scalable. We show that Minotaur can significantly improve the runtime of Approxilyzer (10.3× on average), while simultaneously improving its accuracy in identifying vulnerable instructions which need protection. The third part of this thesis focuses on reducing the implementation overhead of instruction-level duplication, by taking into consideration the hardware platform and unique opportunities provided by the backend architecture. Specifically, we develop a tool called SInRG, or Software-managed Instruction Replication for GPUs. SInRG provides a family of instruction duplication techniques that exploit underutilized hardware resources for error detection. Inspired by CPU instruction-level duplication, SInRG establishes the first practical approach to software-directed instruction duplication for GPU-based systems, identifies GPU-specific opportunities for overhead reduction, and explores software and hardware performance optimizations to lower the overheads of replication significantly. The GPU-specific software optimizations trade off error containment for performance and reduce the average runtime overhead to 36%. We also propose new ISA extensions with limited hardware changes and area costs to further lower the average runtime overhead to just 30%.
General-purpose error analysis and hardening techniques provide the benefit of being universally applicable to general-purpose code. Additional information about the application, however, can further enable low-cost resiliency solutions by leveraging domain knowledge. The fourth part of this thesis uses this premise to perform a specialized resiliency analysis for convolutional neural networks (CNNs), due to their prevalence in many safety-critical applications such as self-driving cars. We develop and evaluate two selective protection techniques at different target granularities in CNNs (feature map level and inference level), and show that the combination of both techniques is better than the sum of its parts. Our results show that the specialized, domain-specific error analysis and hardening techniques can achieve very high error coverage of 99.78% on average for the CNNs explored, while incurring as low as 20% overhead, or 5× less overhead compared to full duplication. Overall, this thesis focuses on understanding how hardware errors propagate to corrupt an application’s output. We develop multiple tools and techniques for error analysis, and advocate for specialized, selective protection solutions as a means to achieve low overheads while maintaining high error coverage in applications.
- Graduation Semester
- 2020-12
- Type of Resource
- Thesis
- Permalink
- http://hdl.handle.net/2142/109425
- Copyright and License Information
- Copyright 2020 Abdulrahman Hassan N Mahmoud
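The abstract refers to a single-bit flip error model and to instruction-level duplication for error detection. A minimal sketch of both ideas follows, assuming a 32-bit integer result; the names `flip_bit` and `checked_add` are hypothetical illustrations and are not taken from Approxilyzer, Minotaur, or SInRG.

```python
from typing import Optional

WIDTH = 32                   # assumed register width for this sketch
MASK = (1 << WIDTH) - 1


def flip_bit(value: int, bit: int) -> int:
    """Return value with one bit inverted, modelling a single-bit soft error."""
    if not 0 <= bit < WIDTH:
        raise ValueError("bit index out of range")
    return (value ^ (1 << bit)) & MASK


def checked_add(a: int, b: int, fault_bit: Optional[int] = None) -> int:
    """Compute a + b twice and compare the copies (duplication-based detection).

    If fault_bit is given, a single-bit flip is injected into the first
    copy only, so the comparison against the second copy exposes it.
    """
    primary = (a + b) & MASK
    if fault_bit is not None:
        primary = flip_bit(primary, fault_bit)  # hypothetical injected error
    shadow = (a + b) & MASK                     # redundant copy
    if primary != shadow:
        raise RuntimeError("soft error detected: copies disagree")
    return primary
```

Selective protection, as the abstract describes it, would apply a check like this only to the operations found to be vulnerable rather than to every operation.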
Owning Collections
Graduate Dissertations and Theses at Illinois PRIMARY
Graduate Theses and Dissertations at Illinois
Dissertations and Theses - Computer Science
Dissertations and Theses from the Dept. of Computer Science