Files in this item

File:Keun Soo_Yim.pdf (3MB)
Description:(no description provided)
Format:PDF (application/pdf)

Description

Title:From experiment to design – fault characterization and detection in parallel computer systems using computational accelerators
Author(s):Yim, Keun Soo
Director of Research:Iyer, Ravishankar K.
Doctoral Committee Chair(s):Iyer, Ravishankar K.
Doctoral Committee Member(s):Sha, Lui R.; Campbell, Roy H.; Abdelzaher, Tarek F.; Chen, Shuo
Department / Program:Computer Science
Discipline:Computer Science
Degree Granting Institution:University of Illinois at Urbana-Champaign
Degree:Ph.D.
Genre:Dissertation
Subject(s):Fault tolerance system design
Experimental validation
Error detection
Fault injection
Measurement-based co-design
Graphics Processing Unit fault tolerance
Message Passing Interface
CPU-GPU hybrid computers
COTS-based mission-critical systems
Reliability
Dependability
Abstract:This dissertation summarizes experimental validation and co-design studies conducted to optimize the fault detection capabilities and overheads in hybrid computer systems (e.g., using CPUs and Graphics Processing Units, or GPUs), and consequently to improve the scalability of parallel computer systems using computational accelerators. The experimental validation studies were conducted to help us understand the failure characteristics of CPU-GPU hybrid computer systems under various types of hardware faults. The main characterization targets were faults that are difficult to detect and/or recover from, e.g., faults that cause long latency failures (Ch. 3), faults in dynamically allocated resources (Ch. 4), faults in GPUs (Ch. 5), faults in MPI programs (Ch. 6), and microarchitecture-level faults with specific timing features (Ch. 7). The co-design studies were based on the characterization results. One of the co-designed systems has a set of source-to-source translators that customize and strategically place error detectors in the source code of target GPU programs (Ch. 5). Another co-designed system uses an extension card to learn the normal behavioral and semantic execution patterns of message-passing processes executing on CPUs, and to detect abnormal behaviors of those parallel processes (Ch. 6). The third co-designed system is a co-processor that has a set of new instructions in order to support software-implemented fault detection techniques (Ch. 7). The work described in this dissertation gains more importance because heterogeneous processors have become an essential component of state-of-the-art supercomputers. GPUs were used in three of the five fastest supercomputers that were operating in 2011. Our work included comprehensive fault characterization studies in CPU-GPU hybrid computers. 
In CPUs, we monitored the target systems for a long period of time after injecting faults (a temporally comprehensive experiment), and injected faults into various types of program states that included dynamically allocated memory (to be spatially comprehensive). In GPUs, we used fault injection studies to demonstrate the importance of detecting silent data corruption (SDC) errors that are mainly due to the lack of fine-grained protections and the massive use of fault-insensitive data. This dissertation also presents transparent fault tolerance frameworks and techniques that are directly applicable to hybrid computers built using only commercial off-the-shelf hardware components. This dissertation shows that by developing an understanding of the failure characteristics and error propagation paths of target programs, we were able to create fault tolerance frameworks and techniques that can quickly detect and recover from hardware faults with low performance and hardware overheads.
Issue Date:2013-05-24
URI:http://hdl.handle.net/2142/44390
Rights Information:Copyright 2013 Keun Soo Yim
Date Available in IDEALS:2013-05-24
Date Deposited:2013-05

