Files in this item



application/pdf NI-DISSERTATION-2016.pdf (4MB)


Title: Mitigation of failures in high performance computing via runtime techniques
Author(s): Ni, Xiang
Director of Research: Kalé, Laxmikant V
Doctoral Committee Chair(s): Kalé, Laxmikant V
Doctoral Committee Member(s): Vaidya, Nitin; Kramer, William; Cappello, Franck
Department / Program: Computer Science
Discipline: Computer Science
Degree Granting Institution: University of Illinois at Urbana-Champaign
Subject(s): runtime system; fault tolerance; silent data corruption; hard error; soft error; solid state disk; high performance computing
Abstract: As machines increase in scale, failure rates of supercomputers are predicted to increase correspondingly. Even though the mean time to failure (MTTF) of individual components is high, the large number of components significantly decreases the system MTTF. Meanwhile, the decreasing size of transistors has been critical to the increase in capacity of supercomputers. As transistors shrink, however, silent data corruptions (SDCs) are likely to occur more frequently. SDCs do not inhibit execution, but may silently lead to incorrect results. In this thesis, we leverage runtime system and compiler techniques to mitigate a significant fraction of failures automatically with low overhead. The main goals of the system-level fault tolerance strategies designed in this thesis are: reducing the extra cost added to application execution while improving system reliability; automatically adjusting fault tolerance decisions based on environmental changes, without user intervention; and protecting applications not only from fail-stop failures but also from silent data corruptions. The main contributions of this thesis are the development of a semi-blocking checkpoint protocol that overlaps application execution with fault tolerance operations to reduce the overhead of checkpointing, a runtime system technique for automatic checkpoint and restart without user intervention, a holistic framework (ACR) for automatically detecting and recovering from silent data corruptions, and a framework called FlipBack that provides targeted protection against silent data corruption at low cost.
Issue Date: 2016-07-12
Rights Information: Copyright 2016 Xiang Ni
Date Available in IDEALS: 2016-11-10
Date Deposited: 2016-08
