Files in this item



application/pdf9625186.pdf (5MB)Restricted to U of Illinois
(no description provided)PDF


Title:Manual and compiler assisted methods for generating fault-tolerant parallel programs
Author(s):Roy-Chowdhury, Amber
Doctoral Committee Chair(s):Banerjee, Prithviraj
Department / Program:Electrical and Computer Engineering
Discipline:Electrical Engineering
Degree Granting Institution:University of Illinois at Urbana-Champaign
Subject(s):Engineering, Electronics and Electrical
Computer Science
Abstract:Algorithm-based fault-tolerance (ABFT) is an inexpensive method of incorporating fault-tolerance into existing applications. Applications are modified to operate on encoded data and produce encoded results which may then be checked for correctness. An attractive feature of the scheme is that it requires little or no modification to the underlying hardware or system software. Previous algorithm-based methods for developing reliable versions of numerical programs for general-purpose muiticomputers have mostly concerned themselves with error detection. A truly fault-tolerant algorithm, however, needs to locate errors and recover from them once they have been located. In a parallel processing environment, this corresponds to locating the faulty processors and recovering the data corrupted by the faulty processors. In this dissertation, we first present a general scheme for performing fault-location and recovery under the ABFT framework. Our fault model assumes that a faulty processor can corrupt all of the data it possesses. The fault-location scheme is an application of system-level diagnosis theory to the ABFT framework, while the fault-recovery scheme uses ideas from coding theory to maintain redundant data and uses this to recover corrupted data in the event of processor failures. Results are presented on implementations of three numerical algorithms on a distributed memory multicomputer, which demonstrate acceptably low overheads for the single- and double-fault location and recovery cases.
For a class of algorithms performing affine transformations, we automate the process of generating an error-detecting version at compile time. The compiler is used to identify loops that perform affine transformations on array elements. These loops are then checked by computing a checksum over the array elements being transformed and transforming the checksums appropriately, which typically results in much smaller overheads than checking the entire code by duplication. Portions of code in the program that are not affine transformations are checked by duplication. An existing source-to-source compiler, Parafrase-2, has been modified to take in programs written in High Performance Fortran (HPF) and output an error-detecting version of the same. Data distributions for the new arrays and checksums introduced are specified by inserting additional HPF directives in the program. The modified program can then be input to a parallelizer for distributed memory machines, such as PARADIGM, to obtain an error-detecting parallel program. We demonstrate results on three numerical programs by executing the error-detecting versions generated by our compiler on a distributed memory multicomputer.
Issue Date:1996
Rights Information:Copyright 1996 Roy-Chowdhury, Amber
Date Available in IDEALS:2011-05-07
Identifier in Online Catalog:AAI9625186
OCLC Identifier:(UMI)AAI9625186

This item appears in the following Collection(s)

Item Statistics