Title: Reconfiguration and recovery in distributed memory multicomputers
Author(s): Peercy, Michael Paul
Doctoral Committee Chair(s): Banerjee, Prithviraj
Department / Program: Electrical and Computer Engineering
Discipline: Electrical Engineering
Degree Granting Institution: University of Illinois at Urbana-Champaign
Subject(s): Engineering, Electronics and Electrical; Computer Science
Abstract: As distributed memory multiprocessors grow in size, so does the likelihood that a fault will remove one of the processors from the system. Such a fault removes some or all of the following: (a) communication paths, (b) processing power, (c) topological consistency, and (d) progress of the running application. In this thesis we propose solutions to all of these problems.
We handle the lost communication paths through table-based routing strategies. We present distributed algorithms that, after a fault or repair, fill the tables with the shortest communication paths surviving in the system. We also give algorithms for filling broadcast tables in the same way and for guaranteeing that routing with these tables is deadlock-free.
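As a rough illustration of refilling a routing table with the shortest surviving paths, the sketch below runs a breadth-first search over a hypercube while skipping faulty nodes. It is not the thesis's method: the abstract describes distributed algorithms without detailing them, so a centralized search is swapped in here, and the names `hypercube_neighbors` and `build_routing_table` are hypothetical. The sketch also makes no attempt at deadlock freedom or broadcast tables.

```python
from collections import deque


def hypercube_neighbors(node, dim):
    """Return the labels that differ from `node` in exactly one bit."""
    return [node ^ (1 << bit) for bit in range(dim)]


def build_routing_table(source, dim, faulty):
    """Map each reachable destination to the first hop of a shortest path.

    Assumption: a centralized breadth-first search stands in for the
    distributed table-filling algorithms, which the abstract does not detail.
    """
    if source in faulty:
        return {}
    next_hop = {}
    visited = {source}
    queue = deque()
    # Seed the search with the healthy neighbors of the source.
    for neighbor in hypercube_neighbors(source, dim):
        if neighbor not in faulty:
            visited.add(neighbor)
            next_hop[neighbor] = neighbor
            queue.append(neighbor)
    # Each newly reached node inherits the first hop of the node it was reached from.
    while queue:
        node = queue.popleft()
        for neighbor in hypercube_neighbors(node, dim):
            if neighbor not in faulty and neighbor not in visited:
                visited.add(neighbor)
                next_hop[neighbor] = next_hop[node]
                queue.append(neighbor)
    return next_hop


# Example: a 3-dimensional hypercube (8 nodes) in which node 1 has failed.
table = build_routing_table(source=0, dim=3, faulty={1})
```

Destinations that are faulty or cut off from the source simply do not appear in the resulting table.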
To take care of the lost processing power, we propose low-cost hardware reconfiguration schemes in which we embed spare processors throughout the multicomputer. One scheme places each spare processor alongside the normal processor on a node; the other places each spare processor along a selected link. We give results from low-level trace-driven simulation of these schemes with six applications. The results show that the overhead due to the hardware reconfiguration is very low.
We handle the problem of topological consistency with the abstraction of virtual spare processors. A faulty processor's workload is evenly divided among a number of nearby nodes, which timeslice the displaced workload along with their native tasks. We present an implementation of this software technique for static reconfiguration on the iPSC/2 hypercube. We give experimental results of our system for a number of applications.
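The idea of spreading a faulty processor's workload over nearby nodes can be sketched as follows. This is only an illustration under stated assumptions: "nearby" is taken to mean the immediate hypercube neighbors, tasks are dealt out round-robin, and the function name `redistribute` and the task labels are hypothetical. The abstract does not say how the thesis selects nodes or divides the work.

```python
def redistribute(tasks, faulty_node, dim, faulty):
    """Deal the tasks of `faulty_node` round-robin to its healthy neighbors.

    Assumptions: "nearby" means the immediate hypercube neighbors, and an
    even split is approximated by round-robin dealing.
    """
    healthy = [
        faulty_node ^ (1 << bit)
        for bit in range(dim)
        if (faulty_node ^ (1 << bit)) not in faulty
    ]
    if not healthy:
        raise RuntimeError("no healthy neighbor can host the displaced workload")
    assignment = {node: [] for node in healthy}
    for index, task in enumerate(tasks):
        assignment[healthy[index % len(healthy)]].append(task)
    return assignment


# Example: node 5 of a 3-dimensional hypercube fails while holding five tasks.
plan = redistribute(["t0", "t1", "t2", "t3", "t4"], faulty_node=5, dim=3, faulty={5})
```

Each host would then timeslice its share of the displaced tasks along with its native work, as the paragraph above describes.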
Finally, we handle the progress lost by the application due to a fault through our software technique for reconfiguration and recovery. We take advantage of the characteristics of the Actor model of parallel computation and dynamically shadow and checkpoint the activity of the application. We have implemented our techniques through modifications of the runtime system for the parallel language Charm running on the iPSC/2. After thoroughly discussing the theory and implementation, we give measurements of the overhead due to fault tolerance for a number of applications and demonstrate that the applications continue after injection of a fault.
We present experimental evaluations of most of the concepts proposed in the thesis, using real parallel applications from the numerical and VLSI CAD domains executing on a real distributed memory multicomputer, an Intel iPSC/2 hypercube.
Issue Date: 1994
Rights Information: Copyright 1994 Peercy, Michael Paul
Date Available in IDEALS: 2011-05-07
Identifier in Online Catalog: AAI9512511
OCLC Identifier: (UMI)AAI9512511
