Files in this item
|(no description provided)|
|Title:||Reconfiguration and recovery in distributed memory multicomputers|
|Author(s):||Peercy, Michael Paul|
|Doctoral Committee Chair(s):||Banerjee, Prithviraj|
|Department / Program:||Electrical and Computer Engineering|
|Degree Granting Institution:||University of Illinois at Urbana-Champaign|
|Subject(s):||Engineering, Electronics and Electrical
|Abstract:||As the sizes of distributed memory multiprocessors increase, the likelihood of a fault removing one of the processors from the system grows as well. Such a fault removes some or all of the following: (a) communication paths, (b) processing power, (c) topological consistency, and (d) progress of the running application. In this thesis we propose solutions to handle all of these problems.
We handle the lost communication paths through table-based routing strategies. We present distributed algorithms for filling the tables, after a fault or repair, with the shortest communication paths surviving in the system. Also, we give algorithms for similarly filling broadcast tables and for guaranteeing that the routing tables are deadlock-free.
To take care of the lost processing power, we propose low-cost hardware reconfiguration schemes in which we embed spare processors throughout the multicomputer. One scheme places each spare processor alongside the normal processor on a node; the other places each spare processor along a selected link. We give results from low-level trace-driven simulation of these schemes with six applications. The results show that the overhead due to the hardware reconfiguration is very low.
We handle the problem of topological consistency with the abstraction of virtual spare processors. A faulty processor's workload is evenly divided among a number of nearby nodes, which timeslice the displaced workload along with their native tasks. We present an implementation of this software technique for static reconfiguration on the iPSC/2 hypercube. We give experimental results of our system for a number of applications.
Finally, the lost progress of the application due to fault is handled through our software technique for reconfiguration and recovery. We take advantage of the characteristics of the Actor model of parallel computation and dynamically shadow and check-point the activity of the application. We have implemented our techniques through modifications of the runtime system for the parallel language Charm running on the iPSC/2. After thoroughly discussing the theory and implementation, we give measurements of overhead due to fault tolerance for a number of applications and demonstrate continuance of the applications after injection of a fault.
We present experimental evaluations of most of the concepts proposed in the thesis, using real parallel applications from the numerical and VLSI CAD domain executing on a real distributed memory multicomputer, an Intel iPSC/2 hypercube.
|Rights Information:||Copyright 1994 Peercy, Michael Paul|
|Date Available in IDEALS:||2011-05-07|
|Identifier in Online Catalog:||AAI9512511|
This item appears in the following Collection(s)
Graduate Dissertations and Theses at Illinois
Graduate Theses and Dissertations at Illinois
Dissertations and Theses - Electrical and Computer Engineering
Dissertations and Theses in Electrical and Computer Engineering