Files in this item
|(no description provided)|
|Title:||Checkpoint-based forward recovery using lookahead execution and rollback validation in parallel and distributed systems|
|Doctoral Committee Chair(s):||Abraham, Jacob A.; Fuchs, W. Kent|
|Department / Program:||Electrical and Computer Engineering|
|Discipline:||Electrical and Computer Engineering|
|Degree Granting Institution:||University of Illinois at Urbana-Champaign|
|Subject(s):||Engineering, Electronics and Electrical
|Abstract:||This thesis studies a forward recovery strategy using checkpointing and optimistic execution in parallel and distributed systems. The approach uses replicated tasks executing on different processors for forward recovery and checkpoint comparison for error detection. To reduce overall redundancy, this approach employs a lower static redundancy in the common error-free situation to detect error than the standard N Module Redundancy scheme (NMR) does to mask off errors. For the rare occurrence of an error, this approach uses some extra redundancy for recovery. To reduce the run-time recovery overhead, lookahead processes are used to advance computation speculatively and a rollback process is used to produce a diagnosis for correct lookahead processes without rollback of the whole system. Both analytical and experimental evaluation have shown that this strategy can provide a nearly error-free execution time even under faults with a lower average redundancy than NMR.
Using checkpoint comparison for error detection calls for a static checkpoint placement in user programs. Checkpoint insertions based on the system clock produce dynamic checkpoints. A compiler-enhanced polling mechanism using instruction-based time measures is utilized to insert static checkpoints into user programs automatically. The technique has been implemented in a GNU CC compiler for Sun workstations. Experiments demonstrate that the approach provides stable checkpoint intervals and reproducible checkpoint placements with performance overhead comparable to a previous compiler-assisted dynamic scheme (CATCH).
Obtaining a consistent recovery line is another issue to consider in this forward recovery strategy. Checkpointing concurrent processes independently may lead to an inconsistent recovery line that causes rollback propagations. In this thesis, an evolutionary approach to establish a consistent recovery line with low overhead is also described. This approach starts a checkpointing session by checkpointing each process locally and independently. During the checkpoint session, those local checkpoints may be updated, and this updating drives the recovery line evolve into a consistent line. Unlike the globally synchronized approach, the evolutionary approach requires no synchronization protocols to reach a consistent state for checkpointing. Unlike the communication synchronized approach, this approach avoids excessive checkpointing by providing a controllable checkpoint placement. Unlike the loosely synchronized schemes, this approach requires neither message retry nor message replay during recovery.
|Rights Information:||Copyright 1992 Long, Junsheng|
|Date Available in IDEALS:||2011-05-07|
|Identifier in Online Catalog:||AAI9305605|
This item appears in the following Collection(s)
Dissertations and Theses - Electrical and Computer Engineering
Dissertations and Theses in Electrical and Computer Engineering
Graduate Dissertations and Theses at Illinois
Graduate Theses and Dissertations at Illinois