A Fault Tolerance Protocol for Fast Recovery
- Chakravorty, Sayantan
- Issue Date
- computer science
- Large machines with tens or even hundreds of thousands of processors are currently in use. As the number of components increases, the mean time between failures will decrease further. Fault tolerance is an important issue for these machines and for the even larger machines of the future. This is borne out by the significant amount of work in the field of fault tolerance for parallel computing. However, in all current fault tolerance protocols, the recovery time after a crash is no shorter than the time between the last checkpoint and the crash. This wastes valuable computation time, as all the remaining processors wait for the crashed processors to recover. This thesis presents research aimed at developing a fault-tolerant protocol that is relevant in the context of parallel computing and provides fast restarts. We propose to combine the ideas of message logging and object-based virtualization. We leverage two facts: message-logging-based protocols do not require all processors to roll back when one processor crashes, and object-based virtualization allows work to be moved from one processor to another. We develop a message logging protocol that operates in conjunction with object-based virtualization. We evaluate and study the implementation of our protocol in the Charm++/AMPI run-time. We use benchmarks and real-world applications to investigate and improve the performance of different aspects of our protocol. We also modify the load balancing framework of the Charm++ run-time to work with the message logging protocol. We show that, in the presence of faults, an application using our fault tolerance protocol takes less time to complete than one using a traditional checkpoint-based protocol.
- Type of Resource
- Copyright and License Information
- You are granted permission for the non-commercial reproduction, distribution, display, and performance of this technical report in any format, BUT this permission is only for a period of 45 (forty-five) days from the most recent time that you verified that this technical report is still available from the University of Illinois at Urbana-Champaign Computer Science Department under terms that include this permission. All other rights are reserved by the author(s).