Files in this item



application/pdfAgarwal_Rishi.pdf (284kB)
(no description provided)PDF


Title:Rebound: Scalable checkpointing for coherent shared memory
Author(s):Agarwal, Rishi
Advisor(s):Torrellas, Josep
Department / Program:Computer Science
Discipline:Computer Science
Degree Granting Institution:University of Illinois at Urbana-Champaign
Subject(s):Scalable Checkpointing
Shared-Memory Multiprocessors
Abstract:As we move to large manycores, the hardware-based global checkpointing schemes that have been proposed for small shared-memory machines do not scale. Scalability barriers include global operations, work lost to global rollback, and inefficiencies in imbalanced or I/O-intensive loads. Scalable checkpointing requires tracking inter-thread dependences and building the checkpoint and rollback operations around dynamic groups of communicating processors. To address this problem, this paper introduces Rebound, the first hardware-based scheme for co- ordinated local checkpointing in multiprocessors with directory-based cache coherence. Rebound leverages the transactions of a directory protocol to track inter-thread dependences. In addition, it boosts checkpointing efficiency by: (i) delaying the writeback of data to safe memory at check- points, (ii) supporting operation with multiple checkpoints, and (iii) optimizing checkpointing at barrier synchronization. Finally, Rebound introduces distributed algorithms for checkpointing and rollback sets of processors. Simulations of parallel programs with up to 64 threads show that Re- bound is scalable and has very low overhead. For 64 processors, its average performance overhead is only 2%, compared to 15% for global checkpointing.
Issue Date:2011-05-25
Rights Information:Copyright 2011 Rishi Agarwal
Date Available in IDEALS:2011-05-25
Date Deposited:2011-05

This item appears in the following Collection(s)

Item Statistics