Rebound: Scalable checkpointing for coherent shared memory
Agarwal, Rishi
Permalink
https://hdl.handle.net/2142/24039
Description
Title
Rebound: Scalable checkpointing for coherent shared memory
Author(s)
Agarwal, Rishi
Issue Date
2011-05-25T15:07:06Z
Director of Research (if dissertation) or Advisor (if thesis)
Torrellas, Josep
Department of Study
Computer Science
Discipline
Computer Science
Degree Granting Institution
University of Illinois at Urbana-Champaign
Degree Name
M.S.
Degree Level
Thesis
Keyword(s)
Scalable Checkpointing
Shared-Memory Multiprocessors
Faults
Abstract
As we move to large manycores, the hardware-based global checkpointing schemes that have
been proposed for small shared-memory machines do not scale. Scalability barriers include global
operations, work lost to global rollback, and inefficiencies in imbalanced or I/O-intensive loads.
Scalable checkpointing requires tracking inter-thread dependences and building the checkpoint
and rollback operations around dynamic groups of communicating processors.
To address this problem, this paper introduces Rebound, the first hardware-based scheme for
coordinated local checkpointing in multiprocessors with directory-based cache coherence. Rebound
leverages the transactions of a directory protocol to track inter-thread dependences. In addition, it
boosts checkpointing efficiency by: (i) delaying the writeback of data to safe memory at
checkpoints, (ii) supporting operation with multiple checkpoints, and (iii) optimizing checkpointing at
barrier synchronization. Finally, Rebound introduces distributed algorithms for checkpointing and
rolling back sets of processors. Simulations of parallel programs with up to 64 threads show that
Rebound is scalable and has very low overhead. For 64 processors, its average performance overhead
is only 2%, compared to 15% for global checkpointing.
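
The idea of building checkpoint and rollback operations around dynamic groups of communicating processors can be illustrated in software. The following is a minimal sketch only, not the hardware mechanism described in the thesis. It assumes that each producer-to-consumer data transfer is recorded as a dependence, that a checkpoint must include the initiator's producers transitively, and that a rollback must include the initiator's consumers transitively. The class `DependenceTracker` and all of its method names are hypothetical.

```python
from collections import defaultdict


class DependenceTracker:
    """Illustrative model of inter-thread dependence tracking (hypothetical)."""

    def __init__(self):
        # consumer id -> set of processors whose data it has read
        self.producers = defaultdict(set)
        # producer id -> set of processors that have read its data
        self.consumers = defaultdict(set)

    def record(self, producer, consumer):
        """Record that `consumer` read data written by `producer`."""
        if producer != consumer:
            self.producers[consumer].add(producer)
            self.consumers[producer].add(consumer)

    @staticmethod
    def _closure(start, edges):
        """Return every processor reachable from `start` by following `edges`."""
        group, stack = {start}, [start]
        while stack:
            current = stack.pop()
            for neighbor in edges.get(current, ()):
                if neighbor not in group:
                    group.add(neighbor)
                    stack.append(neighbor)
        return group

    def checkpoint_set(self, initiator):
        """Assumed rule: checkpoint together with all transitive producers."""
        return self._closure(initiator, self.producers)

    def rollback_set(self, initiator):
        """Assumed rule: roll back together with all transitive consumers."""
        return self._closure(initiator, self.consumers)


tracker = DependenceTracker()
tracker.record(producer=0, consumer=1)
tracker.record(producer=1, consumer=2)
tracker.record(producer=5, consumer=6)

print(sorted(tracker.checkpoint_set(2)))  # processors that checkpoint with 2
print(sorted(tracker.rollback_set(0)))    # processors that roll back with 0
print(sorted(tracker.rollback_set(5)))    # an unrelated group stays separate
```

In this sketch, processors that have not communicated with the initiator are left out of both sets, which is the property that lets a local scheme avoid the global operations and global rollback mentioned above.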