Files in this item



KIM-DISSERTATION-2015.pdf (PDF, 4MB), restricted to U of Illinois


Title: Architecting, programming, and evaluating an on-chip incoherent multi-processor memory hierarchy
Author(s): Kim, Wooil
Director of Research: Torrellas, Josep
Doctoral Committee Chair(s): Torrellas, Josep
Doctoral Committee Member(s): Snir, Marc; Padua, David; Gropp, William; Sadayappan, P.
Department / Program: Computer Science
Discipline: Computer Science
Degree Granting Institution: University of Illinois at Urbana-Champaign
Subject(s): incoherent cache hierarchy; scratchpad hierarchy; compiler-directed coherence; Runnemede architecture
Abstract: New architectures for extreme-scale computing need to be designed for higher energy efficiency than current systems. The DOE-funded Traleika Glacier architecture is a recently proposed extreme-scale manycore with a radically simplified design, which includes a cluster-based on-chip memory hierarchy without hardware cache coherence. Programming for such an environment, which can use scratchpads or incoherent caches, is challenging. Hence, this thesis focuses on architecting, programming, and evaluating an on-chip incoherent multiprocessor memory hierarchy.

This thesis first examines incoherent multiprocessor caches. It proposes ISA support for data movement in such an environment, along with two relatively user-friendly programming approaches that use this ISA support. The ISA support is largely based on writeback and self-invalidation instructions, while the programming approaches involve shared-memory programming either within a single cluster or across clusters. The thesis also presents compiler transformations for such an incoherent cache hierarchy. Our simulation results show that, with our approach, applications running on incoherent cache hierarchies can deliver reasonable performance. For execution within a cluster, the average execution time of our applications is only 2% higher than with hardware cache coherence. For execution across multiple clusters, our applications run on average 20% faster than with a naive scheme that pushes all the data to the last-level shared cache. The compiler transformations deliver substantial performance improvements for both regular and irregular applications.

The thesis then considers scratchpads. It performs a simulation-based evaluation of the scratchpad design in the Traleika Glacier architecture, showing how the hardware exploits the concurrency available in parallel applications. However, it also reveals the limitations of the current software stack, which lacks smart memory management and high-level hints for the scheduler.
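As a rough illustration of why writeback and self-invalidation support matters, the Python sketch below models two cores with private caches and no hardware coherence. The class `IncoherentCache` and its `writeback` and `self_invalidate` methods are hypothetical stand-ins, not the thesis's actual instructions or interfaces, and writing back dirty lines during invalidation is an assumption of this sketch. It shows the basic pattern: the producer writes back its update before a synchronization point, and the consumer self-invalidates its stale copy after it.

```python
class SharedMemory:
    """Backing store shared by all cores (hypothetical model)."""
    def __init__(self):
        self.data = {}


class IncoherentCache:
    """Private write-back cache with no hardware coherence (hypothetical model)."""
    def __init__(self, memory):
        self.memory = memory
        self.lines = {}     # addr -> locally cached value
        self.dirty = set()  # addrs modified locally, not yet visible to others

    def load(self, addr):
        if addr not in self.lines:  # miss: fetch from shared memory
            self.lines[addr] = self.memory.data.get(addr, 0)
        return self.lines[addr]     # hit: may return a stale value

    def store(self, addr, value):
        self.lines[addr] = value
        self.dirty.add(addr)        # other cores cannot see this yet

    def writeback(self, addrs=None):
        """Copy dirty lines to shared memory (stand-in for a writeback instruction)."""
        targets = set(self.dirty) if addrs is None else self.dirty & set(addrs)
        for addr in targets:
            self.memory.data[addr] = self.lines[addr]
        self.dirty -= targets

    def self_invalidate(self, addrs=None):
        """Drop cached copies so later loads refetch (stand-in for self-invalidation)."""
        targets = set(self.lines) if addrs is None else set(addrs) & set(self.lines)
        for addr in targets:
            if addr in self.dirty:
                # Assumption of this sketch: write back local updates before dropping them.
                self.memory.data[addr] = self.lines[addr]
                self.dirty.discard(addr)
            del self.lines[addr]


def run(use_wb_inv):
    mem = SharedMemory()
    producer, consumer = IncoherentCache(mem), IncoherentCache(mem)
    x = 0x100
    consumer.load(x)                    # consumer caches the old value 0
    producer.store(x, 42)               # producer updates only its private copy
    if use_wb_inv:
        producer.writeback([x])         # before the barrier: publish the update
    # (a barrier between producer and consumer would go here)
    if use_wb_inv:
        consumer.self_invalidate([x])   # after the barrier: discard the stale copy
    return consumer.load(x)


print(run(False))  # 0: the consumer reads its stale cached value
print(run(True))   # 42: the consumer observes the producer's update
```

Without the two operations, the consumer keeps hitting on its stale cached line; with them, it refetches the producer's value from shared memory. The sketch omits the barrier itself and the cluster-based, multi-level hierarchy that the thesis targets.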
Issue Date: 2015-12-02
Rights Information: Copyright 2015 Wooil Kim
Date Available in IDEALS: 2016-03-02
Date Deposited: 2015-12
