|Title:||Hardware and compiler support for cache coherence in large-scale shared-memory multiprocessors|
|Doctoral Committee Chair(s):||Padua, David A.|
|Department / Program:||Computer Science|
|Degree Granting Institution:||University of Illinois at Urbana-Champaign|
|Subject(s):||Engineering, Electronics and Electrical; Engineering, System Science|
|Abstract:||Reducing memory latency is critical to the performance of large-scale parallel systems. Due to the temporal and spatial locality of memory reference patterns, private caches can eliminate redundant memory accesses and thereby reduce both average memory latency and network traffic. However, maintaining cache coherence for such systems is still a challenge. Hardware directories can be very effective, but are too expensive for large-scale multiprocessors.
As an alternative, compiler-directed techniques (4, 5, 6, 7, 8, 9, 10, 11, 14) can be used to maintain coherence. In this approach, cache coherence is maintained locally without directory hardware, thus avoiding the complexity and overhead associated with hardware directories. Although the performance of such schemes has been demonstrated through simulations, most of the studies assume either perfect compile-time analysis or analytical models without real compiler implementations (1, 3, 9, 10, 12, 13). It is still unknown how effectively the compiler can detect potentially stale references and what kind of performance can be obtained using a real compiler. Also, most of the compiler-directed coherence schemes proposed to date have not addressed the real cost of the required hardware support. For example, many of the schemes require expensive hardware support and assume a cache organization with single-word cache lines.
This dissertation addresses these hardware and compiler implementation issues and investigates the feasibility and performance of the compiler-directed cache coherence approach. We propose a new compiler-directed scheme that can be implemented on a large-scale multiprocessor using off-the-shelf microprocessors. The scheme can be adapted to various cache organizations, including multi-word cache lines and byte-addressable architectures. Several system-related issues, including critical sections, inter-thread communication, and task migration, have also been addressed. The cost of the required hardware support is minimal and proportional to the cache size. The necessary compiler algorithms, including intra- and interprocedural array data flow analysis, have been developed and implemented in the Polaris parallelizing compiler, and experimental results on the Perfect Club benchmarks (2) are discussed.
|Rights Information:||Copyright 1996 Choi, Lynn|
|Date Available in IDEALS:||2011-05-07|
|Identifier in Online Catalog:||AAI9702479|