Files in this item



Xuehai_Qian.pdf (application/pdf, 2MB)


Title: Scalable and flexible bulk architecture
Author(s): Qian, Xuehai
Director of Research: Torrellas, Josep
Doctoral Committee Chair(s): Torrellas, Josep
Doctoral Committee Member(s): Snir, Marc; Hwu, Wen-Mei W.; Padua, David A.; Stenstrom, Per; Narayanasamy, Satish
Department / Program: Computer Science
Discipline: Computer Science
Degree Granting Institution: University of Illinois at Urbana-Champaign
Subject(s): Memory consistency model; Chunk-based Architecture; Cache Coherence
Abstract: Multicore machines have become pervasive and, as a result, parallel programming has received renewed interest. Unfortunately, writing correct parallel programs is notoriously hard. Looking ahead, multicore designs should treat support for programmability and productivity as a first-class design consideration. This thesis focuses on efficient and scalable architectural support for improving the programmability of shared-memory architectures. Specifically, it focuses on supporting Sequential Consistency (SC), a strong and intuitive memory consistency model. The first part of the thesis enforces SC through chunk-based execution: I propose techniques to remove the scalability bottlenecks of chunk-based architectures, and I propose the design of an SMT processor that supports chunk operations among the contexts of the same processor. The second part enforces high-performance whole-system SC, from language to architecture, through speculative chunk ordering. The third part dynamically and precisely detects SC violations in a directory-based cache coherence protocol.
For chunk-based execution to be competitive, a machine must support chunk operations very efficiently. My research focuses on an environment with lazy conflict detection. In this environment, a major bottleneck in a large manycore with directory-based coherence is the chunk commit operation. The reason is that a chunk must appear to commit all of its writes atomically, even though the addresses written by the chunk belong to different, distributed directory modules. In addition, the commit may have to compete against other committing chunks that have accessed some of the same addresses, which prohibits concurrent commit. To resolve this commit bottleneck, I propose two scalable chunk-commit protocols.
The first protocol, called ScalableBulk, introduces a set of primitives that enable a scalable coherence protocol designed for chunks. Specifically, ScalableBulk is the first work that integrates signatures into the directory design. Signatures enable the concurrent commit of any number of chunks that use the same directory module, as long as their addresses do not overlap. In addition, ScalableBulk introduces a commit protocol that groups all the directories relevant to a chunk in a way that ensures: (i) multiple groups of directories with non-overlapping addresses can form concurrently, and (ii) if the directory groups have overlapping addresses, at least one of the groups forms.
The second protocol, called IntelliSense, targets two inefficiencies in ScalableBulk. First, a ScalableBulk commit grabs the relevant directory modules sequentially to ensure deadlock freedom, which may incur long latency for large directory groups. Second, two chunks with cross-processor write-after-write (WAW) dependences between them cannot commit concurrently; one squashes the other, even though these are only name dependences. To solve the first problem, I propose the IntelliCommit protocol, where a commit grabs all the relevant directory modules in parallel. The committing processor sends commit-request protocol messages to all of the relevant directory modules in parallel, receives their responses directly, and finally sends them a commit-confirm message. To solve the second problem, I propose the IntelliSquash mechanism. It uses an idea similar to the store buffers in current processors to serialize the commits of two chunks that only have WAW dependences, so that the write sets of the two chunks are applied one after the other without any squash.
To support chunk-based execution in Simultaneous Multithreading (SMT) processors, I propose BulkSMT [59]. It exploits the close proximity of the contexts in an SMT processor to concurrently run dependent chunks with simple hardware. I perform a broad design space analysis, which includes three different schemes for conflict resolution inside the SMT processor. As a result of the analysis, I show for the first time that SMT processors are very cost-effective in supporting the concurrent execution of dependent chunks.
Chunk-based execution is effective at enforcing SC in hardware. However, since a memory model deals with the whole computing stack, its semantics are well-defined only when the model is specified and enforced consistently in every layer, from the language to the hardware. Therefore, to harness the benefits of SC, hardware-only SC enforcement is not sufficient: the software can easily violate SC even if the hardware implementation is correct. For correctness, we need to guarantee SC in every system layer, which is called whole-system SC. To enable high-performance whole-system SC, I propose UniBlock, the first scheme built from a conventional distributed cache coherence protocol that prevents SC violations due to hardware and software with the same set of techniques. The basic concept in UniBlock is the ordered chunk, which is used by the hardware as the mechanism to enforce hardware SC, and by the compiler as the specification that guides the scope of compiler optimizations that could violate SC. Starting from a conventional relaxed-consistency coherence protocol, UniBlock forms intermittent dynamic chunks when the speculative retirement of an instruction may violate SC. The compiler also marks the optimized code regions as static chunks to ensure correct execution. UniBlock treats static and dynamic chunks in a unified manner, and cleanly supports whole-system SC.
The above techniques enforce SC and involve some changes to the cache coherence protocol. The last part of this thesis instead detects SC violations on top of a conventional cache coherence protocol. To address this problem, I propose Volition [60], the first scalable and precise hardware SC violation (SCV) detector that detects SCVs involving an arbitrary number of processors. Volition detects SCVs dynamically as a program runs. While it can be applied to both directory-based and bus-based coherence protocols, it does not rely on any property that is only available in a bus, such as broadcast. Volition's idea is to dynamically detect data-dependence cycles across processors by piggybacking information on the coherence transactions. When an SCV is detected, an exception is raised to the software. For a given dynamic execution, Volition suffers no false positives or negatives.
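The cycle-based view of SC violations mentioned for Volition can be illustrated with a small software sketch. This is not Volition's hardware mechanism: it only shows, under assumed names (`has_cycle`, the `P0:W x` style labels), that an execution whose program-order edges and observed cross-processor dependence edges form a cycle cannot be explained by any sequentially consistent interleaving. The example uses the classic store-buffering pattern, which is an illustration chosen here and is not taken from the thesis.

```python
from collections import defaultdict


def has_cycle(edges):
    """Return True if the directed graph given as (src, dst) pairs has a cycle."""
    graph = defaultdict(list)
    for src, dst in edges:
        graph[src].append(dst)

    WHITE, GRAY, BLACK = 0, 1, 2      # unvisited, on the DFS stack, finished
    color = defaultdict(int)          # every node starts as WHITE (0)

    def visit(node):
        color[node] = GRAY
        for nxt in graph[node]:
            if color[nxt] == GRAY:    # back edge: nxt is still on the stack
                return True
            if color[nxt] == WHITE and visit(nxt):
                return True
        color[node] = BLACK
        return False

    # Snapshot the keys, because visit() may add sink nodes to the graph.
    return any(color[n] == WHITE and visit(n) for n in list(graph))


# Hypothetical two-processor program (labels are illustrative only):
#   P0: W x ; R y        P1: W y ; R x
program_order = [("P0:W x", "P0:R y"), ("P1:W y", "P1:R x")]

# Execution A: both reads return the old values, so each read is ordered
# before the other processor's write to the same address.
deps_a = [("P0:R y", "P1:W y"), ("P1:R x", "P0:W x")]
sc_violation = has_cycle(program_order + deps_a)    # True: a cycle exists

# Execution B: P0 runs entirely before P1, so P1's read of x sees P0's write.
deps_b = [("P0:R y", "P1:W y"), ("P0:W x", "P1:R x")]
sc_consistent = has_cycle(program_order + deps_b)   # False: no cycle
```

In execution A the edges form the cycle W x, R y, W y, R x, W x, which is why that outcome is an SC violation; execution B corresponds to a legal interleaving and has no cycle.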
Issue Date: 2013-08-22
Rights Information: Copyright 2013 Xuehai Qian
Date Available in IDEALS: 2013-08-22
Date Deposited: 2013-08
