Files in this item

FilesDescriptionFormat

application/pdf

application/pdfKIM-DISSERTATION-2015.pdf (1MB)
(no description provided)PDF

Description

Title:Compiler and runtime techniques for bulk-synchronous programming models on CPU architectures
Author(s):Kim, Hee-Seok
Director of Research:Hwu, Wen-mei
Doctoral Committee Chair(s):Hwu, Wen-mei
Doctoral Committee Member(s):Boppart, Stephen; Chen, Deming; Lumetta, Steven
Department / Program:Electrical & Computer Engineering
Discipline:Electrical & Computer Engineering
Degree Granting Institution:University of Illinois at Urbana-Champaign
Degree:Ph.D.
Genre:Dissertation
Subject(s):OpenCL
compiler
Central Processing Unit (CPU)
performance portability
data locality
Abstract:The rising pressure to simultaneously improve performance and reduce power consumption is driving more heterogeneity into all aspects of computing devices. However, wide adoption of specialized computing devices such as GPUs and Xeon Phis comes with a programming challenge. A carefully optimized program that is well matched to the target hardware can run many times faster and more energy efficiently than one that is not. Ideally, programmers should write their code using a single programming model, and the compiler would transform the program to run optimally on the target architecture. In practice, however, programmers have to expend great effort to translate performance enjoyed on one platform to another. As such, single-source code-based portability has gained substantial momentum and OpenCL, a bulk-synchronous programming language, has become a popular choice, among others, to fulfill the need for portability. The assumed computing model of these languages is inevitably loosely coupled with an underlying architecture, obligating a combined compiler and runtime to find an efficient execution mapping from the input program onto the architecture which best exploits the hardware for performance. In this dissertation, I argue and demonstrate that obtaining high performance from executing OpenCL programs on CPU is feasible. In order to achieve the goal, I present compiler and runtime techniques to execute OpenCL programs on CPU architectures. First, I propose a compiler technique in which the execution of fine-grained parallel threads, called work-items, is collectively analyzed to consider the impact of scheduling them with respect to data locality. By analyzing the memory addresses accessed in a kernel, the technique can make better decisions on how to schedule work-items to construct better memory access patterns, thereby improving performance. The approach achieves geomean speedups of 3.32x over AMD's and 1.71x over Intel's state-of-the-art implementations on Parboil and Rodinia benchmarks. Second, I propose a runtime that allows a compiler to deposit differently optimized kernels to mitigate the stress on the compiler in deriving the most optimal code. The runtime systematically deploys candidate kernels on a small portion of the actual data to determine which achieves the best performance for the hardware-data combination. It exploits the fact that OpenCL programs typically come with a large number of independent work-groups, a feature that amortizes the cost of profiling execution of a few work-items, while the overhead is further reduced by retaining the profiling execution result to constitute the final execution output. The proposed runtime performs with an average overhead of 3% compared to an ideal/oracular runtime in execution time.
Issue Date:2015-11-12
Type:Thesis
URI:http://hdl.handle.net/2142/88987
Rights Information:Copyright 2015 Hee-Seok Kim
Date Available in IDEALS:2016-03-02
Date Deposited:2015-12


This item appears in the following Collection(s)

Item Statistics