Files in this item

File:GARCIADEGONZALO-DISSERTATION-2020.pdf (3MB), restricted to U of Illinois
Description:(no description provided)
Format:PDF (application/pdf)

Description

Title:Techniques for enabling GPU code generation of low-level optimizations and dynamic parallelism from high-level abstractions
Author(s):Garcia de Gonzalo, Simon P
Director of Research:Hwu, Wen-mei
Doctoral Committee Chair(s):Hwu, Wen-mei
Doctoral Committee Member(s):Padua, David; Torrellas, Josep; Hammond, Simon
Department / Program:Computer Science
Discipline:Computer Science
Degree Granting Institution:University of Illinois at Urbana-Champaign
Degree:Ph.D.
Genre:Dissertation
Subject(s):Code Generation
Code Transformation
Parallelism
Parallel Algorithms
Dynamic Parallelism
Heterogeneity
Performance Portability
GPU
DSL
Graph Analytics
Abstract:The relentless demand for improvements in compute throughput and energy efficiency has driven HPC systems and cloud service providers to rely heavily on GPUs. In turn, the availability of GPUs has led scientists and application programmers to invest resources in porting their codes to GPUs. Currently, there are multiple ways to target GPUs for computation. They range from low-level C-style syntax, which provides full control and the highest performance at the cost of slow code development, to high-level DSLs, which abstract away the complexities of GPU programming and speed up code development at the cost of performance. Between these two extremes, GPU libraries, pragma-based annotations, and high-level frameworks attempt to bridge the gap between performance and productivity. Regardless of the strategy used to target GPUs, performance portability remains a challenge. Performance portability is tightly coupled to architectural differences across systems. Different GPU architectures deploy different implementations of certain instructions, such as atomic instructions, or incorporate new low-level primitives into an evolving ISA. Additionally, for many applications, achievable performance on any system is highly dependent on the input data being processed. Graph analytics is one such class of applications: it is characterized by irregular computation in which achievable performance depends on the sparsity of the input graph. Current strategies for dealing with performance portability across both hardware differences and input characteristics either require inefficient and time-consuming code rewriting, in the case of libraries and low-level languages, or are not exposed at all, in the case of DSLs and high-level programming frameworks. The work presented herein designs a new set of high-level APIs and qualifiers, as well as specialized Abstract Syntax Tree (AST) transformations for high-level programming languages and DSLs.
The proposed transformations enable warp shuffle instructions, atomic instructions (on global and shared memory), and GPU dynamic parallelism to be generated easily. A practical implementation of these transformations is built on Tangram, a high-level kernel synthesis framework. The performance of the automatically generated low-level instructions is compared against another high-level framework and a hand-written high-performance library over three generations of GPU architectures. The generated code shows up to 7.8x speedup over hand-written code. The new Tangram API that exposes GPU dynamic parallelism is used to implement four graph analytics benchmarks. On six real-world graphs, the Tangram-generated dynamic-parallelism code achieves speedups between 2x and 50x over the hand-written benchmarks. The speedups across different graph applications and input graphs are discussed in detail. Lastly, a triangle counting case study is performed to assess the performance of the newly possible Tangram-generated code that leverages all techniques presented in this thesis. The generated code outperforms a state-of-the-art implementation of triangle counting, a Graph Challenge finalist, by over 2x. On the whole, the work presented in this thesis demonstrates that code portability across different GPU hardware, and across different inputs for different applications, is possible from a high-level programming framework.
Issue Date:2020-07-16
Type:Thesis
URI:http://hdl.handle.net/2142/108570
Rights Information:Copyright 2020 Simon Garcia de Gonzalo
Date Available in IDEALS:2020-10-07
Date Deposited:2020-08