Files in this item

File: ELHAJJ-DISSERTATION-2018.pdf (3MB)
Description: (no description provided)
Format: PDF (application/pdf)

Description

Title: Techniques for optimizing dynamic parallelism on graphics processing units
Author(s): El Hajj, Izzat
Director of Research: Hwu, Wen-mei W
Doctoral Committee Chair(s): Hwu, Wen-mei W
Doctoral Committee Member(s): Chen, Deming; Lumetta, Steven S; Milojicic, Dejan S
Department / Program: Electrical & Computer Eng
Discipline: Electrical & Computer Engr
Degree Granting Institution: University of Illinois at Urbana-Champaign
Degree: Ph.D.
Genre: Dissertation
Subject(s): Graphics Processing Units; Dynamic Parallelism; Compilers; CUDA
Abstract: Dynamic parallelism is a feature of general-purpose graphics processing units (GPUs) whereby threads running on a GPU can spawn other threads without CPU intervention. This feature is useful for programming applications with nested parallelism, where threads executing in parallel may each identify additional work that can itself be parallelized. Unfortunately, current GPU microarchitectures do not efficiently support using dynamic parallelism for accelerating applications with nested parallelism due to the high overhead of grid launches, the limited number of grids that can execute simultaneously, and the limited supported depth of the dynamic call stack. The compiler techniques presented herein improve the performance of applications with nested parallelism that use dynamic parallelism by mitigating the aforementioned microarchitectural limitations. Horizontal aggregation fuses grids launched by threads in the same warp, block, or grid into a single aggregated grid, thereby reducing the total number of grids launched and increasing the amount of work per grid to improve occupancy. Vertical aggregation fuses grids down the call stack with their descendant grids, again reducing the total number of grids launched but also reducing the depth of the call stack and removing grid launches from the application's critical path. Evaluation of these compiler techniques shows that they result in substantial performance improvement over regular dynamic parallelism for benchmarks representing common nested parallelism patterns. This observation has held true across multiple architecture generations, showing the continued relevance of these techniques. This work shows that to make dynamic parallelism practical for accelerating applications with nested parallelism, compiler transformations can be used to aggregate dynamically launched grids, thereby amortizing their launch overhead and improving their occupancy, without the need for additional hardware support.
Issue Date: 2018-12-06
Type: Thesis
URI: http://hdl.handle.net/2142/102488
Rights Information: Copyright 2018 Izzat El Hajj
Date Available in IDEALS: 2019-02-06
Date Deposited: 2018-12
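
The abstract above describes the transformations at a high level. The sketch below is hypothetical illustrative code, not code from the dissertation: it contrasts baseline CUDA dynamic parallelism, where each parent thread launches its own child grid, with a hand-written warp-level version of horizontal aggregation, where one leader lane launches a single fused child grid for its warp. All names (child, childFused, parentBaseline, parentAggregated) and the contiguous-segment work layout are assumptions made for this example; the dissertation's compiler performs the aggregation automatically and also operates at block and grid granularity.

// A minimal sketch, assuming a CUDA toolchain with device-side launches
// enabled (compute capability 3.5+), e.g.:
//   nvcc -rdc=true -arch=sm_70 sketch.cu -lcudadevrt -o sketch
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

// Child grid: processes one parent thread's contiguous segment of work.
__global__ void child(const float *work, int offset, int count) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < count) {
        volatile float v = work[offset + i];  // stand-in for real work
        (void)v;
    }
}

// Fused child grid: covers the combined work of up to 32 parents (one warp).
// Each thread walks the per-parent counts to map its fused index back to a
// (parent, position) pair; with at most 32 segments this loop is cheap.
__global__ void childFused(const float *work, const int *offsets,
                           const int *counts, int firstParent, int numParents) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    for (int p = 0; p < numParents; ++p) {
        int c = counts[firstParent + p];
        if (i < c) {
            volatile float v = work[offsets[firstParent + p] + i];
            (void)v;
            return;
        }
        i -= c;
    }
}

// Baseline dynamic parallelism: every parent thread that finds nested work
// launches its own child grid, paying full launch overhead each time and
// producing many small, low-occupancy grids.
__global__ void parentBaseline(const float *work, const int *offsets,
                               const int *counts, int n) {
    int t = blockIdx.x * blockDim.x + threadIdx.x;
    if (t < n && counts[t] > 0)
        child<<<(counts[t] + 255) / 256, 256>>>(work, offsets[t], counts[t]);
}

// Warp-level horizontal aggregation, written out by hand: lanes pool their
// child work sizes with a shuffle reduction, and lane 0 launches one fused
// grid for the whole warp. Assumes blockDim.x is a multiple of 32.
__global__ void parentAggregated(const float *work, const int *offsets,
                                 const int *counts, int n) {
    int t = blockIdx.x * blockDim.x + threadIdx.x;
    int total = (t < n) ? counts[t] : 0;
    for (int d = 16; d > 0; d >>= 1)             // warp-wide sum of counts
        total += __shfl_down_sync(0xffffffffu, total, d);
    if ((threadIdx.x & 31) == 0 && total > 0) {  // lane 0: one launch, not 32
        int numParents = min(32, n - t);         // t is lane 0's global index
        childFused<<<(total + 255) / 256, 256>>>(work, offsets, counts,
                                                 t, numParents);
    }
}

int main() {
    const int n = 1024, per = 100;               // toy problem sizes
    std::vector<int> hOff(n), hCnt(n);
    for (int t = 0; t < n; ++t) { hOff[t] = t * per; hCnt[t] = per; }
    float *work; int *offsets, *counts;
    cudaMalloc(&work, n * per * sizeof(float));
    cudaMalloc(&offsets, n * sizeof(int));
    cudaMalloc(&counts, n * sizeof(int));
    cudaMemcpy(offsets, hOff.data(), n * sizeof(int), cudaMemcpyHostToDevice);
    cudaMemcpy(counts, hCnt.data(), n * sizeof(int), cudaMemcpyHostToDevice);
    parentBaseline<<<(n + 255) / 256, 256>>>(work, offsets, counts, n);
    parentAggregated<<<(n + 255) / 256, 256>>>(work, offsets, counts, n);
    cudaDeviceSynchronize();
    printf("status: %s\n", cudaGetErrorString(cudaGetLastError()));
    return 0;
}

The aggregated version trades a warp shuffle reduction and a per-thread routing step in the fused child for up to a 32x reduction in grid launches, which is the launch-overhead amortization and occupancy improvement the abstract refers to. Vertical aggregation (fusing a grid with its descendant grids down the call stack) is not shown here, since it restructures the launch chain itself rather than pooling sibling launches.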

