Code generation and runtime techniques for enabling data-efficient deep learning training on GPUs
Wu, Kun
Permalink
https://hdl.handle.net/2142/127142
Description
- Title
- Code generation and runtime techniques for enabling data-efficient deep learning training on GPUs
- Author(s)
- Wu, Kun
- Issue Date
- 2024-12-05
- Director of Research (if dissertation) or Advisor (if thesis)
- Hwu, Wen-mei
- Doctoral Committee Chair(s)
- Hwu, Wen-mei
- Committee Member(s)
- Chen, Deming
- Lumetta, Steven Sam
- Patel, Sanjay
- Department of Study
- Electrical & Computer Engineering
- Discipline
- Electrical & Computer Engineering
- Degree Granting Institution
- University of Illinois at Urbana-Champaign
- Degree Name
- Ph.D.
- Degree Level
- Dissertation
- Keyword(s)
- data intensiveness, GPU, code generation, PCIe, deep learning, LLM, GNN, SSD, performance optimizations, PyTorch
- Abstract
- As deep learning models scale, their training cost has surged significantly. Due to both hardware advancements and limitations in current software stacks, the need for data efficiency has risen. Data efficiency refers to the effective hiding of data access latency and the avoidance of unnecessary data movement. Major challenges arise from the growing disparity between GPU memory bandwidth and computational throughput, imminent GPU memory capacity limitations, and inefficiencies in the PyTorch software stack, including a lack of device-specific PCIe transfer optimizations and of high-level domain-specific abstractions. To mitigate these data inefficiencies in deep learning training, this dissertation analyzes data inefficiency in representative deep learning training tasks, specifically graph neural networks (GNNs) and large language models (LLMs). It then proposes novel runtime and code generation techniques to mitigate these challenges and implements the optimizations within the PyTorch stack while maintaining strong programmability and interoperability. First, PyTorch-Direct is devised to incorporate a GPU-centric PCIe data transfer paradigm into PyTorch for GNN training. PyTorch-Direct significantly reduces CPU utilization, resulting in higher end-to-end training performance; for the input datasets and GNN architectures evaluated, it decreases overall training time by up to 38.2%. Next, the Hector intermediate representation (IR) and its code generator are proposed to introduce domain-specific high-level abstractions and to systematically address memory-intensive performance challenges for relational graph neural networks (RGNNs). These challenges stem from RGNNs' inherent memory intensiveness, the gap between the programming interface and the kernel APIs, and the high cost of kernel optimization caused by the kernels' coupling with data layout and heterogeneity. Using a general matrix multiply (GEMM) template and a traversal template, Hector achieves up to a 43.7× speed-up in training and inference over state-of-the-art systems; linear operator reordering and compact tensor materialization yield a further speed-up of up to 3.8× over unoptimized Hector code. Finally, LLM training throughput has been increasingly constrained by GPU memory capacity. To mitigate this, the SSDTrain offloading framework is designed and implemented. Since activations consume most GPU memory, SSDTrain offloads activations to Non-Volatile Memory Express (NVMe) SSDs with a direct GPU–SSD data path and good interoperability. The evaluation shows that SSDTrain reduces peak activation memory use by up to 47% with negligible overhead. We further analyze how the freed activation memory can be leveraged to increase throughput by enlarging micro-batch sizes and reducing pipeline parallelism bubbles. Together, these contributions demonstrate that code generation and runtime techniques can systematically mitigate the data management bottlenecks in deep learning training, bottlenecks that stem from the data-intensive nature of the workloads and the oversimplification inherent in the deep learning software stack. (Illustrative code sketches of the three techniques appear below this record.)
- Graduation Semester
- 2024-12
- Type of Resource
- Thesis
- Handle URL
- https://hdl.handle.net/2142/127142
- Copyright and License Information
- Copyright 2024 Kun Wu
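
A minimal sketch of the data-transfer contrast that PyTorch-Direct targets, written in stock PyTorch. The GPU-centric path is only approximated here with pinned host memory and an asynchronous copy; PyTorch-Direct's actual unified tensors go further by letting GPU threads fetch individual feature rows from host memory over PCIe without any CPU-side gather. Tensor names and sizes below are illustrative, not from the dissertation.

    import torch

    num_nodes, feat_dim = 1_000_000, 256
    features = torch.randn(num_nodes, feat_dim)   # node features kept in host RAM
    idx = torch.randint(num_nodes, (1024,))       # minibatch node IDs

    # CPU-centric baseline: the CPU gathers the rows into pageable memory,
    # and only then copies the slice to the GPU (extra staging copy, busy CPU).
    batch = features[idx].to("cuda")

    # Closer to the GPU-centric paradigm: keep the data in pinned (page-locked)
    # host memory and gather into a pinned staging buffer, so the copy engine
    # can DMA directly and the transfer overlaps GPU work. PyTorch-Direct
    # removes even the CPU-side gather via zero-copy GPU access.
    features_pinned = features.pin_memory()
    staging = torch.empty(1024, feat_dim).pin_memory()
    torch.index_select(features_pinned, 0, idx, out=staging)
    batch = staging.to("cuda", non_blocking=True)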
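A sketch of two ideas behind Hector's GEMM template, using plain PyTorch tensors (the graph sizes and variable names are made up). Grouping edges by relation turns many tiny per-edge transforms into one dense matrix multiply per relation, and linear operator reordering transforms each node once rather than once per incident edge; Hector's generated CUDA kernels go further, e.g., with compact tensor materialization to avoid per-edge intermediates.

    import torch

    num_nodes, num_edges, num_rels = 1000, 5000, 3
    in_dim, out_dim = 64, 32
    x = torch.randn(num_nodes, in_dim)             # node features
    W = torch.randn(num_rels, in_dim, out_dim)     # one weight matrix per relation
    src = torch.randint(num_nodes, (num_edges,))   # edge sources
    dst = torch.randint(num_nodes, (num_edges,))   # edge destinations
    rel = torch.randint(num_rels, (num_edges,))    # edge relation types

    # Naive formulation: materialize one gathered row and one batched-matmul
    # operand per edge (memory-intensive, many small multiplies).
    msg = torch.bmm(x[src].unsqueeze(1), W[rel]).squeeze(1)

    # GEMM-template formulation: sort edges so each relation forms a contiguous
    # segment, run one large GEMM per relation, and reorder the linear operator
    # to transform nodes before the per-edge gather.
    out = torch.zeros(num_nodes, out_dim)
    order = torch.argsort(rel)
    src_s, dst_s, rel_s = src[order], dst[order], rel[order]
    for r in range(num_rels):
        e = rel_s == r
        xr = x @ W[r]                              # one GEMM per relation
        out.index_add_(0, dst_s[e], xr[src_s[e]])  # scatter-aggregate messages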
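A simplified sketch of activation offloading in the spirit of SSDTrain, built on PyTorch's stock saved-tensors hooks. SSDTrain itself uses a direct GPU–SSD data path with asynchronous, overlapped I/O; this version stages through host memory with synchronous file I/O, and the offload directory is an assumed stand-in for an NVMe SSD mount.

    import os, tempfile, uuid
    import torch

    OFFLOAD_DIR = tempfile.mkdtemp(prefix="act_offload_")  # stand-in for an NVMe mount

    def pack(t):
        # Called during the forward pass for every tensor saved for backward:
        # spill it to the SSD and keep only a lightweight record in memory.
        path = os.path.join(OFFLOAD_DIR, uuid.uuid4().hex + ".pt")
        torch.save(t.detach().cpu(), path)
        return (path, t.device)

    def unpack(record):
        # Called during the backward pass when the saved tensor is needed again.
        path, device = record
        t = torch.load(path)
        os.remove(path)
        return t.to(device)

    model = torch.nn.Sequential(torch.nn.Linear(512, 512), torch.nn.ReLU(),
                                torch.nn.Linear(512, 10))
    x = torch.randn(32, 512, requires_grad=True)
    with torch.autograd.graph.saved_tensors_hooks(pack, unpack):
        loss = model(x).sum()   # activations are offloaded as they are saved
    loss.backward()             # activations stream back in during backward

Freeing activation memory this way is what lets the full framework raise micro-batch sizes and shrink pipeline parallelism bubbles, as discussed in the abstract.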
Owning Collections
Graduate Dissertations and Theses at Illinois (primary)