Code generation and runtime techniques for enabling data-efficient deep learning training on GPUs
Wu, Kun
Permalink
https://hdl.handle.net/2142/127142
Description
- Title
- Code generation and runtime techniques for enabling data-efficient deep learning training on GPUs
- Author(s)
- Wu, Kun
- Issue Date
- 2024-12-05
- Director of Research (if dissertation) or Advisor (if thesis)
- Hwu, Wen-mei
- Doctoral Committee Chair(s)
- Hwu, Wen-mei
- Committee Member(s)
- Chen, Deming
- Lumetta, Steven Sam
- Patel, Sanjay
- Department of Study
- Electrical & Computer Engineering
- Discipline
- Electrical & Computer Engineering
- Degree Granting Institution
- University of Illinois at Urbana-Champaign
- Degree Name
- Ph.D.
- Degree Level
- Dissertation
- Keyword(s)
- data intensiveness, GPU, code generation, PCIe, deep learning, LLM, GNN, SSD, performance optimizations, PyTorch
- Abstract
- As deep learning models scale, their training cost has surged significantly. Due to both hardware advancements and limitations in current software stacks, the need for data efficiency has risen. Data efficiency refers to the effective hiding of data access latency and the avoidance of unnecessary data movement. Major challenges arise from the growing disparity between GPU memory bandwidth and computational throughput, imminent GPU memory capacity limitations, and inefficiencies in the PyTorch software stack, including a lack of device-specific PCIe transfer optimizations and of high-level domain-specific abstractions. To mitigate these data inefficiencies in deep learning training, this dissertation analyzes data inefficiency in representative deep learning training tasks, specifically graph neural networks (GNNs) and large language models (LLMs). It then proposes novel runtime and code generation techniques to mitigate these challenges and implements the optimizations within the PyTorch stack while maintaining strong programmability and interoperability. First, PyTorch-Direct is devised to incorporate a GPU-centric PCIe data transfer paradigm into PyTorch for GNN training. PyTorch-Direct significantly reduces CPU utilization, resulting in higher end-to-end training performance; for the input datasets and GNN architectures evaluated, it decreases overall training time by up to 38.2%. Next, the Hector intermediate representation (IR) and its code generator are proposed to introduce domain-specific high-level abstractions and to systematically address memory-intensive performance challenges for relational graph neural networks (RGNNs). These challenges stem from RGNNs' inherent memory intensiveness, the gap between the programming interface and the kernel APIs, and the high cost of kernel optimization caused by the kernels' coupling with data layout and heterogeneity. Using a general matrix multiply (GEMM) template and a traversal template, Hector achieves up to a 43.7× speed-up in training and inference over state-of-the-art systems; linear operator reordering and compact tensor materialization yield a further speed-up of up to 3.8× over unoptimized Hector code. Finally, LLM training throughput has been increasingly constrained by GPU memory capacity. To mitigate this, the SSDTrain offloading framework is designed and implemented. Since activations consume most GPU memory, SSDTrain offloads activations to Non-Volatile Memory Express (NVMe) SSDs with a direct GPU–SSD data path and good interoperability. The evaluation shows that SSDTrain reduces peak activation memory use by up to 47% with negligible overhead. We further analyze how the freed activation memory can be leveraged to increase throughput by enlarging micro-batch sizes and reducing pipeline parallelism bubbles. Together, these contributions demonstrate that code generation and runtime techniques can systematically mitigate the data management bottlenecks in deep learning training, bottlenecks that stem from the data-intensive nature of the workloads and the oversimplification inherent in the deep learning software stack. (Illustrative code sketches of the three techniques appear below this record.)
- Graduation Semester
- 2024-12
- Type of Resource
- Thesis
- Handle URL
- https://hdl.handle.net/2142/127142
- Copyright and License Information
- Copyright 2024 Kun Wu
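
A minimal sketch of the data-transfer contrast that PyTorch-Direct targets, written in stock PyTorch. The GPU-centric path is only approximated here with pinned host memory and an asynchronous copy; PyTorch-Direct's actual unified tensors go further by letting GPU threads fetch individual feature rows from host memory over PCIe without any CPU-side gather. Tensor names and sizes below are illustrative, not from the dissertation.

    import torch

    num_nodes, feat_dim = 1_000_000, 256
    features = torch.randn(num_nodes, feat_dim)   # node features kept in host RAM
    idx = torch.randint(num_nodes, (1024,))       # minibatch node IDs

    # CPU-centric baseline: the CPU gathers the rows into pageable memory,
    # and only then copies the slice to the GPU (extra staging copy, busy CPU).
    batch = features[idx].to("cuda")

    # Closer to the GPU-centric paradigm: keep the data in pinned (page-locked)
    # host memory and gather into a pinned staging buffer, so the copy engine
    # can DMA directly and the transfer overlaps GPU work. PyTorch-Direct
    # removes even the CPU-side gather via zero-copy GPU access.
    features_pinned = features.pin_memory()
    staging = torch.empty(1024, feat_dim).pin_memory()
    torch.index_select(features_pinned, 0, idx, out=staging)
    batch = staging.to("cuda", non_blocking=True)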
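A sketch of two ideas behind Hector's GEMM template, using plain PyTorch tensors (the graph sizes and variable names are made up). Grouping edges by relation turns many tiny per-edge transforms into one dense matrix multiply per relation, and linear operator reordering transforms each node once rather than once per incident edge; Hector's generated CUDA kernels go further, e.g., with compact tensor materialization to avoid per-edge intermediates.

    import torch

    num_nodes, num_edges, num_rels = 1000, 5000, 3
    in_dim, out_dim = 64, 32
    x = torch.randn(num_nodes, in_dim)             # node features
    W = torch.randn(num_rels, in_dim, out_dim)     # one weight matrix per relation
    src = torch.randint(num_nodes, (num_edges,))   # edge sources
    dst = torch.randint(num_nodes, (num_edges,))   # edge destinations
    rel = torch.randint(num_rels, (num_edges,))    # edge relation types

    # Naive formulation: materialize one gathered row and one batched-matmul
    # operand per edge (memory-intensive, many small multiplies).
    msg = torch.bmm(x[src].unsqueeze(1), W[rel]).squeeze(1)

    # GEMM-template formulation: sort edges so each relation forms a contiguous
    # segment, run one large GEMM per relation, and reorder the linear operator
    # to transform nodes before the per-edge gather.
    out = torch.zeros(num_nodes, out_dim)
    order = torch.argsort(rel)
    src_s, dst_s, rel_s = src[order], dst[order], rel[order]
    for r in range(num_rels):
        e = rel_s == r
        xr = x @ W[r]                              # one GEMM per relation
        out.index_add_(0, dst_s[e], xr[src_s[e]])  # scatter-aggregate messages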
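A simplified sketch of activation offloading in the spirit of SSDTrain, built on PyTorch's stock saved-tensors hooks. SSDTrain itself uses a direct GPU–SSD data path with asynchronous, overlapped I/O; this version stages through host memory with synchronous file I/O, and the offload directory is an assumed stand-in for an NVMe SSD mount.

    import os, tempfile, uuid
    import torch

    OFFLOAD_DIR = tempfile.mkdtemp(prefix="act_offload_")  # stand-in for an NVMe mount

    def pack(t):
        # Called during the forward pass for every tensor saved for backward:
        # spill it to the SSD and keep only a lightweight record in memory.
        path = os.path.join(OFFLOAD_DIR, uuid.uuid4().hex + ".pt")
        torch.save(t.detach().cpu(), path)
        return (path, t.device)

    def unpack(record):
        # Called during the backward pass when the saved tensor is needed again.
        path, device = record
        t = torch.load(path)
        os.remove(path)
        return t.to(device)

    model = torch.nn.Sequential(torch.nn.Linear(512, 512), torch.nn.ReLU(),
                                torch.nn.Linear(512, 10))
    x = torch.randn(32, 512, requires_grad=True)
    with torch.autograd.graph.saved_tensors_hooks(pack, unpack):
        loss = model(x).sum()   # activations are offloaded as they are saved
    loss.backward()             # activations stream back in during backward

Freeing activation memory this way is what lets the full framework raise micro-batch sizes and shrink pipeline parallelism bubbles, as discussed in the abstract.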
Owning Collections
Graduate Dissertations and Theses at Illinois (primary)