Director of Research
Hwu, Wen-mei
Doctoral Committee Chair(s)
Hwu, Wen-mei
Committee Member(s)
Patel, Sanjay
Chen, Deming
Lumetta, Steven S.
Department of Study
Electrical and Computer Engineering
Discipline
Electrical and Computer Engineering
Degree Granting Institution
University of Illinois at Urbana-Champaign
Degree Name
Ph.D.
Degree Level
Dissertation
Keyword(s)
GPU
parallel computing
GNN
storage
large-scale ML
distributed system
Abstract
Graph Neural Networks (GNNs) are widely used in applications such as recommendation systems, fraud detection, and node/link classification. As the real-world graphs and embeddings used for GNN training continue to grow, their memory footprint often exceeds the memory capacity of GPUs, making memory the bottleneck in efficient training. Traditional GNN training frameworks address this limit either by storing feature data in external storage and fetching it on demand, or by sharding the graph across multiple GPUs and transferring data as needed. However, the first approach suffers from high storage latency, while the second is burdened by the high computational cost of graph partitioning, excessive inter-GPU communication, and increased total cost of ownership.
To address these challenges, this dissertation introduces three storage-based GNN frameworks, GIDS, LSM-GNN, and SSD-GNN, which target single-GPU, multi-GPU, and multi-node environments, respectively. GIDS accelerates single-GPU GNN training by leveraging GPU thread parallelism to hide storage latency. LSM-GNN extends this approach to multi-GPU settings with a system-wide shared cache over NVLink, improving memory bandwidth utilization and cache hit rates without graph partitioning. SSD-GNN further scales training to multiple nodes by combining GPU-initiated direct storage access with a distributed caching protocol that reduces inter-node data movement.
These frameworks exploit heterogeneous hardware resources, including SSDs, CPU memory, GPU thread parallelism, and PCIe bandwidth, to streamline data transfers and hide storage latency. Prototypes of all three systems demonstrate significant improvements in GNN training performance and reductions in total cost of ownership over state-of-the-art systems, charting a path toward scalable, storage-efficient GNN training across a range of compute environments.
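The unifying mechanism behind all three frameworks is GPU-thread-parallel feature fetching: thousands of independent fetches are kept in flight at once so that no single storage access stalls training. The minimal CUDA sketch below illustrates only this latency-hiding principle; the kernel, buffer names, and sizes are hypothetical, and a mapped pinned host buffer stands in for SSD-backed feature storage, whereas the actual GIDS and SSD-GNN data path issues NVMe reads directly from GPU threads.

// Hypothetical sketch: GPU warps gather feature rows for sampled nodes from a
// pinned, GPU-mapped host buffer that stands in for SSD-backed feature storage.
// Only the access pattern matters here; buffer contents are left uninitialized.
#include <cuda_runtime.h>
#include <cstdio>
#include <cstdlib>

#define FEAT_DIM 128        // feature width per node (illustrative)
#define NUM_NODES (1 << 18) // nodes held in the stand-in "external" store
#define BATCH 4096          // sampled nodes per training mini-batch

// One warp per sampled node: 32 lanes cooperatively copy one feature row.
// Each pending load is an independent in-flight request, so thousands of
// warps overlap their transfer latencies instead of waiting one by one.
__global__ void gather_features(const float *__restrict__ store,
                                const int *__restrict__ node_ids,
                                float *__restrict__ batch_feats) {
    int warp = (blockIdx.x * blockDim.x + threadIdx.x) / 32;
    int lane = threadIdx.x % 32;
    if (warp >= BATCH) return;
    const float *src = store + (size_t)node_ids[warp] * FEAT_DIM;
    float *dst = batch_feats + (size_t)warp * FEAT_DIM;
    for (int i = lane; i < FEAT_DIM; i += 32)
        dst[i] = src[i];
}

int main() {
    size_t store_bytes = (size_t)NUM_NODES * FEAT_DIM * sizeof(float);
    float *store_h, *store_d;
    // Pinned, GPU-mapped host memory stands in for slow external storage.
    cudaHostAlloc((void **)&store_h, store_bytes, cudaHostAllocMapped);
    cudaHostGetDevicePointer((void **)&store_d, store_h, 0);

    int *ids;
    float *feats;
    cudaMallocManaged((void **)&ids, BATCH * sizeof(int));
    cudaMalloc((void **)&feats, (size_t)BATCH * FEAT_DIM * sizeof(float));
    for (int i = 0; i < BATCH; i++)
        ids[i] = rand() % NUM_NODES; // stand-in for a neighborhood sampler

    int threads = 256;
    int blocks = (BATCH * 32 + threads - 1) / threads; // one warp per node
    gather_features<<<blocks, threads>>>(store_d, ids, feats);
    cudaDeviceSynchronize();
    printf("gathered %d feature rows of width %d\n", BATCH, FEAT_DIM);

    cudaFree(feats);
    cudaFree(ids);
    cudaFreeHost(store_h);
    return 0;
}

With 4,096 warps each issuing independent loads, transfer latencies overlap rather than accumulate, which is the same overlap the frameworks exploit at NVMe scale to keep GPUs fed during training.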