Communication and performance analysis in asynchronous many task runtime systems
Mor, Omri
Permalink
https://hdl.handle.net/2142/129856
Description
- Title
- Communication and performance analysis in asynchronous many task runtime systems
- Author(s)
- Mor, Omri
- Issue Date
- 2025-07-10
- Director of Research (if dissertation) or Advisor (if thesis)
- Snir, Marc
- Doctoral Committee Chair(s)
- Snir, Marc
- Committee Member(s)
- Gropp, William D
- Rauchwerger, Lawrence
- Bosilca, George
- Department of Study
- Siebel School of Computing and Data Science
- Discipline
- Computer Science
- Degree Granting Institution
- University of Illinois Urbana-Champaign
- Degree Name
- Ph.D.
- Degree Level
- Dissertation
- Keyword(s)
- High Performance Computing
- HPC
- Parallel Computing
- Asynchronous Many-Task
- AMT
- Task Runtime
- Directed Acyclic Graph
- DAG
- Task Graph
- Performance Analysis
- Communication
- Networking
- Message Passing Interface
- MPI
- Lightweight Communication Interface
- LCI
- InfiniBand
- ibverbs
- PaRSEC
- StarPU
- Cholesky Factorization
- Cholesky
- Tile Low-Rank
- TLR
- HiCMA
- Performance Bottleneck
- Bottleneck Analysis
- Critical Path
- Abstract
- High-Performance Computing (HPC) is necessary to support study and research across all disciplines of science, including, but certainly not limited to, physics, biology, chemistry, meteorology and climatology, mechanical and industrial engineering, and even various social sciences. Over the past decades, there have been massive changes in the hardware available on HPC clusters and supercomputers. Commodity server-grade processors have replaced processors specialized for the operations most commonly used in HPC applications. The end of Dennard scaling in the mid-2000s has led to increasingly many processor cores located on the same integrated circuit. To achieve better performance and power efficiency, most supercomputers now utilize various accelerators for computation and communication, leading to significant heterogeneity within each node. Communication hardware is increasingly “smart” and capable of executing operations that have traditionally been executed on the CPU. Algorithms and applications, too, have changed, becoming more dynamic and adaptive. These changes in HPC hardware have mandated changes in how applications must be written to achieve the best performance. Traditionally, applications have been written using a Bulk-Synchronous Parallel (BSP) programming model, which simplifies reasoning about application behavior. More recently, there has been a resurgence in Asynchronous Many-Task (AMT) computation models, which decompose the application into many tasks, with synchronization necessary only between tasks with data dependencies. These looser synchronization requirements allow applications written for AMT execution runtimes to achieve markedly better performance than their BSP predecessors, but their complex structure and dynamism have significant implications for the analysis of their performance problems and the identification of bottlenecks.
As its title suggests, this dissertation focuses on two aspects of AMT programming models, runtimes, and applications: communication and performance analysis. We first consider how inter-node communication in AMT models differs in design and structure from the communication that occurs in other parallel programming models. By analyzing the makeup and implementation of the communication layers in several current, actively developed implementations of the AMT programming model, we identify a core set of shared characteristics that are fundamental to the programming patterns resulting from the AMT model. Noting how these properties differ markedly from those of other parallel programming models, and especially from the communication traits of bulk-synchronous applications, we propose a lightweight communication interface formulated to directly provide the features required by AMT runtimes, with low overhead over the underlying hardware interfaces. Using this new communication infrastructure affords performance benefits to AMT applications. However, users cannot easily gain insight into why an application performs as it does or understand how to improve its performance: the asynchrony, dynamism, and unpredictability inherent in the AMT model make it very difficult to identify the root causes of performance bottlenecks, or, often, even to determine whether such a bottleneck exists; moreover, the very asynchrony that allows AMT applications to exhibit better performance and scalability means that different portions of the application and algorithm can be executing in parallel. As a solution, we present a novel analysis technique that uses information about task execution and communication status, alongside online estimates of critical-path progress, to diagnose the constraints and factors that bound the performance of an AMT application as they vary throughout its execution.
The efficacy of this approach is demonstrated by analyzing the performance of a semi-sparse linear algebra application: several distinct bottlenecks occurring at disparate stages of the computation, most of them not previously discovered, are identified and addressed.
- Graduation Semester
- 2025-08
- Type of Resource
- Thesis
- Handle URL
- https://hdl.handle.net/2142/129856
- Copyright and License Information
- © 2025 Omri Mor
Owning Collections
Graduate Dissertations and Theses at Illinois (primary)