Communication and performance analysis in asynchronous many task runtime systems
Mor, Omri
Permalink
https://hdl.handle.net/2142/129856
Description
- Title
- Communication and performance analysis in asynchronous many task runtime systems
- Author(s)
- Mor, Omri
- Issue Date
- 2025-07-10
- Director of Research (if dissertation) or Advisor (if thesis)
- Snir, Marc
- Doctoral Committee Chair(s)
- Snir, Marc
- Committee Member(s)
- Gropp, William D
- Rauchwerger, Lawrence
- Bosilca, George
- Department of Study
- Siebel School of Computing and Data Science
- Discipline
- Computer Science
- Degree Granting Institution
- University of Illinois Urbana-Champaign
- Degree Name
- Ph.D.
- Degree Level
- Dissertation
- Keyword(s)
- High Performance Computing
- HPC
- Parallel Computing
- Asynchronous Many-Task
- AMT
- Task Runtime
- Directed Acyclic Graph
- DAG
- Task Graph
- Performance Analysis
- Communication
- Networking
- Message Passing Interface
- MPI
- Lightweight Communication Interface
- LCI
- InfiniBand
- ibverbs
- PaRSEC
- StarPU
- Cholesky Factorization
- Cholesky
- Tile Low-Rank
- TLR
- HiCMA
- Performance Bottleneck
- Bottleneck Analysis
- Critical Path
- Abstract
- High-Performance Computing (HPC) is necessary to support study and research across all disciplines of science, including, but certainly not limited to, physics, biology, chemistry, meteorology and climatology, mechanical and industrial engineering, and even various social sciences. Over the past decades, there have been massive changes in the hardware available on HPC clusters and supercomputers. Commodity server-grade processors have replaced processors specialized for the operations most commonly used in HPC applications. The end of Dennard scaling in the mid-2000s has led to increasingly many processor cores located on the same integrated circuit. To achieve better performance and power efficiency, most supercomputers now utilize various accelerators for computation and communication, leading to significant heterogeneity within each node. Communication hardware is increasingly “smart” and capable of executing operations that have traditionally been executed on the CPU. Algorithms and applications, too, have changed, becoming more dynamic and adaptive. These changes in HPC hardware have mandated changes in how applications must be written to achieve the best performance. Traditionally, applications have been written using a Bulk-Synchronous Parallel (BSP) programming model, which simplifies reasoning about application behavior. More recently, there has been a resurgence in Asynchronous Many-Task (AMT) computation models, which decompose the application into many tasks, with synchronization necessary only between tasks with data dependencies. These looser synchronization requirements allow applications written for AMT execution runtimes to achieve markedly better performance than their BSP predecessors, but their complex structure and dynamism have significant implications for the analysis of their performance problems and the identification of bottlenecks.
As its title suggests, this dissertation focuses on two aspects of AMT programming models, runtimes, and applications: communication and performance analysis. We first consider how inter-node communication in AMT models differs in design and structure from the communication that occurs in other parallel programming models. By analyzing the makeup and implementation of the communication layers in several current, actively developed implementations of the AMT programming model, we identify a core set of shared characteristics that are fundamental to the programming patterns resulting from the AMT model. Noting how these properties differ markedly from those of other parallel programming models, and especially from the communication traits of bulk-synchronous applications, we propose a lightweight communication interface formulated to directly provide the features required by AMT runtimes, with low overhead over the underlying hardware interfaces. Using this new communication infrastructure affords performance benefits to AMT applications. However, users cannot easily gain insight into why an application performs as it does or understand how to improve its performance: the asynchrony, dynamism, and unpredictability inherent in the AMT model make it very difficult to identify the root causes of performance bottlenecks, or, often, even to determine whether such a bottleneck exists; moreover, the very asynchrony that allows AMT applications to exhibit better performance and scalability means that different portions of the application and algorithm can be executing in parallel. As a solution, we present a novel analysis technique that uses information about task execution and communication status, alongside online estimates of critical-path progress, to diagnose the constraints and factors that bound the performance of an AMT application as they vary throughout its execution.
The efficacy of this approach is demonstrated by analyzing the performance of a semi-sparse linear algebra application: several distinct bottlenecks occurring at disparate stages of the computation, most of them not previously discovered, are identified and addressed.
- Graduation Semester
- 2025-08
- Type of Resource
- Thesis
- Handle URL
- https://hdl.handle.net/2142/129856
- Copyright and License Information
- © 2025 Omri Mor
Owning Collections
Graduate Dissertations and Theses at Illinois (primary)