IDEALS Home University of Illinois at Urbana-Champaign logo The Alma Mater The Main Quad

Scalable collective message-passing algorithms

Show full item record

Bookmark or cite this item: http://hdl.handle.net/2142/29839

Files in this item

File Description Format
PDF sack_paul.pdf (973KB) (no description provided) PDF
Title: Scalable collective message-passing algorithms
Author(s): Sack, Paul
Advisor(s): Gropp, William
Contributor(s): Kale, Laxmikant; Padua, David; Snir, Marc
Department / Program: Computer Science
Discipline: Computer Science
Degree Granting Institution: University of Illinois at Urbana-Champaign
Degree: Ph.D.
Genre: Doctoral
Subject(s): Supercomputing Message-passing collective algorithms
Abstract: Governments, universities, and companies expend vast resources building the top supercomputers. The processors and interconnect networks become faster, while the number of nodes grows exponentially. Problems of scale emerge, not least of which is collective performance. This thesis identifies and proposes solutions for two major scalability problems. Our first contribution is a novel algorithm for process-partitioning and remapping for exascale systems that has far better time and space scaling than known algorithms. Our evaluations predict an improvement of up to 60x for large exascale systems and arbitrary reduction in the large temporary buffer space required for generating new communicators. Our second contribution consists of several novel collective algorithms for Clos and torus networks. Known allgather, reduce-scatter, and composite algorithms for Clos networks suffer the worst congestion when the largest messages are exchanged, damaging performance. Known algorithms for torus networks use only one network port, regardless of how many are available. Unlike known algorithms, our algorithms have a small amount of redundant communication. Unlike known algorithms, our algorithms can be reordered so that congestion hinders small messages rather than large, and all ports can be fully used on multi-port torus networks. The redundant communication gives us this flexibility. On a 32k-node system, we deliver improvements of up to 11x for the reduce-scatter operation, when the native reduce-satter algorithm does not use special hardware, and 5.5x for the allgather operation. We show large improvements over native algorithms with as few as 16 processors.
Issue Date: 2012-02-06
Genre: thesis
URI: http://hdl.handle.net/2142/29839
Rights Information: Copyright 2011 Paul Sack
Date Available in IDEALS: 2012-02-06
Date Deposited: 2011-12
 

This item appears in the following Collection(s)

Show full item record

Item Statistics

  • Total Downloads: 195
  • Downloads this Month: 3
  • Downloads Today: 0

Browse

My Account

Information

Access Key