Files in this item

Files:ACUN-DISSERTATION-2017.pdf (4MB)
Description:(no description provided)
Format:PDF (application/pdf)

Description

Title:Mitigating variability in HPC systems and applications for performance and power efficiency
Author(s):Acun, Bilge
Director of Research:Kalé, Laxmikant V
Doctoral Committee Chair(s):Kalé, Laxmikant V
Doctoral Committee Member(s):Abdelzaher, Tarek; Torrellas, Josep; Beckman, Pete
Department / Program:Computer Science
Discipline:Computer Science
Degree Granting Institution:University of Illinois at Urbana-Champaign
Degree:Ph.D.
Genre:Dissertation
Subject(s):Power
Energy
Temperature
Frequency
High performance computing (HPC)
Variability
Data center
Performance
Supercomputer
Energy consumption
Power variation
Frequency variation
Temperature variation
Energy efficient algorithms
Cooling power
Fan control
Runtime systems
Load balancing
Dynamic runtimes
Manufacturing variations
Turbo-boost
Dynamic voltage and frequency scaling (DVFS)
Parallel computing
Abstract:Power consumption and process variability are two important, interconnected challenges for future-generation large-scale High Performance Computing (HPC) data centers. For example, current production petaflop supercomputers consume more than 10 megawatts of machine and cooling power, which costs millions of dollars every year. As HPC moves towards exascale computing, these costs will increase, and power consumption is expected to become a major concern. Both the dynamic behavior of HPC applications and the dynamic behavior of HPC systems make it challenging to optimize the performance and power efficiency of large-scale applications. Dynamic behavior of applications includes irregularity or load imbalance. Dynamic behavior of HPC systems includes thermal, power, and frequency variations among processors. Smart and adaptive runtime systems have great potential to handle these challenges transparently to the application. In this dissertation, I first analyze frequency, temperature, and power variations in large-scale HPC systems using thousands of cores and different applications. After identifying the cause of each of these variations, I propose solutions that mitigate them to improve performance and power efficiency. When analyzing frequency variation, I identify manufacturing-related intrinsic differences in the chips’ power efficiency as the cause of frequency variation under dynamic overclocking. I propose speed-aware dynamic load balancing strategies to mitigate the performance overhead due to frequency variation. When analyzing temperature variation, I focus on inefficiencies in fan-based air cooling systems. I propose proactive and decoupled fan control mechanisms that reduce both temperature variation and cooling power consumption by predicting core temperatures using a learning-based model. When analyzing power variation, I identify both static and dynamic manufacturing-related sources of power variation. I propose different variation-aware node assembly methods to mitigate the power variation. Finally, to reduce energy consumption, I propose a fine-grained runtime-based technique to mitigate application-level variations that are caused by the characteristics of the application itself (for example, applications with different kernel types or phases).
Issue Date:2017-12-06
Type:Text
URI:http://hdl.handle.net/2142/99502
Rights Information:Copyright 2017 Bilge Acun
Date Available in IDEALS:2018-03-13
2020-03-14
Date Deposited:2017-12

