Files in this item

File: Osman_Sarood.pdf (1MB)
Description: (no description provided)
Format: PDF (application/pdf)

Description

Title: Optimizing performance under thermal and power constraints for HPC data centers
Author(s): Sarood, Osman
Director of Research: Kale, Laxmikant V.
Doctoral Committee Chair(s): Kale, Laxmikant V.
Doctoral Committee Member(s): de Supinski, Bronis; Garzaran, Maria J.; Abdelzaher, Tarek F.
Department / Program: Computer Science
Discipline: Computer Science
Degree Granting Institution: University of Illinois at Urbana-Champaign
Degree: Ph.D.
Genre: Dissertation
Subject(s): green computing; load balancing; energy efficiency; High performance computing (HPC) application; power constraint; power cap; Performance optimization; thermal constraint; temperature aware load balancing; frequency aware load balancing; fault tolerance; improving reliability
Abstract: Energy, power and resilience are the major challenges that the HPC community faces in moving to larger supercomputers. Data centers worldwide consumed energy equivalent to 235 billion kWh in 2010. A significant portion of that energy and power consumption is devoted to cooling. This thesis proposes a scheme based on a combination of limiting processor temperatures using Dynamic Voltage and Frequency Scaling (DVFS) and frequency-aware load balancing that reduces cooling energy consumption and prevents hot spot formation.

Recent reports have expressed concern that reliability at the exascale level could degrade to the point where failures become a norm rather than an exception. HPC researchers are focusing on improving existing fault tolerance protocols to address these concerns. Research on improving hardware reliability has also been making progress independently. A second component of this thesis tries to bridge this gap and explore the potential of combining both software and hardware aspects towards improving reliability of HPC machines.

Finally, the 10 MW consumption of present-day HPC systems is certainly becoming a bottleneck. Although energy bills will significantly increase with machine size, power consumption is a hard constraint that must be addressed. Intel’s Running Average Power Limit (RAPL) toolkit is a recent feature that enables power capping of CPU and memory subsystems on modern hardware. The ability to constrain the maximum power consumption of the subsystems below the vendor-assigned Thermal Design Point (TDP) value allows us to add more nodes in an overprovisioned system while ensuring that the total power consumption of the data center does not exceed its power budget. The final component of this thesis proposes an interpolation scheme that uses an application profile to optimize the number of nodes and distribution of power between CPU and memory subsystems that minimizes execution time under a strict power budget. We also present a resource management scheme including a scheduler that uses CPU power capping, hardware overprovisioning, and job malleability to improve the throughput of a data center under a strict power budget.
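
As a rough illustration of the overprovisioning trade-off described in the abstract, the sketch below searches over CPU and memory power caps and, for each pair, runs on as many nodes as the budget allows. It is not the interpolation scheme from the thesis: the brute-force search, the function names, the cap values, and the toy execution-time model are all assumptions made for this example, and the search further assumes that more nodes never slow the application down.

```python
import math
from itertools import product


def toy_time(nodes, cpu_w, mem_w):
    """Hypothetical execution-time model (an assumption, not measured data).

    Per-node speed grows with the square root of CPU power above a 20 W
    floor; memory power is ignored in this toy model.
    """
    return 1e4 / (nodes * math.sqrt(cpu_w - 20))


def best_configuration(budget_w, cpu_caps_w, mem_caps_w, max_nodes, time_model):
    """Return (time, nodes, cpu_cap, mem_cap) minimizing time under budget_w.

    For each cap pair, use as many nodes as the power budget and the
    machine size (max_nodes) allow, then keep the fastest configuration.
    """
    best = None
    for cpu_w, mem_w in product(cpu_caps_w, mem_caps_w):
        per_node_w = cpu_w + mem_w
        nodes = min(max_nodes, int(budget_w // per_node_w))
        if nodes < 1:
            continue  # this cap pair does not fit the budget even on one node
        t = time_model(nodes, cpu_w, mem_w)
        if best is None or t < best[0]:
            best = (t, nodes, cpu_w, mem_w)
    return best


# Overprovisioned machine: 100 nodes installed, 1000 W budget.
overprovisioned = best_configuration(1000, [40, 60, 80], [10], 100, toy_time)
# Machine sized for the highest caps: only 11 nodes installed.
conventional = best_configuration(1000, [40, 60, 80], [10], 11, toy_time)
```

Under this toy model the overprovisioned machine does best with 20 nodes capped at 40 W CPU power each, while the 11-node machine does best at the 80 W cap. A real application profile could shift the optimum either way.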
Issue Date: 2014-05-30
URI: http://hdl.handle.net/2142/49478
Rights Information: Copyright 2014 Osman Sarood
Date Available in IDEALS: 2014-05-30
Date Deposited: 2014-05