Files in this item



Gainaru_Ana.pdf (application/pdf, 486kB)


Title: Battling Failures
Author(s): Gainaru, Ana
Subject(s): Computer Science
Abstract: A large percentage of computing capacity in today's large high-performance computing systems is wasted due to failures and recoveries. The fear in our community is that future Exascale systems will fail so frequently that no useful work will be possible. My research focuses on characterizing the events generated at the hardware, system, or application level by understanding the complex correlations between different system components. This information is used to predict failures and, as a consequence, to minimize or prevent their effects on running applications. The image represents an overview of the overall analysis process: monitoring applications and their performance, modeling the system and the way anomalies propagate between components, analyzing the current state, diagnosing errors, and predicting failures. Today's supercomputers are too large and complex for anyone to manually inspect or visualize all the events that occur during an application's execution. With tools like this, which adapt and learn as the system experiences new events, applications can take preventive actions that increase their efficiency and, as a consequence, allow them to complete their tasks even on future Exascale machines.
Credits: Images provided by the National Center for Supercomputing Applications Visualization Laboratory.
Issue Date: 2014-05
Rights Information: Copyright 2014 Ana Gainaru
Date Available in IDEALS: 2014-05-16
