Abstract: From genomic sequencing to weather forecasting, high-performance computing systems (HPCs)
have profound impacts on scientific breakthroughs and people’s everyday lives. Failures in an HPC
environment can result in partial or system-wide outages, degrading application performance and
wasting computational resources. Recent studies on the availability and reliability of HPC
systems have shown that storage system failures are one of the major limiting factors for achieving high
system utility. However, there is limited understanding of storage system failures, their propagation,
and their impact on application performance.
Using statistical analysis and machine learning techniques, we characterize I/O failures in a
distributed storage system and their impact on applications. The target is the storage system of
Blue Waters, a petascale supercomputer at the University of Illinois at Urbana-Champaign,
which runs the Lustre file system. Driven by the characterization results, we use a Long Short-Term
Memory (LSTM) network, a type of recurrent neural network, to support runtime detection and
localization of failures at per-storage-server granularity.
In this thesis, we present an overview of the project, the Blue Waters storage system architecture and
specifications, background on the Lustre file system, a characterization of Blue Waters storage system
failures based on NCSA Maintenance Logs, Storage Server Logs, and Quality of Service (QoS)
measurements, and the machine learning models for runtime failure detection. We also include key
algorithms for data cleaning, processing, and analysis, and performance evaluation of the machine learning
model for runtime failure detection. Furthermore, we present an extension of our study that applies the
model we developed to failure prediction.