Files in this item

File: ECE499-Sp2019-cui.pdf (3MB)
Format: PDF (application/pdf)
Access: Restricted to U of Illinois
Description: (no description provided)

Description

Title: Understanding and improving availability of reliable distributed storage systems
Author(s): Cui, Shengkun
Contributor(s): Jha, Saurabh; Kalbarczyk, Zbigniew
Subject(s): Distributed File System; Distributed Storage System; Lustre File System; Blue Waters; Failure Characterization; Data Analysis; Machine Learning; Long Short-term Memory; Recurrent Neural Network; System Probing; Failure Detection; Failure Prediction
Abstract: From genomic sequencing to weather forecasting, high-performance computing (HPC) systems have a profound impact on scientific breakthroughs and people’s everyday lives. Failures in an HPC environment can cause partial or system-wide outages that degrade application performance and waste computational resources. Recent studies on the availability and reliability of HPC systems have shown that storage system failures are one of the major limiting factors for achieving high system utility. However, there is limited understanding of storage system failures, their propagation, and their impact on application performance. Using statistical analysis and machine learning techniques, we characterize I/O failures in a distributed storage system and their impact on applications. The target is the storage system of Blue Waters, a petascale supercomputer at the University of Illinois at Urbana-Champaign that runs the Lustre file system. Driven by the characterization results, we use a Long Short-Term Memory (LSTM) network, a type of recurrent neural network, to support runtime detection and localization of failures at per-storage-server granularity. In this thesis, we present an overview of the project; the Blue Waters storage system architecture and specifications; background on the Lustre file system; a failure characterization of the Blue Waters storage system based on NCSA maintenance logs, storage server logs, and Quality of Service (QoS) measurements; and the machine learning models for runtime failure detection. We also include key algorithms for data cleaning, processing, and analysis, and a performance evaluation of the machine learning model for runtime failure detection. Furthermore, we present an extension of our study: using the model we developed for failure prediction.
Issue Date: 2019-05
Genre: Other
Type: Text
URI: http://hdl.handle.net/2142/104004
Date Available in IDEALS: 2019-06-13

