Files in this item



application/pdfRozier_Eric.pdf (885kB)
(no description provided)PDF


Title:Understanding the fault-tolerance properties of large-scale storage systems
Author(s):Rozier, Eric
Director of Research:Sanders, William H.
Doctoral Committee Member(s):Agha, Gul A.; Levinson, Stephen E.; Viswanathan, Mahesh; Zhou, Pin
Department / Program:Computer Science
Discipline:Computer Science
Degree Granting Institution:University of Illinois at Urbana-Champaign
Subject(s):storage systems
Abstract:Modern storage systems continue to increase in scale and complexity as they attempt to meet the increasing storage needs of our society. Additionally, increased requirements to comply with government regulation and consumer expectations have increased the need to make data more available and reliable for longer periods of time. The design of modern and next-generation storage systems is a difficult task that requires high storage capacity and efficiency while also maintaining the data integrity. The rapid advancement of storage system technologies brings with it a level of uncertainty as to the fitness of new designs and methods for meeting the complex requirements. New technologies, like deduplication, promise improved storage efficiency, but their impact on reliability measures is unclear due to the complex relationships inherent to the systems that employ these technologies. Additionally, as systems scale up, they become subject to faults and errors that previous-generation systems may never have encountered due to the rare nature of these faults. Because of the stiffness of the represented systems, and the complex relationships involved, it can be difficult to analyze these environments correctly and efficiently. In this dissertation, we propose a method to analyze storage system reliability by using component-based models coupled with realistic fault models. We solve these complex systems by identifying fault, fault propagation, and mitigation events; by identifying dependence relationships between state variables, events, and rewards; and by decomposing our model at various points during model solution to improve the efficiency of our solution while maintaining the correctness of our reward measures. In particular, we discuss building scalable component-based models of large-scale systems that employ modern reliability methods, such as RAID, and state-of-the-art storage efficiency methods such as deduplication. We present detailed fault models for these systems, including a novel model for undetected disk errors. To enable efficient solution of these models we propose a method to analyze the dependence relationships that underlie storage systems and propose a way to solve these models by identifying and exploiting these relationships when solving for reliability measures. We apply our methods to real-world systems, detail the consequences for the reliability of deduplication, and suggest and evaluate methods to improve reliability while still maintaining improved storage efficiency.
Issue Date:2012-02-06
Genre:Dissertation / Thesis
Rights Information:Copyright 2011 Eric William Davis Rozier
Date Available in IDEALS:2012-02-06
Date Deposited:2011-12

This item appears in the following Collection(s)

Item Statistics