Files in this item

File: THAKORE-DISSERTATION-2020.pdf (2MB), Restricted Access
Description: (no description provided)
Format: PDF (application/pdf)

Description

Title: Improving reliability and security monitoring in enterprise and cloud systems by leveraging information redundancy
Author(s): Thakore, Uttam
Director of Research: Sanders, William H
Doctoral Committee Chair(s): Sanders, William H
Doctoral Committee Member(s): Gupta, Indranil; Nahrstedt, Klara; Ranchal, Rohit; Ramasamy, Harigovind V
Department / Program: Computer Science
Discipline: Computer Science
Degree Granting Institution: University of Illinois at Urbana-Champaign
Degree: Ph.D.
Genre: Dissertation
Subject(s): monitoring; reliability; security; compliance audit; cloud computing; incident detection; incident response
Abstract: As computing has become critical to all areas of modern life, the need to ensure the security and reliability of the underlying information technology infrastructures is greater than ever before. Large-scale enterprise and cloud systems, which form the backbone for the majority of computing activity, consist of many components and services interacting in complex and sometimes unpredictable ways. As such systems have grown in size, scale, and complexity, they have become increasingly difficult to protect against security and reliability incidents, resulting over recent years in ever more frequent service disruptions, failures, and data breaches, the financial and societal implications of which are massive. System owners have a strong desire to prevent such incidents. Incident detection and response and compliance audit are the two primary mechanisms by which organizations enforce reliability and security policies and make their systems more resilient. Both the academic and professional communities have focused considerable attention on developing techniques to improve incident detection, incident root cause analysis, and compliance auditing, often with little consideration for the cost of the monitoring that is required to support them. Furthermore, as the scale and complexity of systems have increased, so too have the scale and complexity of their monitoring infrastructures. Monitors can fail or be compromised, and monitor data must be selectively collected to avoid exceeding storage and processing limits. Consequently, it has become increasingly important to explicitly consider the efficiency, efficacy, and resiliency of monitoring systems when one is designing large-scale enterprise and cloud systems.

In this dissertation, we address inefficiencies and inadequacies in reliability and security monitoring in enterprise and cloud systems by leveraging redundancy of information across diverse monitors. In particular, we use the redundancy of data generated by different monitors 1) to facilitate more effective and efficient use of the data in meeting reliability and security objectives, and 2) to improve the resiliency of the monitoring infrastructure itself against failures and attacks.

First, we present a framework for simplifying the complexity of data analysis for incident response in enterprise cloud systems. As a foundation for the framework, we define a general taxonomy for fields within monitor data that administrators can use to label both structured and unstructured components of data. We then present a method to automatically extract time series features based on labels from our taxonomy, remove uninformative features, and reduce the overall number of features by clustering together related and redundant features. We apply our framework to logs and metrics collected during reliability incidents from all levels of an experimental platform-as-a-service cloud at a large computing organization, and demonstrate that our approach enables efficient coordinated analysis of both metric data and log data. Such analysis typically presents a challenge to cloud support engineers, but can identify meaningful relationships between features that can aid in incident response.

Next, we present a systematic methodology that enables system administrators to design monitoring systems that are resilient to missing data. We develop a model-based approach to quantify the resilience of a system's monitoring and incident detection infrastructure design against missing data, using which we develop a method to find monitor deployments that maximize resilience subject to monitoring cost constraints. We illustrate how our approach can be applied to production systems by using a datacenter network case study model based on monitors employed in production systems, and we evaluate its scalability by using randomly generated models of varying sizes and structures. We compare our approach to the current state of the art and demonstrate that our approach consistently finds monitor deployments that are more resilient under the same constraints.

Finally, we address the inefficiencies faced by a cloud service provider (CSP) during audit evidence collection as a result of a poor understanding of evidence requirements. We motivate our analysis by developing a taxonomic framework for understanding the causes of and potential solutions to uncertainty in audit. We present a model-driven method to learn evidence sufficiency requirements directly from historical audit records. We then apply our cost-optimal resilient monitoring approach to the evidence sufficiency model to determine an efficient evidence collection strategy for the CSP. We apply our approach to the historical audit records from an enterprise infrastructure-as-a-service cloud system at a large computing organization and demonstrate how use of our approach could have enabled more efficient evidence collection.

We believe that our work clearly demonstrates the need to critically examine the resiliency and efficiency of monitoring infrastructures in enterprise and cloud systems. This dissertation presents solutions to specific challenges faced by practitioners when monitoring their systems for reliability and security objectives, but our work addresses only part of the larger problem space of resilient monitoring system design. We hope that this dissertation paves the way for future research that focuses on the resilience of the monitoring infrastructure itself.

Issue Date: 2020-12-03
Type: Thesis
URI: http://hdl.handle.net/2142/109623
Rights Information: Copyright 2020 Uttam Thakore
Date Available in IDEALS: 2021-03-05
Date Deposited: 2020-12

