Files in this item

File: KALIM-DISSERTATION-2020.pdf (6MB), Restricted to U of Illinois
Description: (no description provided)
Format: application/pdf (PDF)
Description

Title: Satisfying service level objectives in stream processing systems
Author(s): Kalim, Faria
Director of Research: Gupta, Indranil
Doctoral Committee Chair(s): Gupta, Indranil
Doctoral Committee Member(s): Nahrstedt, Klara; Xu, Tianyin; Ananthanarayanan, Ganesh
Department / Program: Computer Science
Discipline: Computer Science
Degree Granting Institution: University of Illinois at Urbana-Champaign
Degree: Ph.D.
Genre: Dissertation
Subject(s): Stream Processing; Service Level Objectives; Bottlenecks; Distributed Systems
Abstract: An increasing number of real-world applications today consume massive amounts of data in real time to produce up-to-date results. These applications include social media sites that show top trends and recent comments, streaming video analytics that identify traffic patterns and movement, and jobs that process ad pipelines. This has led to the proliferation of stream processing systems that process such data to produce real-time results. Because these applications must produce results quickly, users often wish to impose performance requirements on stream processing jobs, in the form of service level objectives (SLOs) such as producing results within a specified deadline or at a specified throughput. For example, an application that identifies traffic accidents can have a tight latency SLO, e.g., results must be produced within a second of receiving a video sequence, as paramedics may need to be informed immediately. A social media site could have a throughput SLO requiring that top trends be updated each minute with all input received. Satisfying job SLOs is a hard problem that requires tuning various deployment parameters of these jobs. The problem is made more complex by challenges such as 1) job input rates that are highly variable over time, e.g., more traffic can be expected during the day than at night; 2) components in a job's deployed structure that are transparent to the job developer, who understands only the application-level business logic of the job; and 3) deployment environments that differ per job, e.g., a cloud infrastructure vs. a local cluster. To handle such challenges and ensure that SLOs are always met, developers often over-allocate resources to jobs, thus wasting resources. In this thesis, we show that SLO satisfaction can be achieved by resolving (i.e., preventing or mitigating) bottlenecks in key components of a job's deployed structure.
Bottlenecks occur when tasks in a job are not allocated sufficient resources (CPU, memory, or disk); when tasks are assigned to machines in a way that does not preserve locality and causes unnecessary message passing over the network; or when there are too few tasks to process the job's input. We have built three systems that tackle the challenge of satisfying the SLOs of stream processing jobs that face combinations of these bottlenecks in various environments. First, we have developed Henge, a system that achieves SLO satisfaction for stream processing jobs deployed on a multi-tenant cluster. As job input rates change dynamically, Henge makes cluster-level resource allocation decisions to continually meet jobs' SLOs in spite of limited cluster resources. Second, we have developed Meezan, a system that relieves new users of stream processing of the burden of finding the ideal resource allocation, in terms of performance and cost, for jobs deployed on commercial cloud platforms. When a user submits a job, Meezan presents a spectrum of throughput SLOs for it, where the most performant choice is associated with the highest resource usage and consequently the highest cost, and vice versa. Finally, in collaboration with Twitter, we have built Caladrius, which enables users to model and predict how job input rates may change in the future. This allows Caladrius to preemptively scale a job out when it anticipates a high workload, preventing SLO misses. Henge is built atop Apache Storm, while Meezan and Caladrius are integrated with Apache Heron.
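As a minimal illustration only (this sketch and all names in it are ours, not from the dissertation), the two SLO flavors defined in the abstract, a latency deadline and a minimum throughput, could be modeled as:

```python
from dataclasses import dataclass

@dataclass
class LatencySLO:
    deadline_ms: float  # results must arrive within this many milliseconds

    def satisfied(self, observed_latency_ms: float) -> bool:
        return observed_latency_ms <= self.deadline_ms

@dataclass
class ThroughputSLO:
    min_tuples_per_s: float  # job must sustain at least this processing rate

    def satisfied(self, observed_tuples_per_s: float) -> bool:
        return observed_tuples_per_s >= self.min_tuples_per_s

# A hypothetical traffic-accident detector with the abstract's 1-second SLO:
accident_detector = LatencySLO(deadline_ms=1000)
assert accident_detector.satisfied(850)       # meets the deadline
assert not accident_detector.satisfied(1400)  # SLO miss: a signal to scale out
```

In this reading, a system like Henge or Caladrius would periodically evaluate such predicates against observed job metrics and adjust resource allocations when they fail.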
Issue Date: 2020-07-14
Type: Thesis
URI: http://hdl.handle.net/2142/108600
Rights Information: 2020 Faria Kalim
Date Available in IDEALS: 2020-10-07
Date Deposited: 2020-08

