Files in this item



application/pdfPENG-THESIS-2015.pdf (2MB)
(no description provided)PDF


Title:Elasticity and resource aware scheduling in distributed data stream processing systems
Author(s):Peng, Boyang
Department / Program:Computer Science
Discipline:Computer Science
Degree Granting Institution:University of Illinois at Urbana-Champaign
Resource Aware Scheduling
Distributed Data Stream Processing
Abstract:The era of big data has led to the emergence of new systems for real-time distributed stream processing, e.g., Apache Storm is one of the most popular stream processing systems in industry today. However, Storm, like many other stream processing systems, lacks many important and desired features. One important feature is elasticity with clusters running Storm, i.e. change the cluster size on demand. Since the current Storm scheduler uses a naïve round robin approach in scheduling applications, another important feature is for Storm to have an intelligent scheduler that efficiently uses the underlying hardware by taking into account resource demand and resource availability when performing a scheduling. Both are important features that can make Storm a more robust and efficient system. Even though our target system is Storm, the techniques we have developed can be used in other similar stream processing systems. We have created a system called Stela that we implemented in Storm, which can perform on-demand scale-out and scale-in operations in distributed processing systems. Stela is minimally intrusive and disruptive for running jobs. Stela maximizes performance improvement for scale-out operations and minimally decrease performance for scale-in operations while not changing existing scheduling of jobs. Stela was developed in partnership with another Master’s Student, Le Xu [1]. We have created a system called R-Storm that does intelligent resource aware scheduling within Storm. The default round-robin scheduling mechanism currently deployed in Storm disregards resource demands and availability, and can therefore be very inefficient at times. R-Storm is designed to maximize resource utilization while minimizing network latency. When scheduling tasks, R-Storm can satisfy both soft and hard resource constraints as well as minimizing network distance between components that communicate with each other. The problem of mapping tasks to machines can be reduced to Quadratic Multiple 3-Dimensional Knapsack Problem, which is an NP-hard problem. However, our proposed scheduling algorithm within R-Storm attempts to bypass the limitation associated with NP-hard class of problems. We evaluate the performance of both Stela and R-Storm through our implementations of them in Storm by using several micro-benchmark Storm topologies and Storm topologies in use by Yahoo! In. Our experiments show that compared to Apache Storm’s default scheduler, Stela’s scale-out operation reduces interruption time to as low as 12.5% and achieves throughput that is 45-120% higher than Storm’s. And for scale-in operations, Stela achieves almost zero throughput post scale reduction while two other groups experience 200% and 50% throughput decrease respectively. For R-Storm, we observed that schedulings of topologies done by R-Storm perform on average 50%-100% better than that done by Storm’s default scheduler.
Issue Date:2015-04-27
Rights Information:Copyright 2015 Boyang Peng
Date Available in IDEALS:2015-07-22
Date Deposited:May 2015

This item appears in the following Collection(s)

Item Statistics