Files in this item

File: GUPTA-THESIS-2017.pdf (1MB)
Description: (no description provided)
Format: application/pdf (PDF)

Description

Title: Exploration of fault tolerance in Apache Spark
Author(s): Gupta, Akshun
Advisor(s): Gupta, Indranil
Department / Program: Computer Science
Discipline: Computer Science
Degree Granting Institution: University of Illinois at Urbana-Champaign
Degree: M.S.
Genre: Thesis
Subject(s): Apache Spark
Fault tolerance
Abstract: This thesis explores two techniques for providing fault tolerance for batch processing in Apache Spark and evaluates the benefits and challenges of each approach. Apache Spark is a cluster computing system comprising three main components: the driver program, the cluster manager, and the worker nodes. Spark already tolerates the loss of worker nodes, and external tools provide fault tolerance for the cluster manager; for example, a cluster manager deployed using Apache Mesos is fault tolerant. Spark does not, however, support driver fault tolerance for batch processing. The driver program stores critical state of the running job by maintaining oversight of the workers; failure of the driver program loses all oversight of the worker nodes and is equivalent to catastrophic failure of the entire Spark application. In this thesis, we explore two approaches to achieving driver fault tolerance in Apache Spark for batch processing, enabling guaranteed execution of long-running critical jobs and consistent performance while maintaining high uptime. The first approach serializes the critical state of the driver program and relays that state to passive processors; upon failure, a secondary processor loads this state and resumes computation. The second approach narrows the scope of the problem and synchronizes block information between primary and secondary drivers so that the locations of cached aggregated data are not lost after a primary driver failure; losing these locations leaves the job in a state from which computation cannot be resumed. Both approaches propose considerable changes to the Apache Spark architecture in order to support high availability of batch processing jobs.
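The first approach described in the abstract — serializing the driver's critical state and relaying it to a passive processor that can resume after a failure — follows a general checkpoint-and-failover pattern. The sketch below is a minimal, hypothetical illustration of that pattern in plain Python (it is not the thesis's actual implementation, and the `DriverState` fields are invented stand-ins for the driver's bookkeeping):

```python
import pickle

class DriverState:
    """Toy stand-in for the driver's critical bookkeeping
    (completed stages, cached-block locations, etc.)."""
    def __init__(self):
        self.completed_stages = []
        self.block_locations = {}  # block id -> worker node

def checkpoint(state: DriverState) -> bytes:
    # Serialize the state so it can be relayed to a passive processor.
    return pickle.dumps(state)

def resume(snapshot: bytes) -> DriverState:
    # A secondary driver loads the snapshot and continues from it.
    return pickle.loads(snapshot)

# Primary driver makes progress, then checkpoints its state.
primary = DriverState()
primary.completed_stages.append("stage-0")
primary.block_locations["rdd_1_0"] = "worker-3"
snapshot = checkpoint(primary)

# Primary fails; the standby resumes from the last snapshot,
# recovering both stage progress and cached-block locations.
standby = resume(snapshot)
print(standby.completed_stages)             # ['stage-0']
print(standby.block_locations["rdd_1_0"])   # worker-3
```

The second approach in the abstract narrows this to synchronizing only the block-location map (the `block_locations` dictionary above is the analogous piece of state) between primary and secondary drivers, rather than the full driver state.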
Issue Date: 2017-12-06
Type: Text
URI: http://hdl.handle.net/2142/99383
Rights Information: Copyright 2017 Akshun Gupta
Date Available in IDEALS: 2018-03-13
Date Deposited: 2017-12

