Files in this item

Files:Rini_Kaushik.pdf (3MB)
Description:(no description provided)
Format:PDF (application/pdf)

Description

Title:GreenHDFS: data-centric and cyber-physical energy management system for big data clouds
Author(s):Kaushik, Rini
Director of Research:Nahrstedt, Klara
Doctoral Committee Chair(s):Nahrstedt, Klara
Doctoral Committee Member(s):Arpaci-Dusseau, Remzi; Abdelzaher, Tarek F.; Kale, Laxmikant V.
Department / Program:Computer Science
Discipline:Computer Science
Degree Granting Institution:University of Illinois at Urbana-Champaign
Degree:Ph.D.
Genre:Dissertation
Subject(s):Hadoop Distributed File System (HDFS)
Big Data
Big Data Analytics
Energy
Energy Conservation
Cooling
Cooling Energy Costs Savings
Total Cost of Ownership
Server Operating Energy Costs Savings
Hadoop
Self-Adaptive
Predictive
Data-Centric
Cyber-Physical
File System
Abstract:The explosion in Big Data has led to a rapid increase in the popularity of Big Data analytics. With the increase in the sheer volume of data that needs to be stored and processed, the storage and computing demands of Big Data analytics workloads are growing exponentially, leading to a surge in extremely large-scale Big Data cloud platforms and resulting in burgeoning energy costs and environmental impact. The sheer size of Big Data gives it significant data-movement inertia, which, coupled with the network bandwidth constraints inherent in the cloud's cost-efficient, scale-out economic paradigm, makes data locality a necessity for high performance in Big Data environments. Instead of sending data to the computations, as has been the norm, computations are sent to the data to take advantage of the higher data-local performance. The state-of-the-art run-time energy management techniques are job-centric in nature and rely on thermal- and energy-aware job placement, job consolidation, or job migration to derive energy cost savings. Unfortunately, the data-locality requirement of the compute model limits the applicability of these techniques: they are inherently data-placement-agnostic and provide energy savings only at a significant performance cost in the Big Data environment. Big Data analytics clusters have moved away from the shared network-attached storage (NAS) or storage area network (SAN) model to a completely clustered, commodity storage model that allows a direct access path between the storage servers and the clients in the interest of high scalability and performance. The underlying storage system distributes file chunks and replicas across the servers for high performance, load-balancing, and resiliency. However, with files distributed across all servers, any server may be participating in the reading, writing, or computation of a file chunk at any time. 
Such a storage model complicates scale-down-based power management by making it hard to generate significant periods of idleness in Big Data analytics clusters. GreenHDFS is based on the observation that data needs to be a first-class object in energy management in Big Data environments to allow high data access performance. GreenHDFS takes a novel data-centric, cyber-physical approach to reduce compute (i.e., server) and cooling operating energy costs. On the physical side, GreenHDFS is cognizant that all-servers-are-not-alike in the Big Data analytics cloud and is aware of the variations in the thermal profiles of the servers. On the cyber side, GreenHDFS is aware that all-data-is-not-alike and knows the differences in the data semantics (i.e., computational job arrival rate, size, popularity, and evolution life spans) of the Big Data placed in the Big Data analytics cloud. Armed with this cyber-physical knowledge, and coupled with its insights, predictive data models, and run-time information, GreenHDFS performs proactive, cyber-physical, thermal- and energy-aware file placement and data-classification-driven scale-down, which implicitly results in thermal- and energy-aware job placement in the Big Data analytics cloud compute model. GreenHDFS's data-centric energy- and thermal-management approach reduces energy costs without any associated performance impact, allows scale-down of a subset of servers in spite of the unique challenges that the Big Data analytics cloud poses to scale-down, and ensures the thermal reliability of the servers in the cluster. GreenHDFS evaluation results with one-month-long real-world traces from a production Big Data analytics cluster at Yahoo! 
show up to a 59% reduction in cooling energy costs while performing 9x better than the state-of-the-art data-agnostic cooling techniques, up to a 26% reduction in server operating energy costs, and a significant reduction in the total cost of ownership (TCO) of the Big Data analytics cluster. GreenHDFS provides a software-based mechanism to increase energy proportionality even with non-energy-proportional server components. Free cooling, or air- and water-side economization (i.e., the use of outside air or natural water resources to cool the data center), is gaining popularity as it can result in significant cooling energy cost savings. There is also a drive towards increasing the cooling set point of the cooling systems to make them more efficient. If the ambient temperature of the outside air or the cooling set point temperature is high, the inlet temperatures of the servers rise, which reduces their ability to dissipate computational heat and results in an increase in server temperatures. The servers are rated to operate safely only within a certain temperature range, beyond which failure rates increase. GreenHDFS considers the differences in the thermal-reliability-driven load-tolerance upper bound of the servers in its predictive thermal-aware file placement and places file chunks in a manner that ensures that server temperatures do not exceed the temperature upper bound. Thus, by ensuring thermal reliability at all times and by lowering the overall temperature of the servers, GreenHDFS enables data centers to operate in the energy-saving economizer mode for longer periods of time and also enables an increase in the cooling set point. A substantial number of data centers still rely fully on traditional air-conditioning. 
These data centers cannot always be retrofitted with economizer modes or hot- and cold-aisle air containment, because incorporating the economizer and air containment may require space for ductwork and heat exchangers, which may not be available in the data center. Existing data centers may also not be favorably located geographically; air-side economization is more viable in geographic locations where ambient air temperatures are low for most of the year and humidity is in the tolerable range. GreenHDFS provides a software-based approach to enhance the cooling efficiency of such traditional data centers: it lowers the overall temperature in the cluster, makes the thermal profile much more uniform, and reduces hot air recirculation, resulting in lower cooling energy costs.
Issue Date:2013-02-03
URI:http://hdl.handle.net/2142/42373
Rights Information:Copyright 2012 Rini Kaushik
Date Available in IDEALS:2013-02-03
Date Deposited:2012-12

