Files in this item

File: LIU-DISSERTATION-2019.pdf (7MB), Restricted to U of Illinois
Description: (no description provided)
Format: application/pdf (PDF)
Description

Title: Energy efficient computing exploiting data similarity and computation redundancy
Author(s): Liu, Zhenhong
Director of Research: Kim, Nam Sung
Doctoral Committee Chair(s): Hwu, Wen-Mei
Doctoral Committee Member(s): Torrellas, Josep; Shanbhag, Naresh R.
Department / Program: Electrical & Computer Eng
Discipline: Electrical & Computer Engr
Degree Granting Institution: University of Illinois at Urbana-Champaign
Degree: Ph.D.
Genre: Dissertation
Subject(s): Approximate computing; GPU architecture; Memoization; Computation redundancy
Abstract: Applications in various fields, such as machine learning, scientific computing, and signal/image processing, need to deal with real-world input datasets. Such input datasets are usually discrete samples of slow-changing, continuous data of physical phenomena, like temperature maps and images. Due to the continuous nature of the physical phenomena, these datasets often contain data points with similar or even identical values. This results in repeated operations performed on the same or similar data points, i.e., redundant computation. Redundant computation can be exploited to improve the energy/power efficiency and performance of processors, especially as the benefits from process technology scaling and power scaling keep diminishing.

This dissertation first proposes a cost-effective generalized scalar execution architecture for GPUs, called G-Scalar, which exploits the redundant computation performed on identical data. G-Scalar offers two key advantages over prior architectures that support scalar execution only for non-divergent arithmetic/logic instructions. First, G-Scalar is more power efficient, as it can also support scalar execution of divergent and special-function instructions, whose fraction in contemporary GPU applications has notably increased. Second, G-Scalar is less expensive, as it can share most of its hardware resources with register value compression, whose adoption has been strongly promoted to reduce the high power consumption of accessing the large register file. Compared with the baseline and previous scalar architectures, G-Scalar improves power efficiency by 24% and 15%, respectively, at a negligible cost. The Lock and Load (LnL) architecture is then proposed to extend the coverage to redundant computation on similar data by enabling approximate computing.
In the LnL architecture, approximate computing is triggered by the similarity of values returned by load instructions, and approximation is then applied to the annotated code region following those loads. This design reduces the overhead of checking the eligibility of approximation for every instruction, and allows more sophisticated techniques to be deployed both for checking eligibility and for approximating the output values of all skipped threads at the end. LnL is further enhanced to fuse the same approximated instructions from multiple warps into a single instruction, exploiting the fact that only a subset of threads in approximated warps are executed, leaving many execution lanes unused by the skipped threads. This not only improves performance but also reduces energy consumption, as it reduces the number of fetched, decoded, scheduled, and executed instructions. Our experiments show that LnL can improve energy efficiency and performance by 62% and 23%, respectively.

Finally, we propose AxMemo to exploit computation redundancy on CPUs. Inspired by LnL, AxMemo focuses on memoizing relatively large blocks of code with a variable number of inputs, whereas existing memoization techniques mostly replace costly floating-point operations with a limited number of inputs with memory lookups. Since AxMemo aims to replace long sequences of instructions with a few lookup operations, it alleviates the von Neumann and execution overheads of passing those instructions through the processor pipeline altogether. To address the challenge of handling varying numbers of inputs, we develop a novel use of Cyclic Redundancy Checking (CRC) to hash the inputs and use the hash as a lookup tag. This enables AxMemo to efficiently memoize relatively large code regions with variable input sizes and types using the same underlying hardware.
Our experiments show that AxMemo offers a 2.82× speedup and a 63% energy reduction, with a mere 0.2% quality loss averaged across ten benchmarks. These benefits come with an area overhead of just 2.08%.
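The CRC-based memoization idea behind AxMemo can be illustrated with a minimal software sketch. All names here are hypothetical, and the dissertation realizes the lookup table in hardware rather than software: the inputs of a code region, whatever their count and type, are serialized and hashed with CRC-32; the hash serves as the lookup tag, and on a hit the entire region is skipped.

```python
import struct
import zlib

# Hypothetical software analogue of an AxMemo-style lookup table.
# The dissertation implements this in hardware; names are illustrative.
_memo_table = {}

def memoize_region(inputs, compute):
    """Memoize a code region with a variable number of inputs.

    Serializes the inputs, hashes them with CRC-32, and uses the
    hash as a lookup tag; on a miss, executes `compute` and caches.
    """
    raw = b"".join(struct.pack("<d", float(x)) for x in inputs)
    tag = zlib.crc32(raw)
    if tag in _memo_table:            # hit: skip the whole region
        return _memo_table[tag]
    result = compute(*inputs)         # miss: execute and remember
    _memo_table[tag] = result
    return result

def expensive_kernel(a, b, c):
    # Stand-in for a long sequence of instructions.
    return (a * b + c) ** 0.5

print(memoize_region((2.0, 3.0, 10.0), expensive_kernel))  # computed: 4.0
print(memoize_region((2.0, 3.0, 10.0), expensive_kernel))  # cache hit: 4.0
```

Note that a CRC collision would return the wrong cached value; in an approximate-computing setting such rare mismatches are tolerated as part of the bounded quality loss.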
Issue Date: 2019-04-11
Type: Text
URI: http://hdl.handle.net/2142/104973
Rights Information: Copyright 2019 Zhenhong Liu
Date Available in IDEALS: 2019-08-23
Date Deposited: 2019-05

