Files in this item



8924954.pdf (application/pdf, 6MB), restricted to U of Illinois


Title: Performance evaluation of vector machine architectures
Author(s): Tang, Ju-ho
Doctoral Committee Chair(s): Davidson, Edward S.
Department / Program: Electrical and Computer Engineering
Discipline: Electrical Engineering
Degree Granting Institution: University of Illinois at Urbana-Champaign
Subject(s): Engineering, Electronics and Electrical
Abstract: Vector machines are well known for their high peak performance, but the delivered performance varies greatly over different workloads and depends strongly on compiler optimizations. Recently it has been claimed that several horizontal superscalar architectures, e.g., VLIW and polycyclic architectures, provide a more balanced performance across a wider range of scientific workloads than do vector machines. The purpose of this research is to study the performance of register-register vector processors, such as Cray supercomputers, as a function of their architectural features, scheduling schemes, compiler optimization capabilities, and program parameters. The results of this study also provide a basis for comparing vector machines with horizontal superscalar machines.
An evaluation methodology, based on timing parameters, bottlenecks, and run time bounds, is developed. Cray-1 performance is degraded by the multiple memory loads of index-misaligned vectors and the inability of the Cray Fortran Compiler (CFT) to produce code that hits all the chain slot times. The Cray X-MP processor has three memory ports and supports flexible chaining, but its vector register reservation scheme poses a problem for the current CFT compilers, thereby reducing execution concurrency. The causes of the performance differences between two Cray Fortran compilers, CFT1.14 and CFT77(1.3), on the vectorized Livermore Fortran Kernels (LFKs) are identified, and some areas for further improvement are suggested.
The impact of chaining and two instruction scheduling schemes on one-memory-port vector supercomputers, illustrated by the Cray-1 and Cray-2, is studied. The lack of instruction chaining on the Cray-2 requires a different instruction scheduling scheme from that of the Cray-1. Situations are characterized in which simple vector scheduling can generate code that fully utilizes one functional unit for machines with chaining. Even without chaining, polycyclic scheduling guarantees full utilization of one functional unit, after an initial transient, for loops with acyclic dependence graphs.
The effectiveness of applying polycyclic vector scheduling (PVS) to the Cray-2 is compared with optimal simple vector scheduling on the Cray-1. More than 30% performance improvement on several vectorized LFKs is achieved by PVS over the current CFT77(2.0) compiler on the Cray-2. Some hardware modifications that could improve the effectiveness of applying PVS are evaluated.
Issue Date: 1989
Rights Information: Copyright 1989 Tang, Ju-ho
Date Available in IDEALS: 2011-05-07
Identifier in Online Catalog: AAI8924954
OCLC Identifier: (UMI)AAI8924954
