|Abstract:||As an effective way of exploiting data parallelism in applications, the SIMD architecture has been adopted by most of today's microprocessors. Intrinsic functions and automatic compilation are the common programming methods for today's SIMD devices. However, neither method provides sufficient programmability and performance at the same time. Many issues must be addressed to generate efficient SIMD code. For example, most SIMD devices support memory accesses only on contiguous and aligned sections. Additional permutation instructions are needed for non-contiguous and/or misaligned references. Such overhead can cancel all performance benefits from SIMD computation.
This thesis proposes VINCI, the Vector I-code Novel Compilation Infrastructure. VINCI focuses on translating vector programs into efficient code for SIMD devices. Vectors in input programs can have arbitrary lengths, strides, and alignments. However, the vectors required by SIMD devices must all have the same fixed length, unit stride, and aligned addresses. VINCI employs a sequence of program transformations to convert all vectors into this specific format.
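The gap between the two vector formats can be illustrated with a small model. The following is only a sketch under stated assumptions, not VINCI's actual transformation sequence: `WIDTH`, `load_aligned`, and `lower_vector` are hypothetical names, memory is modeled as a Python list, and a SIMD register is modeled as a list of `WIDTH` lanes. It shows how a vector with an arbitrary offset, stride, and length can be expressed using only aligned, fixed-length loads plus lane selection.

```python
WIDTH = 4  # assumed number of lanes per SIMD register (hypothetical)


def load_aligned(memory, block):
    """Model an aligned SIMD load: return the WIDTH lanes of block `block`."""
    start = block * WIDTH
    chunk = memory[start:start + WIDTH]
    # Pad with zeros if the block runs past the end of memory.
    return chunk + [0] * (WIDTH - len(chunk))


def lower_vector(memory, offset, stride, length):
    """Split a strided, possibly misaligned vector into WIDTH-lane registers.

    Every memory access goes through load_aligned; choosing a lane from the
    loaded block stands in for the permutation a real device would need.
    """
    registers = []
    for first in range(0, length, WIDTH):
        lanes = []
        for i in range(first, min(first + WIDTH, length)):
            address = offset + i * stride
            block, lane = divmod(address, WIDTH)
            lanes.append(load_aligned(memory, block)[lane])
        # Pad the last register when length is not a multiple of WIDTH.
        lanes += [0] * (WIDTH - len(lanes))
        registers.append(lanes)
    return registers


memory = list(range(16))
result = lower_vector(memory, offset=1, stride=2, length=6)
print(result)
```

In this model, each lane selection corresponds to permutation work on real hardware, which is why reducing the number of permutations matters for performance.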
VINCI also includes several optimization algorithms, of which the optimization of data permutations is the most important. By unifying all forms of data permutation into a single explicit representation, this algorithm reduces the number of data permutations in vector programs by propagating them across statements and merging them whenever possible. In addition, an efficient code generation algorithm generates native permutation instructions from vector permutation operations.
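The idea of propagating and merging permutations can be sketched in a few lines. This is an illustrative model only, not the thesis's algorithm or representation: a permutation is assumed to be an index list `p` with `out[i] = v[p[i]]`, and the helper names `apply_perm`, `merge`, and `add` are hypothetical.

```python
def apply_perm(p, v):
    """Apply permutation p to vector v: result[i] = v[p[i]]."""
    return [v[j] for j in p]


def merge(p, q):
    """Return one permutation equivalent to applying p first, then q."""
    return [p[j] for j in q]


def add(a, b):
    """Element-wise addition, standing in for any lane-wise operation."""
    return [x + y for x, y in zip(a, b)]


v = [10, 20, 30, 40]
swap_pairs = [1, 0, 3, 2]
swap_halves = [2, 3, 0, 1]

# Merging: two permutation steps collapse into a single one.
two_steps = apply_perm(swap_halves, apply_perm(swap_pairs, v))
one_step = apply_perm(merge(swap_pairs, swap_halves), v)

# Propagation: the same permutation on both operands of a lane-wise
# operation can be moved past the operation, so it is applied once.
a, b = [1, 2, 3, 4], [5, 6, 7, 8]
before = add(apply_perm(swap_pairs, a), apply_perm(swap_pairs, b))
after = apply_perm(swap_pairs, add(a, b))

# Elimination: a merged permutation equal to the identity can be dropped.
cancelled = merge(swap_pairs, swap_pairs)
print(two_steps, one_step, before, after, cancelled)
```

Moving a permutation past a lane-wise operation is what allows it to meet another permutation, and the merged result may be cheaper than the two originals or vanish entirely.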
In addition, many common compiler analyses and optimizations, such as def-use analysis and copy propagation, were extended to the vector representation and included in VINCI. Two domain-specific optimization techniques for DSP programs were likewise extended to vector programs. These optimizations are necessary to deliver the final performance on SIMD devices.
VINCI was implemented in the HiLO compiler, an internal compiler used in SPIRAL. Experiments were conducted on two platforms, VMX and SSE2. The test applications include both automatically generated programs and manually written kernels. The results show that up to 77% of the permutation instructions are eliminated and, as a result, the average performance improvement is 48% on VMX and 68% on SSE2. For several applications, near-perfect speedups were achieved on both platforms.