|Abstract:||There is a large, emerging, and commercially relevant class of applications that stands to be enabled by a significant increase in parallel computing throughput. Moreover, continued scaling of semiconductor technology allows the creation of architectures with tremendous throughput on a single chip. In this thesis, we examine the confluence of these emerging single-chip accelerators and the applications they enable. We examine the tradeoffs associated with accelerator architectures, working our way down the abstraction hierarchy of computing, starting at the application level and concluding with the physical design of the circuits.
Research into accelerator architectures is hampered by the lack of standardized, readily available benchmarks. Among the applications that accelerators stand to enable is a class we refer to as visualization, interaction, and simulation (VIS). These applications are ideally suited for accelerators because of their parallelizability and demand for high throughput. We present VISBench, a benchmark suite to serve as an experimental proxy for VIS applications. VISBench contains a sampling of applications and application kernels from traditional visual computing areas such as graphics rendering and video encoding. It also contains a sampling of emerging application areas, such as computer vision and physics simulation, which are expected to drive the development of future accelerator architectures.
We use VISBench to examine some important high-level decisions for an accelerator architecture. We propose a methodology to evaluate performance tradeoffs against chip area. We propose a memory system based on a cache-incoherent shared address space along with mechanisms to provide synchronization and communication. We also examine GPU-style SIMD execution and find that a MIMD architecture is necessary to provide strong performance per area for some applications.
We analyze area versus performance tradeoffs in architecting the individual cores. We find that a design made of small, simple cores achieves much higher throughput than a general-purpose uniprocessor. Further, we find that a limited amount of support for ILP within each core aids overall performance. We find that fine-grained multithreading improves performance, but only up to a point. We find that vector ALUs for SIMD instruction sets provide a poor performance-to-area ratio.
We propose a methodology for performing an integrated optimization of both the micro-architecture and the physical circuit design of the cores and caches. In our approach, we use statistical sampling of the design space to evaluate the performance of the micro-architecture, and RTL synthesis to characterize the area, power, and delay of the underlying circuits. This integrated methodology enables a much more powerful analysis of the performance-area and performance-power tradeoffs for the low-level micro-architecture. We use this methodology to find the optimal design points for an accelerator architecture under area constraints and power constraints. Our results indicate that more complex architectures scale well in terms of performance per area, but that the addition of a power constraint favors simpler architectures.