Abstract: Instruction Level Distributed Processing (ILDP) is a microarchitectural technique that distributes execution, at the granularity of individual instructions, among a number of small, independent processing elements (PEs). In aggregate, it thereby matches the execution resources of a large-window, wide-issue superscalar, but it circumvents the clock-, thermal-, and power-related problems encountered when scaling conventional (monolithic) designs to the same dimensions. However, its distributed mode of operation introduces a number of constraints on the ability to dynamically find and exploit instruction-level parallelism (ILP). Though a large body of research has sought to overcome those constraints, no study has, to date, demonstrated instructions-per-clock (IPC) performance close to that of an equivalent monolithic machine. Neither, however, has any study conclusively demonstrated that IPC losses must be incurred.
This dissertation undertakes a detailed exploration of ILDP. It seeks to understand how, and to what extent, a distributed mode of operation limits the inherent performance potential of a variety of designs. Where that potential is found to be good, it further seeks to develop a set of practical schemes for exploiting it to the fullest. Its main contribution is to adopt a more general, principled approach than prior work in this area, and thereby to expose a hitherto unknown set of very basic factors at work in ILDP machines. The practical schemes it develops, which derive directly from an understanding of those factors, yield the best ILDP performance figures published to date.
The ILDP design space is divided into two parts, the first comprising machines that distribute execution among in-order PEs; the second, among out-of-order PEs. These two classes, called, respectively, the laned and the clustered machines, have very different performance potential. The laned machines, though appealing from a clock and power point of view, are fundamentally limited in their ability to exploit ILP. This is because their out-of-order execution capabilities, now reduced to "slip" among their in-order PEs, are too coarse-grained to deal effectively with the complex and irregular manner in which ILP presents itself in the instruction stream. By means of an abstract model of laned machine operation, it is shown that matching the IPC performance of a monolithic machine demands either an impossibly complex mechanism for distributing instructions among PEs, or so many PEs that communication among them would overwhelm the machine.
The clustered machines, by contrast, are shown to be inherently capable of matching monolithic machine performance, the penalties imposed by distributed execution notwithstanding. Key to exploiting that potential is knowledge of the critical path through a program. This can be used to achieve a judicious allocation of execution resources to instructions, with performance-critical instructions being shielded from the distributed machine's execution constraints; only the least important instructions, which can tolerate some delay, need be exposed to those constraints. This dissertation develops several novel critical-path-aware schemes, and shows that they can deliver performance within a few percent of a monolithic machine. It further shows that many aspects of those schemes are stable, both within and across runs of a program, a property that makes them amenable to implementation in a static (offline) context.