Home > Out-of-order Implementation |
|
The processor architecture uses Tomasulo's algorithm [Tom67]. It gives a concrete hardware architecture for resolving inter-instruction dependencies through the register file, thereby allowing out-of-order issue to the functional units. The architecture is very simple but extremely powerful and capable of resolving all dependencies through the register file. It also provides a form of register renaming that allows the simultaneous or out-of-order execution of multiple reads and writes to the same register. The algorithm (as well as numerous variations on it) has become a staple in modern high-performance CPU design.
A reorder buffer as described in [SmiP85] is used for providing a mechanism to allow instructions that generate results out-of-order to be completed in-order. The reorder buffer described by Smith and Pleszkun is a circular queue, in which each entry holds all the instruction state necessary to carry one instruction through the various phases of instruction execution (operand fetch, execution, register file writeback, etc.). Instructions are placed into the queue and taken out of the queue in-order. The original implementation enqueues instructions at the time of instruction issue, i.e. once all the operands are available. They have described two variations: one in which the operands must be obtained though the register file and another in which operands can be read directly out of the reorder buffer itself if the latest copy of a register value is present in the result portion of a reorder buffer entry holding an instruction that has finished executing but has not yet written its result to the register file. Though instructions are enqueued in program order and only when their operands are available, while they are in the reorder buffer program order does not dictate the sequencing of any particular events: in particular, the instructions might finish executing out of program order.
When an instruction is successfully dequeued
from the reorder buffer, its results are committed to the machine state. At this
point, the instruction is said to be retired. While an instruction is in the
process of execution, i.e. before retiring, it may cause an exceptional
condition (invalid opcode, TLB miss, segmentation violation, trap instruction,
etc.). To ensure that such exceptions are not handled speculatively or
out-of-order, the handling of exceptional conditions is treated like the
updating of machine state: exception handling does not occur until instruction
commit time. This ensures that no previous instructions have caused
as-yet-unhandled exceptions. Therefore, all exceptions are handled in program
order, and no exception is handled for an instruction that ends up being
discarded (as a result of a branch misprediction, for example).
Tomasulo’s algorithm provided a template for building a high performance machine
with out-of-order issue and imprecise interrupts. A few years after the reorder
buffer paper appeared, and twenty years after Tomasulo’s algorithm appeared, the
problem of supporting precise interrupts was solved. Sohi and Vajapeyam describe
in their paper [SohV87] the marriage of Tomasulo’s algorithm and the reorder
buffer; they call their mechanism the register update unit (RUU). This mechanism
provides both out-of-order issue to functional units and support for precise
interrupts.
The RUU is a very intuitive merger of Tomasulo’s algorithm and the reorder
buffer. It is a circular queue into which instructions are enqueued in-order and
out of which instructions are dequeued in-order. Meanwhile, instruction operands
are gathered according to Tomasulo’s algorithm, and instructions are sent to
functional units as soon as all their operands are valid. The mechanism as
described in the paper only supports single-instruction enqueue/issue/commit;
however, this is not inherent to its design—it is easily extended to wider
implementations.
The primary place of deviation with reorder buffer operation is at instruction
enqueue: first and foremost, an instruction is enqueued as soon as an RUU slot
is available, whether the instruction’s operands are available or not. Also,
whereas in the reorder buffer scheme an instruction can read its operands out of
the reorder buffer if available and out of the register file otherwise, in the
RUU scheme an instruction never reads from other RUU entries directly. Rather,
operands are gathered from the functional-unit result buses as in Tomasulo’s
algorithm, and they are also gathered from the commit bus, which carries the
result of the currently committing instruction to the register file. Therefore,
if at the time of instruction-enqueue the most recent version of one of its
operands is found in the result field of another instruction’s RUU entry, the
operand does not become immediately available to the instruction as it would in
the reorder buffer scheme, but it does become available as soon as the value is
committed to the register file.
Figure 1 shows the pipeline organization of RiSC-16 which was the initial architecture of Niloofar 1. It's out-of-order design is very much like Sohi & Vajapeyam’s register update unit, except that it is twice as wide. The instruction encode/issue/commit logic is Tomasulo’s algorithm extended to handle dual-enqueue, dual-commit, and three-way issue to the execute units (two ALU, one memory). The mechanism handles branch mispredictions and interrupts by placing instructions not in a disjoint set of reservation stations but rather in a circular instruction queue that functions similarly to Smith & Pleszkun’s reorder buffer. Like the register update unit, the design departs from the reorder buffer in several ways. First, an instruction is enqueued as soon as an instruction-queue slot is available, whether the instruction’s operands are available or not. Second, instruction operands are not read from the instruction queue at enqueue time if the register file does not have the latest copy of a data word; rather, the instruction must wait until the data is available on the commit buses, the ALU buses, or the memory bus.

Figure1. Overview of the pipeline organization
Because the computer-architecture field cannot seem to agree on definitions for issue/dispatch, we do not use the term “dispatch” and instead use the term “enqueue” to mean placing an instruction into the instruction queue and the term “issue” to mean sending a ready instruction to one of the functional units. This is illustrated in Figure 1. The architecture is two-way fetch, two-way enqueue, three-way issue and execute (two ALU instructions and one memory instruction), and two-way commit. Branch prediction is a simple backward-taken/forward-not-taken algorithm. The instruction queue has eight entries and is integrated with the memory queue; therefore, the system effectively has eight miss-status holding registers (MSHRs), enabling the handling of up to eight outstanding cache misses. However, this is artificially limited to three simultaneous requests being handled, to better reflect current DRAM design: there are three entries in the memory queue, which handles the memory interface and thus limits the number of simultaneous requests.
Figure 1 also shows four of the phases through which all committed instructions pass: fetch, enqueue, issue, and execute. Each of these takes one cycle. In this document, they are not called “stages” because that term implies more rigid timing and ordering, whereas instructions in an out-of- order core spend variable lengths of time in each stage and can visit some stages out of program order. The next section describes each of the phases in more detail.
| Phases | Timing | Branch Hazards | Synthesis |