Home > Out-of-order Implementation


Out-of-Order Implementation

The processor architecture uses Tomasulo's algorithm [Tom67]. It gives a concrete hardware architecture for resolving inter-instruction dependencies through the register file, thereby allowing out-of-order issue to the functional units. The architecture is very simple but extremely powerful and capable of resolving all dependencies through the register file. It also provides a form of register renaming that allows the simultaneous or out-of-order execution of multiple reads and writes to the same register. The algorithm (as well as numerous variations on it) has become a staple in modern high-performance CPU design.

A reorder buffer as described in [SmiP85] is used for providing a mechanism to allow instructions that generate results out-of-order to be completed in-order. The reorder buffer described by Smith and Pleszkun is a circular queue, in which each entry holds all the instruction state necessary to carry one instruction through the various phases of instruction execution (operand fetch, execution, register file writeback, etc.). Instructions are placed into the queue and taken out of the queue in-order. The original implementation enqueues instructions at the time of instruction issue, i.e. once all the operands are available. They have described two variations: one in which the operands must be obtained though the register file and another in which operands can be read directly out of the reorder buffer itself if the latest copy of a register value is present in the result portion of a reorder buffer entry holding an instruction that has finished executing but has not yet written its result to the register file. Though instructions are enqueued in program order and only when their operands are available, while they are in the reorder buffer program order does not dictate the sequencing of any particular events: in particular, the instructions might finish executing out of program order.

When an instruction is successfully dequeued from the reorder buffer, its results are committed to the machine state. At this point, the instruction is said to be retired. While an instruction is in the process of execution, i.e. before retiring, it may cause an exceptional condition (invalid opcode, TLB miss, segmentation violation, trap instruction, etc.). To ensure that such exceptions are not handled speculatively or out-of-order, the handling of exceptional conditions is treated like the updating of machine state: exception handling does not occur until instruction commit time. This ensures that no previous instructions have caused as-yet-unhandled exceptions. Therefore, all exceptions are handled in program order, and no exception is handled for an instruction that ends up being discarded (as a result of a branch misprediction, for example).

Tomasulo’s algorithm provided a template for building a high performance machine with out-of-order issue and imprecise interrupts. A few years after the reorder buffer paper appeared, and twenty years after Tomasulo’s algorithm appeared, the problem of supporting precise interrupts was solved. Sohi and Vajapeyam describe in their paper [SohV87] the marriage of Tomasulo’s algorithm and the reorder buffer; they call their mechanism the register update unit (RUU). This mechanism provides both out-of-order issue to functional units and support for precise interrupts.

The RUU is a very intuitive merger of Tomasulo’s algorithm and the reorder buffer. It is a circular queue into which instructions are enqueued in-order and out of which instructions are dequeued in-order. Meanwhile, instruction operands are gathered according to Tomasulo’s algorithm, and instructions are sent to functional units as soon as all their operands are valid. The mechanism as described in the paper only supports single-instruction enqueue/issue/commit; however, this is not inherent to its design—it is easily extended to wider implementations.

The primary place of deviation with reorder buffer operation is at instruction enqueue: first and foremost, an instruction is enqueued as soon as an RUU slot is available, whether the instruction’s operands are available or not. Also, whereas in the reorder buffer scheme an instruction can read its operands out of the reorder buffer if available and out of the register file otherwise, in the RUU scheme an instruction never reads from other RUU entries directly. Rather, operands are gathered from the functional-unit result buses as in Tomasulo’s algorithm, and they are also gathered from the commit bus, which carries the result of the currently committing instruction to the register file. Therefore, if at the time of instruction-enqueue the most recent version of one of its operands is found in the result field of another instruction’s RUU entry, the operand does not become immediately available to the instruction as it would in the reorder buffer scheme, but it does become available as soon as the value is committed to the register file.

Figure 1 shows the pipeline organization of RiSC-16 which was the initial architecture of Niloofar 1. It's out-of-order design is very much like Sohi & Vajapeyam’s register update unit, except that it is twice as wide. The instruction encode/issue/commit logic is Tomasulo’s algorithm extended to handle dual-enqueue, dual-commit, and three-way issue to the execute units (two ALU, one memory). The mechanism handles branch mispredictions and interrupts by placing instructions not in a disjoint set of reservation stations but rather in a circular instruction queue that functions similarly to Smith & Pleszkun’s reorder buffer. Like the register update unit, the design departs from the reorder buffer in several ways. First, an instruction is enqueued as soon as an instruction-queue slot is available, whether the instruction’s operands are available or not. Second, instruction operands are not read from the instruction queue at enqueue time if the register file does not have the latest copy of a data word; rather, the instruction must wait until the data is available on the commit buses, the ALU buses, or the memory bus.

Click to enlarge
Figure1. Overview of the pipeline organization

Because the computer-architecture field cannot seem to agree on definitions for issue/dispatch, we do not use the term “dispatch” and instead use the term “enqueue” to mean placing an instruction into the instruction queue and the term “issue” to mean sending a ready instruction to one of the functional units. This is illustrated in Figure 1. The architecture is two-way fetch, two-way enqueue, three-way issue and execute (two ALU instructions and one memory instruction), and two-way commit. Branch prediction is a simple backward-taken/forward-not-taken algorithm. The instruction queue has eight entries and is integrated with the memory queue; therefore, the system effectively has eight miss-status holding registers (MSHRs), enabling the handling of up to eight outstanding cache misses. However, this is artificially limited to three simultaneous requests being handled, to better reflect current DRAM design: there are three entries in the memory queue, which handles the memory interface and thus limits the number of simultaneous requests.

Figure 1 also shows four of the phases through which all committed instructions pass: fetch, enqueue, issue, and execute. Each of these takes one cycle. In this document, they are not called “stages” because that term implies more rigid timing and ordering, whereas instructions in an out-of- order core spend variable lengths of time in each stage and can visit some stages out of program order. The next section describes each of the phases in more detail.

Next

Home

ISA

Phases Timing Branch Hazards Synthesis

References