The PowerPC 620 is a 64-bit processor that employs a two-phase branch prediction technique, dynamic register renaming, multi-entry reservation stations, six execution units, and a completion buffer for precise exceptions. To evaluate the 620's performance, the Visualization-based Microarchitecture Workbench (VMW) was used to simulate the 620 microprocessor. The evaluation was based on seven benchmarks taken from SPEC92.
The 620 architecture has 32 integer registers and 32 floating-point registers. It has a 32-bit condition register that is divided into eight 4-bit fields. The architecture has other registers as well: the count and link registers, used mainly by branch instructions, an integer exception register, and a floating-point status register. It has six execution units: two simple integer units, one complex multi-cycle integer unit, one floating-point unit, one load-store unit, and a branch unit. It also implements register renaming and provides a reservation station for each execution unit.
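The layout of the condition register can be illustrated with a minimal sketch. It assumes the usual PowerPC numbering, in which field 0 occupies the four most significant bits of the 32-bit register; the helper names `cr_field` and `set_cr_field` are hypothetical and chosen only for this illustration.

```python
CR_FIELDS = 8        # eight fields, as stated in the text
FIELD_WIDTH = 4      # each field is 4 bits wide


def _shift(index: int) -> int:
    """Bit offset of field `index`, assuming field 0 is the most significant."""
    if not 0 <= index < CR_FIELDS:
        raise ValueError("CR field index must be in 0..7")
    return 32 - FIELD_WIDTH * (index + 1)


def cr_field(cr: int, index: int) -> int:
    """Extract the 4-bit condition field `index` from a 32-bit CR value."""
    return (cr >> _shift(index)) & 0xF


def set_cr_field(cr: int, index: int, value: int) -> int:
    """Return a copy of `cr` with field `index` replaced by the low 4 bits of `value`."""
    shift = _shift(index)
    mask = 0xF << shift
    return (cr & ~mask & 0xFFFFFFFF) | ((value & 0xF) << shift)
```

Because each field is independent, a compare instruction can write one field while a branch tests another, which is one reason the register is split this way.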

The PowerPC 620 Microprocessor
The 620 uses a typical five-stage pipeline: Fetch, Dispatch, Execute, Complete, and Writeback. A summarized description of each stage is given below.
Fetch stage: Fetches up to four instructions from the I-cache into the instruction buffer. A preliminary branch prediction is made during this stage using a branch target address buffer.
Dispatch stage: Decodes instructions and checks whether they can be dispatched to the reservation stations. If so, it allocates reservation station entries as needed, as well as completion buffer entries and rename buffer entries. Each execution unit can accept one instruction per cycle.
Execute stage: Many execution units pipeline the execution, e.g. the floating-point unit uses 3 pipeline stages. After the instruction is executed, the result is sent to the destination rename buffer and forwarded to any waiting instruction, and the instruction is marked as finished.
Complete stage: Finished instructions, up to four, are removed from the completion buffer in-order and passed to the Writeback stage. By holding instructions in the completion buffer until writeback, it is guaranteed that the registers hold the correct state up to the most recently completed instruction.
Writeback stage: Commits completed instruction results from the rename buffers to the register files.
The 620 uses two buffers between the pipeline stages. One is the instruction buffer, which holds instructions between the fetch and dispatch stages. The other is the completion buffer, which records the state of instructions that have not yet completed. This ensures precise exceptions. The instruction buffer has eight entries, while the completion buffer has sixteen. Each execution unit has its own reservation station. Reservation stations hold two to four instructions waiting to execute. Each instruction waits in a reservation station until all of its source operands have been read or forwarded and the execution unit is available.
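The in-order completion described above can be sketched as a small queue model. This is only an illustration under simplifying assumptions: the sixteen-entry capacity and the limit of four completions per cycle come from the text, while the class name `CompletionBuffer`, its methods, and the use of integer tags are hypothetical.

```python
from collections import deque

COMPLETION_BUFFER_SIZE = 16   # from the text
MAX_COMPLETE_PER_CYCLE = 4    # from the text


class CompletionBuffer:
    """Toy model: entries are allocated in program order and retired in order."""

    def __init__(self):
        self.entries = deque()  # oldest instruction at the left

    def allocate(self, tag) -> bool:
        """Reserve an entry at dispatch; False models a dispatch stall."""
        if len(self.entries) >= COMPLETION_BUFFER_SIZE:
            return False
        self.entries.append({"tag": tag, "finished": False})
        return True

    def mark_finished(self, tag) -> None:
        """Called when an execution unit has produced the result for `tag`."""
        for entry in self.entries:
            if entry["tag"] == tag:
                entry["finished"] = True
                return
        raise KeyError(tag)

    def complete_cycle(self) -> list:
        """Retire up to four finished instructions, stopping at the first unfinished one."""
        retired = []
        while (self.entries and self.entries[0]["finished"]
               and len(retired) < MAX_COMPLETE_PER_CYCLE):
            retired.append(self.entries.popleft()["tag"])
        return retired
```

An instruction that finishes early still waits behind any older unfinished instruction, which is what keeps the architectural state precise when an exception occurs.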
VMW was used to simulate the 620 architecture. A description file for the 620 was provided to VMW, along with seven SPEC92 benchmarks: compress, eqntott, espresso, and li for integer, and alvinn, hydro2d, and tomcatv for floating-point. VMW uses trace-driven simulation, which is fast, but some long-latency instructions, such as divide, cannot be simulated accurately. The simulation used a 32KB, 8-way set-associative I-cache and a 32KB, 2-way interleaved D-cache. A cache miss latency of eight cycles and a perfect L2 cache were assumed.
The table below summarizes the simulation results for both the integer and the floating-point benchmarks.
| Benchmark | IPC | Branch pred. accuracy | % dispatch stalls | % issue stalls | I-cache miss | D-cache miss |
|---|---|---|---|---|---|---|
| Avg. Int. | 1.23 | 89.9% | 67.7% | 43.4% | 0.18% | 1.65% |
| Avg. FP | 1.26 | 97.1% | 82% | 22.5% | 0.05% | 3.0% |
Summary of the simulation evaluation of the 620 Microprocessor
Although the 620 can issue up to four instructions per cycle, the average effective IPC for both the integer and the floating-point benchmarks is below 1.3. The gap is due to frequent stalls in the pipeline stages. Some causes of these stalls are summarized below.
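To put the gap in perspective, the measured IPC can be compared with the four-instruction peak. The short sketch below does only that arithmetic; the IPC values and the peak width are taken from the text, and the function name `utilization` is hypothetical.

```python
PEAK_WIDTH = 4  # peak instructions per cycle, from the text

# Average IPC values reported in the results table.
measured_ipc = {"integer": 1.23, "floating-point": 1.26}


def utilization(ipc: float, width: int = PEAK_WIDTH) -> float:
    """Fraction of the peak instruction bandwidth that is actually achieved."""
    return ipc / width


for name, ipc in measured_ipc.items():
    print(f"{name}: {utilization(ipc):.1%} of peak")
```

Both benchmark groups use a little under a third of the peak bandwidth, so roughly two of every three instruction slots go unused.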
First, for branch stalls, the 620 uses a 256-entry, 2-way set-associative branch target address table in the early fetch stage, and a 2-bit, 2048-entry direct-mapped branch history table (BHT) in the dispatch stage. The overall branch prediction depends mainly on the BHT: if the two predictions differ, the one from the BHT is used. Branch prediction does quite well on the floating-point benchmarks, predicting about 97% of all executed branches correctly, but it needs improvement for the integer benchmarks, where it achieves only 89.9% accuracy. Because the 620 can speculate through up to four branches, branch prediction accuracy is critical.
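A 2-bit branch history table of this size can be sketched as an array of saturating counters. This is a generic textbook model, not the 620's actual logic: the entry count comes from the text, but the indexing by low-order word-address bits, the initial counter value, and the names `BranchHistoryTable` and `accuracy` are assumptions made for illustration.

```python
BHT_ENTRIES = 2048  # from the text


class BranchHistoryTable:
    """Direct-mapped table of 2-bit saturating counters (0..3)."""

    def __init__(self, entries: int = BHT_ENTRIES, initial: int = 1):
        # Assumption: counters start at 1 (weakly not-taken).
        self.counters = [initial] * entries

    def _index(self, pc: int) -> int:
        # Assumption: index with the low-order bits of the word-aligned address.
        return (pc >> 2) % len(self.counters)

    def predict(self, pc: int) -> bool:
        """Predict taken when the counter is in one of the two upper states."""
        return self.counters[self._index(pc)] >= 2

    def update(self, pc: int, taken: bool) -> None:
        """Move the counter one step toward the actual outcome, saturating at 0 and 3."""
        i = self._index(pc)
        if taken:
            self.counters[i] = min(3, self.counters[i] + 1)
        else:
            self.counters[i] = max(0, self.counters[i] - 1)


def accuracy(bht: BranchHistoryTable, trace) -> float:
    """Fraction of (pc, taken) pairs in `trace` that the table predicts correctly."""
    correct = 0
    for pc, taken in trace:
        correct += bht.predict(pc) == taken
        bht.update(pc, taken)
    return correct / len(trace)
```

Two bits of hysteresis mean a single loop exit does not flip the prediction for the next loop entry, which suits the regular loop branches common in floating-point code. Since the table is direct-mapped, two branches whose addresses differ by a multiple of the table size share a counter and can interfere.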
Second, in the dispatch stage, the 620 uses an in-order policy to advance instructions from the instruction buffer to the reservation stations. Stalls can arise in several ways in the dispatch stage, as described below.
Utilization of the load-store unit’s three reservation station entries averages 1.36 to 1.73 entries for integer benchmarks and 0.98 to 2.26 entries for floating-point benchmarks. Unlike the other execution units, the load-store unit does not deallocate a reservation station entry as soon as an instruction is issued. The reservation station entry is held until the instruction is finished, usually two cycles after the instruction is issued. The in-order issue constraint of the floating-point unit and the non-pipelining of some floating-point instructions prevent some ready instructions from issuing as well.
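The interaction between in-order dispatch, the one-instruction-per-unit limit, and full reservation stations can be sketched as follows. This is a deliberately simplified model: it ignores completion buffer and rename buffer availability, treats each instruction as bound to a single named unit, and uses a hypothetical function name and unit labels.

```python
def dispatch_cycle(instr_buffer: list, rs_free: dict, width: int = 4) -> list:
    """Dispatch instructions in program order for one cycle.

    instr_buffer: unit names in program order (oldest first); dispatched
                  entries are removed from the front.
    rs_free:      free reservation station entries per unit; updated in place.
    Returns the list of unit names dispatched this cycle.
    """
    dispatched = []
    used_units = set()
    while instr_buffer and len(dispatched) < width:
        unit = instr_buffer[0]
        # Each unit accepts one instruction per cycle, and needs a free entry.
        if unit in used_units or rs_free.get(unit, 0) == 0:
            break  # in-order dispatch: younger instructions must wait too
        used_units.add(unit)
        rs_free[unit] -= 1
        dispatched.append(instr_buffer.pop(0))
    return dispatched
```

In this model, two consecutive load-store instructions limit dispatch to one instruction in that cycle even when the floating-point and integer reservation stations behind them are empty, which illustrates how a busy load-store unit can throttle the whole dispatch stage.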
Issue stalls also arise. Once instructions have been dispatched to the reservation stations, they must wait for their source operands to become available before they begin execution. There are a few other constraints, however; the main issue hazards are described below.
Most structural issue stalls (those caused by waiting for an execution unit) occur in the load-store unit. More load-store instructions are ready to execute than the load-store unit can accommodate.
The overall time an instruction typically spends in each execution unit is summarized below. For each execution unit, the overall execution latency is the actual execution time of an instruction plus its waiting time in the reservation station.
| Execution unit | Avg. Integer | Avg. Floating-point |
|---|---|---|
| Integer unit 1 | 1.8 | 1.2 |
| Integer unit 2 | 2.0 | 1.3 |
| Complex integer unit | 4.5 | 4.9 |
| Floating-point unit | -- | 5.5 |
| Load-Store unit | 3.0 | 2.7 |
| Branch unit | 3.0 | 3.2 |
Overall Average execution latency (in cycles)
The 620's main strengths are its branch prediction mechanisms and its out-of-order execution units. The 620 does reasonably well on branch prediction. For the floating-point benchmarks, about 97% of the branches are resolved or correctly predicted, incurring few or no penalty cycles. The integer benchmarks fare worse: the average drops to 89.9%. More sophisticated prediction algorithms could increase prediction accuracy. Even with support for precise interrupts, out-of-order execution in the 620 still achieves a reasonable degree of instruction-level parallelism, with an average IPC of 1.23 for the integer benchmarks and 1.26 for the floating-point benchmarks.
The 620 architecture also has a number of bottlenecks. One is the load-store unit: both the number of load-store reservation station entries and the number of load-store units need to be increased. Having only one floating-point unit for three integer units is another bottleneck. The integer benchmarks rarely stall on the integer units, but the floating-point benchmarks do stall waiting for floating-point resources. Allowing only a single dispatch to each reservation station per cycle is also a source of dispatch stalls, which can reduce the number of instructions available for out-of-order execution. The 620 implements distributed reservation stations; the alternative is a centralized reservation station. Although distributed reservation stations permit simpler hardware, in that they need only be single-ported, the centralized approach can share multiple ports and the reservation station entries among different instruction types.