
10 First time versus repeated execution

A piece of code usually takes much more time the first time it is executed than when it is repeated. The reasons are the following:

1. Loading the code from RAM into the cache takes longer than executing it.

2. Any data accessed by the code has to be loaded into the data cache, which may take much more time than executing the instructions. The data are more likely to be in the cache when the code is repeated.

3. Jump instructions will not be in the branch target buffer the first time they execute, and therefore are less likely to be predicted correctly. See chapter 12 page 35.

4. In many processors, decoding the length of each instruction is a bottleneck. The P1 solves this problem by remembering the length of any instruction which has remained in the cache since the last time it was executed. As a consequence of this, two consecutive instructions will not pair in the P1 the first time they are executed, unless the first of the two instructions is only one byte long. The PMMX, PPro, P2 and P3 have no penalty on first-time decoding. On the P4, instructions go directly from the decoder to the execution unit the first time they are executed. On subsequent executions they are submitted from the trace cache at the rate of 3 uops per clock.

For these four reasons, a piece of code inside a loop will generally take much more time the first time it executes than the subsequent times.

If you have a big loop which doesn't fit into the trace cache or code cache then you will get penalties all the time because it doesn't run from the cache. You should therefore try to reorganize the loop to make it fit into the cache.

If you have very many jumps, calls, and branches inside a loop, then you may get the penalty of branch target buffer misses repeatedly.

Likewise, if a loop repeatedly accesses a data structure too big for the data cache, then you will get the penalty of data cache misses all the time.

11 Out-of-order execution (PPro, P2, P3, P4)

The most important improvement that came with the sixth generation of microprocessors is out-of-order execution. The idea is that if the execution of a particular instruction is delayed because its input data are not available yet, then the microprocessor will try to find later instructions that it can do first, provided that the input data for those later instructions are ready. Obviously, the microprocessor has to check whether the later instructions need the output of the earlier instruction. If each instruction depends on the result of the preceding instruction, then there is no opportunity for out-of-order execution. The logic for determining input dependences, and the mechanisms for executing instructions as soon as the necessary inputs are ready, give us the further advantage that the microprocessor can do several things at the same time. If we need to do an addition and a multiplication, and neither instruction depends on the output of the other, then we can do both at the same time, because they use two different execution units. But we cannot do two multiplications at the same time, because both need the same execution unit.

Typically, everything in these microprocessors is highly pipelined in order to improve the throughput. If, for example, a floating-point addition takes 5 clock cycles, and the execution unit is fully pipelined, then we can start one addition at time T, which will be finished at time T+5, and start another addition at time T+1, which will be finished at time T+6. The advantage of this technology is therefore highest if we can organize our code so that there are as few dependencies as possible between successive instructions.

11.1 Instructions are split into uops

Microprocessors with out-of-order execution translate all instructions into micro-operations, abbreviated uops. A simple instruction such as ADD EAX,EBX generates only one uop, while an instruction like ADD EAX,[MEM1] generates two: one for reading from memory into a temporary (unnamed) register, and one for adding the contents of the temporary register to EAX. The instruction ADD [MEM1],EAX generates three uops: one for reading from memory, one for adding, and one for writing the result back to memory. The advantage of this is that the uops can be executed out of order. Example:

MOV  EAX, [MEM1]
IMUL EAX, 5
ADD  EAX, [MEM2]
MOV  [MEM3], EAX

Here, the ADD EAX,[MEM2] instruction is split into two uops. The advantage of this is that the microprocessor can fetch the value of [MEM2] at the same time as it is doing the multiplication. If none of the data are in the cache, then the microprocessor will start to fetch [MEM2] immediately after starting to fetch [MEM1], and long before the multiplication can start. This principle also makes the stack work more efficiently. Consider the example:

PUSH EAX
CALL FUNC

The PUSH EAX instruction is split into two uops which can be represented as SUB ESP,4 and MOV [ESP],EAX. The advantage of this is that the SUB ESP,4 uop can be executed even if the value of EAX is not ready yet. The CALL operation needs the new value of ESP, so the CALL would have to wait for the value of EAX if the PUSH instruction was not split into uops. Thanks to the use of uops, the value of the stack pointer almost never causes delays in normal programs.


11.2 Register renaming

Consider the example:

MOV  EAX, [MEM1]
IMUL EAX, 6
MOV  [MEM2], EAX
MOV  EAX, [MEM3]
ADD  EAX, 2
MOV  [MEM4], EAX

This piece of code is doing two things that have nothing to do with each other: multiplying [MEM1] by 6 and adding 2 to [MEM3]. If we had used a different register in the last three instructions, then the independence would be obvious. And, in fact, the microprocessor is smart enough to do just that: it uses a different temporary register in the last three instructions so that it can do the multiplication and the addition in parallel. The instruction set gives us only seven general-purpose 32-bit registers, and often we are using them all, so we cannot afford the luxury of using a new register for every calculation. But the microprocessor has more temporary registers to use. The PPro, P2 and P3 have 40 universal temporary registers, and the P4 has 128. The microprocessor can rename any of these temporary registers to represent the logical register EAX.

Register renaming works fully automatically and in a very simple way. Every time an instruction writes to or modifies a logical register, the microprocessor assigns a new temporary register to that logical register. The first instruction in the above example will assign one temporary register to EAX. The second instruction is putting a new value into EAX, so a new temporary register will be assigned here. In other words, the multiplication instruction will use two different registers, one for input and another one for output. The next example illustrates the advantage of this:

MOV  EAX, [MEM1]
MOV  EBX, [MEM2]
ADD  EBX, EAX
IMUL EAX, 6
MOV  [MEM3], EAX
MOV  [MEM4], EBX

Assume, now, that [MEM1] is in the cache, while [MEM2] is not. This means that the multiplication can start before the addition. The advantage of using a new temporary register for the result of the multiplication is that we still have the old value of EAX, which has to be kept until EBX is ready for the addition. If we had used the same register for the input and output of the multiplication, then the multiplication could not start until the loading of EBX and the addition were finished.

After all the operations are finished, the value in the temporary register that represents the last value of EAX in the code sequence is written to a permanent EAX register. This process is called retirement (see page 71 and 88).

All general-purpose registers, the stack pointer, the flags, floating-point registers, MMX registers, XMM registers and segment registers can be renamed. Control words and the floating-point status word cannot be renamed, and this is the reason why the use of these registers is slow.

11.3 Dependence chains

A series of instructions where each instruction depends on the result of the preceding one is called a dependence chain. Long dependence chains should be avoided, if possible, because they prevent out-of-order and parallel execution. Example:

MOV  EAX, [MEM1]
IMUL EAX, [MEM2]
IMUL EAX, [MEM3]
IMUL EAX, [MEM4]
MOV  [MEM5], EAX

In this example, the IMUL instructions generate several uops each: one for reading from memory, and one or more for multiplying. The read uops can execute out of order, while the multiplication uops must wait for the previous uops to finish. This dependence chain takes quite a long time, because multiplication is slow. You can improve this code by breaking up the dependence chain. The trick is to use multiple accumulators:

MOV  EAX, [MEM1]     ; start first chain
MOV  EBX, [MEM2]     ; start other chain in different accumulator
IMUL EAX, [MEM3]
IMUL EBX, [MEM4]
IMUL EAX, EBX        ; join chains in the end
MOV  [MEM5], EAX

Here, the second IMUL instruction can start before the first one is finished.

Floating-point instructions often have a longer latency than integer instructions, so you should definitely break up long dependence chains with floating-point instructions:

FLD  [MEM1]          ; start first chain