Fog, A. How to optimize for the Pentium family of microprocessors. 2004.

with prediction in the return stack buffer. Note also that the ADD ESP instruction can cause an AGI stall in earlier processors.

14.5 Retirement

Retirement is a process where the temporary registers used by the uops are copied into the permanent registers EAX, EBX, etc. When a uop has been executed, it is marked in the ROB as ready to retire.

The retirement station can handle three uops per clock cycle. This may not seem like a problem because the throughput is already limited to 3 uops per clock in the RAT. But retirement may still be a bottleneck for two reasons. Firstly, instructions must retire in order: a uop that is executed out of order cannot retire before all preceding uops have retired. Secondly, taken jumps must retire in the first of the three slots in the retirement station. Just as decoders D1 and D2 can be idle if the next instruction only fits into D0, the last two slots in the retirement station can be idle if the next uop to retire is a taken jump. This is significant if you have a small loop where the number of uops in the loop is not divisible by three.

All uops stay in the reorder buffer (ROB) until they retire. The ROB can hold 40 uops. This sets a limit to the number of instructions that can execute during the long delay of a division or other slow operation. Before the division is finished the ROB will be filled up with executed uops waiting to retire. Only when the division is finished and retired can the subsequent uops begin to retire, because retirement takes place in order.

In case of speculative execution of predicted branches (see page 43) the speculatively executed uops cannot retire until it is certain that the prediction was correct. If the prediction turns out to be wrong then the speculatively executed uops are discarded without retirement.

The following instructions cannot execute speculatively: memory writes, IN, OUT, and serializing instructions.

14.6 Partial register stalls

A partial register stall is a problem that occurs in the PPro, P2 and P3 when you write to part of a 32-bit register and later read from the whole register or a bigger part of it. Example:

    MOV     AL, BYTE PTR [M8]
    MOV     EBX, EAX            ; partial register stall

This gives a delay of 5-6 clocks. The reason is that a temporary register has been assigned to AL (to make it independent of AH). The execution unit has to wait until the write to AL has retired before it is possible to combine the value from AL with the value of the rest of EAX. The stall can be avoided by changing the code to:

    MOVZX   EBX, BYTE PTR [MEM8]
    AND     EAX, 0FFFFFF00h
    OR      EBX, EAX

Of course you can also avoid the partial stalls by putting in other instructions after the write to the partial register so that it has time to retire before you read from the full register.
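A sketch of this scheduling approach, in the same syntax as the examples above. The filler instructions, the registers they use and the operands [MEM8], [ESI] and [EDI] are assumptions made for illustration. How much independent work is needed for the write to AL to retire depends on the surrounding code, so a short filler like this one may only shorten the stall rather than remove it, and the result should be verified by measurement.

```asm
        MOV     AL, BYTE PTR [MEM8]     ; write to the partial register AL
        ; Independent work (hypothetical) that neither reads nor writes EAX,
        ; giving the write to AL time to retire:
        MOV     ECX, [ESI]
        ADD     ECX, [EDI]
        MOV     [EDI], ECX
        ADD     ESI, 4
        ADD     EDI, 4
        MOV     EBX, EAX                ; full-register read, placed as late as possible
```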

You should be aware of partial stalls whenever you mix different data sizes (8, 16, and 32 bits):

    MOV     BH, 0
    ADD     BX, AX              ; stall
    INC     EBX                 ; stall

You don't get a stall when reading a partial register after writing to the full register, or a bigger part of it:

    MOV     EAX, [MEM32]
    ADD     BL, AL              ; no stall
    ADD     BH, AH              ; no stall
    MOV     CX, AX              ; no stall
    MOV     DX, BX              ; stall

The easiest way to avoid partial register stalls is to always use full registers and use MOVZX or MOVSX when reading from smaller memory operands. These instructions are fast on the PPro, P2 and P3, but slow on earlier processors. Therefore, a compromise is offered when you want your code to perform reasonably well on all processors. The replacement for MOVZX EAX,BYTE PTR [M8] looks like this:

    XOR     EAX, EAX
    MOV     AL, BYTE PTR [M8]

The PPro, P2 and P3 processors make a special case out of this combination to avoid a partial register stall when later reading from EAX. The trick is that a register is tagged as empty when it is XOR'ed with itself. The processor remembers that the upper 24 bits of EAX are zero, so that a partial stall can be avoided. This mechanism works only on certain combinations:

    XOR     EAX, EAX
    MOV     AL, 3
    MOV     EBX, EAX            ; no stall

    XOR     AH, AH
    MOV     AL, 3
    MOV     BX, AX              ; no stall

    XOR     EAX, EAX
    MOV     AH, 3
    MOV     EBX, EAX            ; stall

    SUB     EBX, EBX
    MOV     BL, DL
    MOV     ECX, EBX            ; no stall

    MOV     EBX, 0
    MOV     BL, DL
    MOV     ECX, EBX            ; stall

    MOV     BL, DL
    XOR     EBX, EBX            ; no stall

Setting a register to zero by subtracting it from itself works the same as the XOR, but setting it to zero with the MOV instruction doesn't prevent the stall.

You can place the XOR outside a loop:

    XOR     EAX, EAX
    MOV     ECX, 100
LL: MOV     AL, [ESI]
    MOV     [EDI], EAX          ; no stall
    INC     ESI
    ADD     EDI, 4
    DEC     ECX
    JNZ     LL

The processor remembers that the upper 24 bits of EAX are zero as long as you don't get an interrupt, misprediction, or other serializing event.

You should remember to neutralize any partial register you have used before calling a subroutine that might push the full register:

    ADD     BL, AL
    MOV     [MEM8], BL
    XOR     EBX, EBX            ; neutralize BL
    CALL    _HighLevelFunction

Most high-level language procedures push EBX at the start of the procedure, and this would generate a partial register stall in the example above if you hadn't neutralized BL.

Setting a register to zero with the XOR method doesn't break its dependence on earlier instructions on PPro, P2 and P3 (but it does on P4). Example:

    DIV     EBX
    MOV     [MEM], EAX
    MOV     EAX, 0              ; break dependence
    XOR     EAX, EAX            ; prevent partial register stall
    MOV     AL, CL
    ADD     EBX, EAX

Setting EAX to zero twice here seems redundant, but without the MOV EAX,0 the last instructions would have to wait for the slow DIV to finish, and without XOR EAX,EAX you would have a partial register stall.

The FNSTSW AX instruction is special: in 32-bit mode it behaves as if writing to the entire EAX. In fact, it does something like this in 32-bit mode:

AND EAX,0FFFF0000h / FNSTSW TEMP / OR EAX,TEMP

hence, you don't get a partial register stall when reading EAX after this instruction in 32-bit mode:

    FNSTSW AX / MOV EBX,EAX     ; stall only if 16 bit mode
    MOV AX,0 / FNSTSW AX        ; stall only if 32 bit mode

Partial flags stalls

The flags register can also cause partial register stalls:

    CMP     EAX, EBX
    INC     ECX
    JBE     XX                  ; partial flags stall

The JBE instruction reads both the carry flag and the zero flag. Since the INC instruction changes the zero flag, but not the carry flag, the JBE instruction has to wait for the two preceding instructions to retire before it can combine the carry flag from the CMP instruction and the zero flag from the INC instruction. This situation is likely to be a bug in the assembly code rather than an intended combination of flags. To correct it, change INC ECX to ADD ECX,1. A similar bug that causes a partial flags stall is SAHF / JL XX. The JL instruction tests the sign flag and the overflow flag, but SAHF doesn't change the overflow flag. To correct it, change JL XX to JS XX.
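The two corrections described above can be written out as follows. The label XX and the registers are carried over from the example. The third variant is an assumption about the programmer's intent (that the branch was meant to test the result of the CMP) rather than something stated in the text.

```asm
; Correction 1: ADD writes all the arithmetic flags, so JBE takes both
; the carry flag and the zero flag from a single instruction.
; Note that the branch now tests the result of the ADD, not of the CMP.
        CMP     EAX, EBX
        ADD     ECX, 1
        JBE     XX              ; no partial flags stall

; Correction 2: SAHF leaves the overflow flag unchanged, so test only
; the sign flag that it loaded from AH.
        SAHF
        JS      XX

; Variant (assumption): if the branch was meant to test the CMP,
; do the increment first so that CMP is the last flag writer.
        INC     ECX
        CMP     EAX, EBX
        JBE     XX              ; both flags come from CMP
```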

Unexpectedly (and contrary to what Intel manuals say) you also get a partial flags stall after an instruction that modifies some of the flag bits when reading only unmodified flag bits:

    CMP     EAX, EBX
    INC     ECX
    JC      XX                  ; partial flags stall

but not when reading only modified bits:

    CMP     EAX, EBX
    INC     ECX
    JZ      XX                  ; no stall

Partial flags stalls are likely to occur on instructions that read many or all flags bits, i.e. LAHF, PUSHF, PUSHFD. The following instructions cause partial flags stalls when followed by LAHF or PUSHF(D): INC, DEC, TEST, bit tests, bit scan, CLC, STC, CMC, CLD, STD, CLI, STI, MUL, IMUL, and all shifts and rotates. The following instructions do not cause partial flags stalls: AND, OR, XOR, ADD, ADC, SUB, SBB, CMP, NEG. It is strange that TEST and AND behave differently while, by definition, they do exactly the same thing to the flags. You may use a SETcc instruction instead of LAHF or PUSHF(D) for storing the value of a flag in order to avoid a stall.

Examples:

    INC EAX / PUSHFD                    ; stall
    ADD EAX,1 / PUSHFD                  ; no stall
    SHR EAX,1 / PUSHFD                  ; stall
    SHR EAX,1 / OR EAX,EAX / PUSHFD     ; no stall
    TEST EBX,EBX / LAHF                 ; stall
    AND EBX,EBX / LAHF                  ; no stall
    TEST EBX,EBX / SETZ AL              ; no stall
    CLC / SETZ AL                       ; stall
    CLD / SETZ AL                       ; no stall

The penalty for partial flags stalls is approximately 4 clocks.
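As a sketch of the SETcc suggestion, the fragment below keeps the zero flag from a TEST in a byte register instead of saving all flags with LAHF or PUSHFD. The choice of DL, the label XX and the intervening work are assumptions for illustration. The saved value is read back as a byte, so that reading it does not introduce a partial register stall on EDX.

```asm
        TEST    EBX, EBX
        SETZ    DL              ; DL = 1 if EBX is zero, else 0
        ; ... other instructions that may change the flags, but not DL ...
        CMP     DL, 0           ; CMP writes all the arithmetic flags
        JNE     XX              ; taken if EBX was zero at the time of the TEST
```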

Flags stalls after shifts and rotates

You can get a stall resembling the partial flags stall when reading any flag bit after a shift or rotate, except for shifts and rotates by one (short form):

    SHR EAX,1 / JZ XX                   ; no stall
    SHR EAX,2 / JZ XX                   ; stall
    SHR EAX,2 / OR EAX,EAX / JZ XX      ; no stall
    SHR EAX,5 / JC XX                   ; stall
    SHR EAX,4 / SHR EAX,1 / JC XX       ; no stall
    SHR EAX,CL / JZ XX                  ; stall, even if CL = 1
    SHRD EAX,EBX,1 / JZ XX              ; stall
    ROL EBX,8 / JC XX                   ; stall

The penalty for these stalls is approximately 4 clocks.
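The no-stall combinations listed above suggest two workarounds, sketched here. Applying the OR trick to a shift by CL is an extrapolation from the SHR EAX,2 / OR EAX,EAX example rather than a case listed in the text, so treat it as an assumption to be verified by measurement. The label XX is carried over from the examples.

```asm
; Zero test after a multi-bit or variable shift: let OR rewrite the flags.
; OR clears the carry flag, so this does not work for a carry test.
        SHR     EAX, CL
        OR      EAX, EAX        ; sets ZF and SF from EAX; EAX itself is unchanged
        JZ      XX

; Carry test after a shift by 5: make the final shift a shift by one.
        SHR     EAX, 4
        SHR     EAX, 1          ; CF = last bit shifted out, same as after SHR EAX,5
        JC      XX
```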

14.7 Partial memory stalls

A partial memory stall is somewhat analogous to a partial register stall. It occurs when you mix data sizes for the same memory address: