Fog A.How to optimize for the Pentium family of microprocessors.2004.pdf

    MOV BYTE PTR [ESI], AL
    MOV EBX, DWORD PTR [ESI]        ; partial memory stall

Here you get a stall because the processor has to combine the byte written from AL with the next three bytes, which were already in memory, to get the four bytes needed for reading into EBX. The penalty is approximately 7-8 clocks.

Unlike the partial register stalls, you also get a partial memory stall when you write a bigger operand to memory and then read part of it, if the smaller part doesn't start at the same address:

    MOV DWORD PTR [ESI], EAX
    MOV BL, BYTE PTR [ESI]          ; no stall
    MOV BH, BYTE PTR [ESI+1]        ; stall

You can avoid this stall by changing the last line to MOV BH,AH, but such a solution is not possible in a situation like this:

    FISTP QWORD PTR [EDI]
    MOV EAX, DWORD PTR [EDI]
    MOV EDX, DWORD PTR [EDI+4]      ; stall
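The store-then-load cases above can be condensed into one rule of thumb. The following C function is only a simplified model of that rule, not a description of the actual hardware logic: it assumes the store is still pending when the load executes, and the function name and parameters are hypothetical. A load is treated as stall-free only when it starts at the same address as the earlier store and is no wider than that store.

```c
#include <stdbool.h>
#include <stdint.h>

/* Simplified model (an assumption, not the real hardware logic):
   returns true if a load that follows a pending store is expected to
   cause a partial memory stall. Addresses and sizes are in bytes. */
bool partial_memory_stall(uint32_t store_addr, uint32_t store_size,
                          uint32_t load_addr, uint32_t load_size)
{
    /* Use 64-bit sums so that address + size cannot wrap around. */
    uint64_t store_end = (uint64_t)store_addr + store_size;
    uint64_t load_end  = (uint64_t)load_addr + load_size;

    /* If the byte ranges do not overlap, this rule does not apply. */
    bool overlap = store_addr < load_end && load_addr < store_end;
    if (!overlap)
        return false;

    /* Same start address and the load fits inside the store: no stall. */
    if (load_addr == store_addr && load_size <= store_size)
        return false;

    /* Load is wider than the store, or starts inside the stored data. */
    return true;
}
```

With this model, a byte store followed by a DWORD load at the same address reports a stall, while a DWORD store followed by a byte load at the same address does not, in agreement with the listings above.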

Interestingly, you can also get a partial memory stall when writing and reading completely different addresses if they happen to have the same set-value in different cache banks:

    MOV BYTE PTR [ESI], AL
    MOV EBX, DWORD PTR [ESI+4092]   ; no stall
    MOV ECX, DWORD PTR [ESI+4096]   ; stall
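To check whether two accesses in your own code might collide in this way, a rough test is to compare the addresses modulo 4096. The C sketch below rests on an assumption inferred from the example above (an offset of 4096 bytes stalls, an offset of 4092 does not), namely that the set-value repeats every 4096 bytes. The function name and the constant are hypothetical, and accesses that cross a 4096-byte boundary are not handled.

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumption: the set-value is determined by the address modulo 4096. */
#define SET_SPAN 4096u

/* Returns true if the store and the load touch overlapping byte
   positions once both addresses are reduced modulo SET_SPAN.
   Note that this is also true when the real addresses overlap. */
bool same_set_overlap(uint32_t store_addr, uint32_t store_size,
                      uint32_t load_addr, uint32_t load_size)
{
    uint32_t s = store_addr % SET_SPAN;   /* store position within the span */
    uint32_t l = load_addr % SET_SPAN;    /* load position within the span  */

    /* Ordinary interval-overlap test on the reduced positions. */
    return s < l + load_size && l < s + store_size;
}
```

For a byte store at some address A, the model reports no collision for a DWORD load at A+4092 and a collision for a DWORD load at A+4096, matching the two comments in the listing.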

14.8 Bottlenecks in PPro, P2, P3

When optimizing code for these processors, it is important to analyze where the bottlenecks are. Spending time on optimizing away one bottleneck doesn't make sense if another bottleneck is narrower.

If you expect code cache misses, then you should restructure your code to keep the most used parts of code together.

If you expect many data cache misses, then forget about everything else and concentrate on how to restructure your data to reduce the number of cache misses (page 29), and avoid long dependence chains after a data read cache miss.

If you have many divisions, then try to reduce them (page 116) and make sure the processor has something else to do during the divisions.

Dependence chains tend to hamper out-of-order execution (page 34). Try to break long dependence chains, especially if they contain slow instructions such as multiplication, division, and floating-point instructions.
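A common way to break a long dependence chain is to split one accumulator into several independent ones. The C sketch below illustrates the idea with a floating-point sum; the function names are hypothetical, and whether the compiler keeps the four accumulators in separate registers is an assumption that should be checked in the generated code. Because floating-point addition is not associative, the two functions can differ in the last bits of the result for general input.

```c
#include <stddef.h>

/* One accumulator: every addition must wait for the previous one. */
double sum_single_chain(const double *x, size_t n)
{
    double s = 0.0;
    for (size_t i = 0; i < n; i++)
        s += x[i];
    return s;
}

/* Four accumulators: four shorter chains that do not depend on
   each other, so their additions can overlap in time. */
double sum_four_chains(const double *x, size_t n)
{
    double s0 = 0.0, s1 = 0.0, s2 = 0.0, s3 = 0.0;
    size_t i = 0;

    for (; i + 4 <= n; i += 4) {
        s0 += x[i];
        s1 += x[i + 1];
        s2 += x[i + 2];
        s3 += x[i + 3];
    }
    /* Handle the remaining 0-3 elements. */
    for (; i < n; i++)
        s0 += x[i];

    /* Combine the partial sums pairwise to keep the final chain short. */
    return (s0 + s1) + (s2 + s3);
}
```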

If you have many jumps, calls, or returns, and especially if the jumps are poorly predictable, then check whether some of them can be avoided. Replace poorly predictable conditional jumps with conditional moves if possible, and replace small procedures with macros (page 50).
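As an illustration of removing a poorly predictable conditional jump, the C sketch below computes a maximum in two ways. The function names are hypothetical. A compiler may or may not translate the first version into a conditional move; that is not guaranteed. The second version contains no branch in the source at all, because it selects the result with a bit mask.

```c
#include <stdint.h>

/* Straightforward version: may compile to a conditional jump. */
int32_t max_branch(int32_t a, int32_t b)
{
    if (a < b)
        return b;
    return a;
}

/* Branch-free version: the comparison yields 0 or 1, which is turned
   into a mask of all zeros or all ones and used to select a or b. */
int32_t max_branchless(int32_t a, int32_t b)
{
    int32_t mask = -(int32_t)(a < b);   /* all ones if a < b, else zero */
    return (a & ~mask) | (b & mask);
}
```

The branch-free form is only worth using when the condition is genuinely hard to predict; for well-predicted jumps the plain version is usually at least as fast.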

If you are mixing different data sizes (8-, 16-, and 32-bit integers), then look out for partial stalls. If you use PUSHF or LAHF instructions, then look out for partial flags stalls. Avoid testing flags after shifts or rotates by more than 1 (page 71).