- •Introduction
- •Assembly language syntax
- •Microprocessor versions covered by this manual
- •Getting started with optimization
- •Speed versus program clarity and security
- •Choice of programming language
- •Choice of algorithm
- •Memory model
- •Finding the hot spots
- •Literature
- •Optimizing in C++
- •Use optimization options
- •Identify the most critical parts of your code
- •Break dependence chains
- •Use local variables
- •Use array of structures rather than structure of arrays
- •Alignment of data
- •Division
- •Function calls
- •Conversion from floating-point numbers to integers
- •Character arrays versus string objects
- •Combining assembly and high level language
- •Inline assembly
- •Calling conventions
- •Data storage in C++
- •Register usage in 16 bit mode DOS or Windows
- •Register usage in 32 bit Windows
- •Register usage in Linux
- •Making compiler-independent code
- •Adding support for multiple compilers in .asm modules
- •Further compiler incompatibilities
- •Object file formats
- •Using MASM under Linux
- •Object oriented programming
- •Other high level languages
- •Debugging and verifying assembly code
- •Reducing code size
- •Detecting processor type
- •Checking for operating system support for XMM registers
- •Alignment
- •Cache
- •First time versus repeated execution
- •Out-of-order execution (PPro, P2, P3, P4)
- •Instructions are split into uops
- •Register renaming
- •Dependence chains
- •Branch prediction (all processors)
- •Prediction methods for conditional jumps
- •Branch prediction in P1
- •Branch prediction in PMMX, PPro, P2, and P3
- •Branch prediction in P4
- •Indirect jumps (all processors)
- •Returns (all processors except P1)
- •Static prediction
- •Close jumps
- •Avoiding jumps (all processors)
- •Optimizing for P1 and PMMX
- •Pairing integer instructions
- •Address generation interlock
- •Splitting complex instructions into simpler ones
- •Prefixes
- •Scheduling floating-point code
- •Optimizing for PPro, P2, and P3
- •The pipeline in PPro, P2 and P3
- •Register renaming
- •Register read stalls
- •Out of order execution
- •Retirement
- •Partial register stalls
- •Partial memory stalls
- •Bottlenecks in PPro, P2, P3
- •Optimizing for P4
- •Trace cache
- •Instruction decoding
- •Execution units
- •Do the floating-point and MMX units run at half speed?
- •Transfer of data between execution units
- •Retirement
- •Partial registers and partial flags
- •Partial memory access
- •Memory intermediates in dependencies
- •Breaking dependencies
- •Choosing the optimal instructions
- •Bottlenecks in P4
- •Loop optimization (all processors)
- •Loops in P1 and PMMX
- •Loops in PPro, P2, and P3
- •Loops in P4
- •Macro loops (all processors)
- •Single-Instruction-Multiple-Data programming
- •Problematic Instructions
- •XCHG (all processors)
- •Shifts and rotates (P4)
- •Rotates through carry (all processors)
- •String instructions (all processors)
- •Bit test (all processors)
- •Integer multiplication (all processors)
- •Division (all processors)
- •LEA instruction (all processors)
- •WAIT instruction (all processors)
- •FCOM + FSTSW AX (all processors)
- •FPREM (all processors)
- •FRNDINT (all processors)
- •FSCALE and exponential function (all processors)
- •FPTAN (all processors)
- •FSQRT (P3 and P4)
- •FLDCW (PPro, P2, P3, P4)
- •Bit scan (P1 and PMMX)
- •Special topics
- •Freeing floating-point registers (all processors)
- •Transitions between floating-point and MMX instructions (PMMX, P2, P3, P4)
- •Converting from floating-point to integer (All processors)
- •Using integer instructions for floating-point operations
- •Using floating-point instructions for integer operations
- •Moving blocks of data (All processors)
- •Self-modifying code (All processors)
- •Testing speed
- •List of instruction timings for P1 and PMMX
- •Integer instructions (P1 and PMMX)
- •Floating-point instructions (P1 and PMMX)
- •MMX instructions (PMMX)
- •List of instruction timings and uop breakdown for PPro, P2 and P3
- •Integer instructions (PPro, P2 and P3)
- •Floating-point instructions (PPro, P2 and P3)
- •MMX instructions (P2 and P3)
- •List of instruction timings and uop breakdown for P4
- •integer instructions
- •Floating-point instructions
- •SIMD integer instructions
- •SIMD floating-point instructions
- •Comparison of the different microprocessors
MOV |
BYTE |
PTR [ESI], AL |
|
MOV |
EBX, |
DWORD PTR [ESI] |
; partial memory stall |
Here you get a stall because the processor has to combine the byte written from AL with the next three bytes, which were in memory before, to get the four bytes needed for reading into EBX. The penalty is approximately 7 - 8 clocks.
Unlike the partial register stalls, you also get a partial memory stall when you write a bigger operand to memory and then read part of it, if the smaller part doesn't start at the same address:
MOV DWORD PTR [ESI], EAX |
|
|
||||
MOV |
BL, |
BYTE |
PTR |
[ESI] |
; |
no stall |
MOV |
BH, |
BYTE |
PTR |
[ESI+1] |
; |
stall |
You can avoid this stall by changing the last line to MOV BH,AH, but such a solution is not possible in a situation like this:
FISTP QWORD PTR [EDI]
MOV |
EAX, |
DWORD |
PTR |
[EDI] |
|
MOV |
EDX, |
DWORD |
PTR |
[EDI+4] |
; stall |
Interestingly, you can also get a partial memory stall when writing and reading completely different addresses if they happen to have the same set-value in different cache banks:
MOV BYTE PTR [ESI], AL |
|
|
||||
MOV |
EBX, |
DWORD |
PTR |
[ESI+4092] |
; |
no stall |
MOV |
ECX, |
DWORD |
PTR |
[ESI+4096] |
; |
stall |
14.8 Bottlenecks in PPro, P2, P3
When optimizing code for these processors, it is important to analyze where the bottlenecks are. Spending time on optimizing away one bottleneck doesn't make sense if another bottleneck is narrower.
If you expect code cache misses, then you should restructure your code to keep the most used parts of code together.
If you expect many data cache misses, then forget about everything else and concentrate on how to restructure your data to reduce the number of cache misses (page 29), and avoid long dependence chains after a data read cache miss.
If you have many divisions, then try to reduce them (page 116) and make sure the processor has something else to do during the divisions.
Dependence chains tend to hamper out-of-order execution (page 34). Try to break long dependence chains, especially if they contain slow instructions such as multiplication, division, and floating-point instructions.
If you have many jumps, calls, or returns, and especially if the jumps are poorly predictable, then try if some of them can be avoided. Replace poorly predictable conditional jumps with conditional moves if possible, and replace small procedures with macros (page 50).
If you are mixing different data sizes (8, 16, and 32 bit integers) then look out for partial stalls. If you use PUSHF or LAHF instructions then look out for partial flags stalls. Avoid testing flags after shifts or rotates by more than 1 (page 71).