24 Comparison of the different microprocessors
The following table summarizes some important differences between the microprocessors in the Pentium family:
| | P1 | PMMX | PPro | P2 | P3 | P4 |
|---|---|---|---|---|---|---|
| code cache, KB | 8 | 16 | 8 | 16 | 16 | ≈ 60 |
| code cache associativity, ways | 2 | 4 | 4 | 4 | 4 | 4 |
| data cache, KB | 8 | 16 | 8 | 16 | 16 | 8 |
| data cache associativity, ways | 2 | 4 | 2 | 4 | 4 | 4 |
| data cache line size, bytes | 32 | 32 | 32 | 32 | 32 | 64 |
| built-in level 2 cache, KB | 0 | 0 | 256 *) | 256 *) | 256 *) | 256 *) |
| level 2 cache associativity, ways | 0 | 0 | 4 | 4 | 8 | 8 |
| level 2 cache bus size, bits | 0 | 0 | 64 | 64 | 256 | 256 |
| MMX instructions | no | yes | no | yes | yes | yes |
| XMM instructions | no | no | no | no | yes | yes |
| conditional move instructions | no | no | yes | yes | yes | yes |
| out of order execution | no | no | yes | yes | yes | yes |
| branch prediction | poor | good | good | good | good | good |
| branch target buffer entries | 256 | 256 | 512 | 512 | 512 | 4096 |
| return stack buffer size | 0 | 4 | 16 | 16 | 16 | 16 |
| branch misprediction penalty, clocks | 3-4 | 4-5 | 10-20 | 10-20 | 10-20 | ≥ 24 |
| partial register stall, clocks | 0 | 0 | 5 | 5 | 5 | 0 |
| FMUL latency, clocks | 3 | 3 | 5 | 5 | 5 | 6-7 |
| FMUL reciprocal throughput, clocks | 2 | 2 | 2 | 2 | 2 | 1 |
| IMUL latency, clocks | 9 | 9 | 4 | 4 | 4 | 14 |
| IMUL reciprocal throughput, clocks | 9 | 9 | 1 | 1 | 1 | 5-10 |
*) Celeron: 0-128, Xeon: 512 or more, many other variants available. On some versions the level 2 cache runs at half speed.
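
As a rough illustration of how the latency and reciprocal throughput rows can be read together, the sketch below estimates clock counts for a run of multiplications under a deliberately simplified model: a chain where each operation needs the previous result costs about N × latency, while N independent operations cost about latency + (N − 1) × reciprocal throughput. The `Timing` struct, the function names, and the model itself are assumptions made for illustration and are not taken from the manual; real timings also depend on decoding, surrounding instructions, and cache behaviour. The figures plugged in are the IMUL values for P1 and for PPro/P2/P3 from the table.

```cpp
#include <cstdio>

// Hypothetical helper type: timing figures for one instruction, in clocks.
struct Timing {
    int latency;                // clocks until the result can be used
    int reciprocal_throughput;  // clocks between starts of independent instructions
};

// Simplified model: every operation waits for the result of the previous one.
constexpr int dependent_chain_clocks(Timing t, int n) {
    return n > 0 ? n * t.latency : 0;
}

// Simplified model: no operation depends on another, so they overlap.
constexpr int independent_clocks(Timing t, int n) {
    return n > 0 ? t.latency + (n - 1) * t.reciprocal_throughput : 0;
}

// IMUL figures copied from the comparison table.
constexpr Timing imul_p1   = {9, 9};  // P1 (PMMX has the same values)
constexpr Timing imul_ppro = {4, 1};  // PPro, P2 and P3

// Print the estimates for a run of n multiplications.
void print_estimates(int n) {
    std::printf("P1:   dependent %d, independent %d clocks\n",
                dependent_chain_clocks(imul_p1, n),
                independent_clocks(imul_p1, n));
    std::printf("PPro: dependent %d, independent %d clocks\n",
                dependent_chain_clocks(imul_ppro, n),
                independent_clocks(imul_ppro, n));
}
```

Calling `print_estimates(8)` gives 72 clocks for P1 in both cases, and 32 versus 11 clocks for PPro/P2/P3. Under this model, independent multiplications gain nothing on P1, where latency equals reciprocal throughput, but overlap well on the PPro family.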