- •Introduction
- •Assembly language syntax
- •Microprocessor versions covered by this manual
- •Getting started with optimization
- •Speed versus program clarity and security
- •Choice of programming language
- •Choice of algorithm
- •Memory model
- •Finding the hot spots
- •Literature
- •Optimizing in C++
- •Use optimization options
- •Identify the most critical parts of your code
- •Break dependence chains
- •Use local variables
- •Use array of structures rather than structure of arrays
- •Alignment of data
- •Division
- •Function calls
- •Conversion from floating-point numbers to integers
- •Character arrays versus string objects
- •Combining assembly and high level language
- •Inline assembly
- •Calling conventions
- •Data storage in C++
- •Register usage in 16 bit mode DOS or Windows
- •Register usage in 32 bit Windows
- •Register usage in Linux
- •Making compiler-independent code
- •Adding support for multiple compilers in .asm modules
- •Further compiler incompatibilities
- •Object file formats
- •Using MASM under Linux
- •Object oriented programming
- •Other high level languages
- •Debugging and verifying assembly code
- •Reducing code size
- •Detecting processor type
- •Checking for operating system support for XMM registers
- •Alignment
- •Cache
- •First time versus repeated execution
- •Out-of-order execution (PPro, P2, P3, P4)
- •Instructions are split into uops
- •Register renaming
- •Dependence chains
- •Branch prediction (all processors)
- •Prediction methods for conditional jumps
- •Branch prediction in P1
- •Branch prediction in PMMX, PPro, P2, and P3
- •Branch prediction in P4
- •Indirect jumps (all processors)
- •Returns (all processors except P1)
- •Static prediction
- •Close jumps
- •Avoiding jumps (all processors)
- •Optimizing for P1 and PMMX
- •Pairing integer instructions
- •Address generation interlock
- •Splitting complex instructions into simpler ones
- •Prefixes
- •Scheduling floating-point code
- •Optimizing for PPro, P2, and P3
- •The pipeline in PPro, P2 and P3
- •Register renaming
- •Register read stalls
- •Out of order execution
- •Retirement
- •Partial register stalls
- •Partial memory stalls
- •Bottlenecks in PPro, P2, P3
- •Optimizing for P4
- •Trace cache
- •Instruction decoding
- •Execution units
- •Do the floating-point and MMX units run at half speed?
- •Transfer of data between execution units
- •Retirement
- •Partial registers and partial flags
- •Partial memory access
- •Memory intermediates in dependencies
- •Breaking dependencies
- •Choosing the optimal instructions
- •Bottlenecks in P4
- •Loop optimization (all processors)
- •Loops in P1 and PMMX
- •Loops in PPro, P2, and P3
- •Loops in P4
- •Macro loops (all processors)
- •Single-Instruction-Multiple-Data programming
- •Problematic Instructions
- •XCHG (all processors)
- •Shifts and rotates (P4)
- •Rotates through carry (all processors)
- •String instructions (all processors)
- •Bit test (all processors)
- •Integer multiplication (all processors)
- •Division (all processors)
- •LEA instruction (all processors)
- •WAIT instruction (all processors)
- •FCOM + FSTSW AX (all processors)
- •FPREM (all processors)
- •FRNDINT (all processors)
- •FSCALE and exponential function (all processors)
- •FPTAN (all processors)
- •FSQRT (P3 and P4)
- •FLDCW (PPro, P2, P3, P4)
- •Bit scan (P1 and PMMX)
- •Special topics
- •Freeing floating-point registers (all processors)
- •Transitions between floating-point and MMX instructions (PMMX, P2, P3, P4)
- •Converting from floating-point to integer (All processors)
- •Using integer instructions for floating-point operations
- •Using floating-point instructions for integer operations
- •Moving blocks of data (All processors)
- •Self-modifying code (All processors)
- •Testing speed
- •List of instruction timings for P1 and PMMX
- •Integer instructions (P1 and PMMX)
- •Floating-point instructions (P1 and PMMX)
- •MMX instructions (PMMX)
- •List of instruction timings and uop breakdown for PPro, P2 and P3
- •Integer instructions (PPro, P2 and P3)
- •Floating-point instructions (PPro, P2 and P3)
- •MMX instructions (P2 and P3)
- •List of instruction timings and uop breakdown for P4
- •integer instructions
- •Floating-point instructions
- •SIMD integer instructions
- •SIMD floating-point instructions
- •Comparison of the different microprocessors
goes backwards (e.g. a loop). Static prediction takes longer time than dynamic prediction on these processors.
On the P4, you can change the static prediction by adding prediction hint prefixes. The prefix 3EH will make the branch predicted taken the first time, and prefix 2EH will make it predicted not taken the first time. These prefixes can be coded in this way:
DB |
3EH |
; |
prediction hint |
prefix |
JBE |
LL |
; |
predicted taken |
first time |
The prediction hint prefixes are in fact segment prefixes, which have no effect and cause no harm on previous processors.
It is rarely worth the effort to take static prediction into account. Almost any branch that is executed sufficiently often for its timing to have any significant effect is likely to stay in the BTB so that only the dynamic prediction counts. Static prediction only has a significant effect if context switches or task switches occur very often.
Normally you don't have to care about the penalty of static mispredictions. It is more important to organize branches so that the most common path is not taken, because this improves code prefetching, trace cache use, and retirement.
Static prediction does have an influence on the way traces are organized in the trace cache, but this is not a lasting effect because traces may be reorganized after several iterations.
12.8 Close jumps
Close jumps on PMMX
On the PMMX, there is a risk that two control transfer instructions will share the same BTB entry if they are too close to each other. The obvious result is that they will always be mispredicted. The BTB entry for a control transfer instruction is identified by bits 2-31 of the address of the last byte in the instruction. If two control transfer instructions are so close together that they differ only in bits 0-1 of the address, then we have the problem of a shared BTB entry. The RET instruction is particularly prone to this problem because it is only one byte long. There are various ways to solve this problem:
1.Move the code sequence a little up or down in memory so that you get a DWORD boundary between the two addresses.
2.Change a short jump to a near jump (with 4 bytes displacement) so that the end of the instruction is moved further down. There is no way you can force the assembler to use anything but the shortest form of an instruction so you have to hard-code the near jump if you choose this solution.
3.Put in some instruction between the two control transfer instructions. This is the easiest method, and the only method if you don't know where DWORD boundaries are because your segment is not DWORD aligned or because the code keeps moving up and down as you make changes in the preceding code.
There is a penalty when the first instruction pair following the target label of a call contains another call instruction or if a return follows immediately after another return.
The penalty for chained calls only occurs when the same subroutines are called from more than one location. Chained returns always have a penalty. There is sometimes a small stall for a jump after a call, but no penalty for return after call; call after return; jump, call, or return after jump; or jump after return.
Chained jumps on PPro, P2 and P3
A jump, call, or return cannot be executed in the first clock cycle after a previous jump, call, or return on the PPro, P2 and P3. Therefore, chained jumps will take two clock cycles for each jump, and you may want to make sure that the processor has something else to do in parallel. For the same reason, a loop will take at least two clock cycles per iteration on these processors.
Chained jumps on P4
The retirement station can handle only one taken jump, call or return per clock cycle, and only in the first of the three retirement slots. Therefore, preferably, no more than every third uop should be a jump.
12.9 Avoiding jumps (all processors)
There can be many reasons why you may want to reduce the number of branches, jumps, calls and returns:
•jump mispredictions are very expensive.
•jump instructions may push one another out of the branch target buffer.
•on the P4, branches may interfere with each other in the global pattern history table.
•on the P4, branches fill up the global branch history register. This may reduce the predictability of subsequent branches.
•there are various penalties for consecutive or chained jumps, depending on the processor.
•a return takes 2 clocks on P1 and PMMX, calls and returns generate 3 - 4 uops on PPro, P2, P3 and P4.
•on PPro, P2 and P3, instruction fetch may be delayed after a jump.
•on PPro, P2, P3 and P4, retirement is less effective for taken jumps then for other uops.
•on P4, the utilization of the trace cache and the delivery from the trace cache is less effective if the code contains many branches.
Eliminating unconditional jumps and calls
Calls and returns can be avoided by replacing small procedures with inline macros. And in many cases it is possible to reduce the number of jumps by restructuring the code. For example, a jump to a jump should be replaced by a jump to the final target. In some cases this is even possible with conditional jumps if the condition is the same or is known. A jump to a return can be replaced by a return. If you want to eliminate a return to a return, then you should not manipulate the stack pointer because this would interfere with the prediction mechanism of the return stack buffer. Instead, you can replace the preceding call with a jump. For example CALL MYPROC / RET can be replaced by JMP MYPROC if MYPROC ends with the same kind of RET.
You may also eliminate a jump by duplicating the code jumped to. This can be useful if you have a two-way branch inside a loop or before a return. Example:
A: CMP [EAX+4*EDX],ECX
JE B
CALL X
JMP C
B: |
CALL |
Y |
C: |
ADD |
EDX, 1 |
|
JNZ |
A |
|
MOV |
ESP, EBP |
|
POP |
EBP |
|
RET |
|
The jump to C may be eliminated by duplicating the loop epilog:
A: |
CMP |
[EAX+4*EDX],ECX |
|
JE |
B |
|
CALL |
X |
|
ADD |
EDX, 1 |
|
JNZ |
A |
|
JMP |
D |
B: |
CALL |
Y |
C: |
ADD |
EDX, 1 |
|
JNZ |
A |
D: |
MOV |
ESP, EBP |
|
POP |
EBP |
|
RET |
|
The most often executed branch should come first here. The jump to D is outside the loop and therefore less critical. If this jump is executed so often that it needs optimizing too, then replace it with the three instructions following D.
Tricks to avoid conditional jumps (all processors)
The most important jumps to eliminate are conditional jumps, especially if they are poorly predictable. Sometimes it is possible to obtain the same effect as a branch by ingenious manipulation of bits and flags. For example you may calculate the absolute value of a signed integer without branching:
CDQ |
|
; copy sign of EAX to all bits of EDX |
|
XOR EAX,EDX |
; |
toggle all bits if negative |
|
SUB |
EAX,EDX |
; |
add 1 if negative |
The carry flag is particularly useful for this kind of tricks: Setting carry if a value is zero: CMP [VALUE],1
Setting carry if a value is not zero: SUB EAX,EAX / CMP EAX,[VALUE] Incrementing a counter if carry: ADC EAX,0
Setting a bit for each time the carry is set: RCL EAX,1 Generating a bit mask if carry is set: SBB EAX,EAX Setting a bit on an arbitrary condition: SETcond AL
Setting all bits on an arbitrary condition: SUB EAX,EAX / SETcond AL / NEG EAX
The following example finds the minimum of two unsigned numbers: if (b > a) b = a;
SUB EAX,EBX
SBB EDX,EDX
AND EDX,EAX
ADD EBX,EDX
Or, for signed numbers:
SUB EAX,EBX
CDQ
AND EDX,EAX
ADD EBX,EDX
The next example chooses between two numbers: if (a < 0) d = b; else d = c;