- Introduction
- Assembly language syntax
- Microprocessor versions covered by this manual
- Getting started with optimization
- Speed versus program clarity and security
- Choice of programming language
- Choice of algorithm
- Memory model
- Finding the hot spots
- Literature
- Optimizing in C++
- Use optimization options
- Identify the most critical parts of your code
- Break dependence chains
- Use local variables
- Use array of structures rather than structure of arrays
- Alignment of data
- Division
- Function calls
- Conversion from floating-point numbers to integers
- Character arrays versus string objects
- Combining assembly and high level language
- Inline assembly
- Calling conventions
- Data storage in C++
- Register usage in 16 bit mode DOS or Windows
- Register usage in 32 bit Windows
- Register usage in Linux
- Making compiler-independent code
- Adding support for multiple compilers in .asm modules
- Further compiler incompatibilities
- Object file formats
- Using MASM under Linux
- Object oriented programming
- Other high level languages
- Debugging and verifying assembly code
- Reducing code size
- Detecting processor type
- Checking for operating system support for XMM registers
- Alignment
- Cache
- First time versus repeated execution
- Out-of-order execution (PPro, P2, P3, P4)
- Instructions are split into uops
- Register renaming
- Dependence chains
- Branch prediction (all processors)
- Prediction methods for conditional jumps
- Branch prediction in P1
- Branch prediction in PMMX, PPro, P2, and P3
- Branch prediction in P4
- Indirect jumps (all processors)
- Returns (all processors except P1)
- Static prediction
- Close jumps
- Avoiding jumps (all processors)
- Optimizing for P1 and PMMX
- Pairing integer instructions
- Address generation interlock
- Splitting complex instructions into simpler ones
- Prefixes
- Scheduling floating-point code
- Optimizing for PPro, P2, and P3
- The pipeline in PPro, P2 and P3
- Register renaming
- Register read stalls
- Out of order execution
- Retirement
- Partial register stalls
- Partial memory stalls
- Bottlenecks in PPro, P2, P3
- Optimizing for P4
- Trace cache
- Instruction decoding
- Execution units
- Do the floating-point and MMX units run at half speed?
- Transfer of data between execution units
- Retirement
- Partial registers and partial flags
- Partial memory access
- Memory intermediates in dependencies
- Breaking dependencies
- Choosing the optimal instructions
- Bottlenecks in P4
- Loop optimization (all processors)
- Loops in P1 and PMMX
- Loops in PPro, P2, and P3
- Loops in P4
- Macro loops (all processors)
- Single-Instruction-Multiple-Data programming
- Problematic Instructions
- XCHG (all processors)
- Shifts and rotates (P4)
- Rotates through carry (all processors)
- String instructions (all processors)
- Bit test (all processors)
- Integer multiplication (all processors)
- Division (all processors)
- LEA instruction (all processors)
- WAIT instruction (all processors)
- FCOM + FSTSW AX (all processors)
- FPREM (all processors)
- FRNDINT (all processors)
- FSCALE and exponential function (all processors)
- FPTAN (all processors)
- FSQRT (P3 and P4)
- FLDCW (PPro, P2, P3, P4)
- Bit scan (P1 and PMMX)
- Special topics
- Freeing floating-point registers (all processors)
- Transitions between floating-point and MMX instructions (PMMX, P2, P3, P4)
- Converting from floating-point to integer (All processors)
- Using integer instructions for floating-point operations
- Using floating-point instructions for integer operations
- Moving blocks of data (All processors)
- Self-modifying code (All processors)
- Testing speed
- List of instruction timings for P1 and PMMX
- Integer instructions (P1 and PMMX)
- Floating-point instructions (P1 and PMMX)
- MMX instructions (PMMX)
- List of instruction timings and uop breakdown for PPro, P2 and P3
- Integer instructions (PPro, P2 and P3)
- Floating-point instructions (PPro, P2 and P3)
- MMX instructions (P2 and P3)
- List of instruction timings and uop breakdown for P4
- Integer instructions
- Floating-point instructions
- SIMD integer instructions
- SIMD floating-point instructions
- Comparison of the different microprocessors
On P2 and P3 it is advantageous to use MMX registers for moving 8 bytes at a time if the above conditions are not met and the destination is likely to be in the level 1 cache. The loop may be unrolled by two.
On the P3, the fastest way of moving data is to use the MOVAPS instruction if the conditions on page 114 are not met or if the destination is in the level 1 or level 2 cache:
        SUB     EDI, ESI
TOP:    MOVAPS  XMM0, [ESI]
        MOVAPS  [ESI+EDI], XMM0
        ADD     ESI, 16
        DEC     ECX
        JNZ     TOP
On the P3 you also have the option of writing directly to RAM without involving the cache by using the MOVNTQ or MOVNTPS instruction. This can be useful if you don't want the destination to go into a cache. MOVNTPS is only slightly faster than MOVNTQ.
On the P4, the fastest way of moving blocks of data is to use MOVDQA. You may use MOVNTDQ if you don't want the destination to be cached, but MOVDQA is often faster. REP MOVSD may still be the best choice for small blocks of data if the block size is varying and a loop would suffer a branch misprediction.
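If you are working from C rather than assembly, the choice between MOVDQA and MOVNTDQ can be sketched with SSE2 compiler intrinsics. This is only an illustration under stated assumptions, not code from this manual: the helper name copy_blocks_sse2 and its bypass_cache flag are made up for the example, both pointers are assumed to be 16-byte aligned, and the size is given in blocks of 16 bytes. Where SSE2 is not available the sketch simply falls back to memcpy.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#if defined(__SSE2__) || defined(_M_X64)
#include <emmintrin.h>              /* SSE2 intrinsics */
#define HAVE_SSE2 1
#else
#define HAVE_SSE2 0
#endif

/* Hypothetical helper: copy nblocks * 16 bytes from src to dst.
   Both pointers must be 16-byte aligned.
   bypass_cache != 0 uses non-temporal stores (corresponding to MOVNTDQ),
   otherwise ordinary aligned stores (corresponding to MOVDQA). */
static void copy_blocks_sse2(void *dst, const void *src,
                             size_t nblocks, int bypass_cache)
{
    assert(((uintptr_t)dst & 15) == 0);
    assert(((uintptr_t)src & 15) == 0);
#if HAVE_SSE2
    __m128i *d = (__m128i *)dst;
    const __m128i *s = (const __m128i *)src;
    size_t i;
    if (bypass_cache) {
        for (i = 0; i < nblocks; i++) {
            /* aligned load, then store that bypasses the cache */
            _mm_stream_si128(d + i, _mm_load_si128(s + i));
        }
        _mm_sfence();               /* order the non-temporal stores */
    } else {
        for (i = 0; i < nblocks; i++) {
            /* aligned load and aligned store through the cache */
            _mm_store_si128(d + i, _mm_load_si128(s + i));
        }
    }
#else
    (void)bypass_cache;
    memcpy(dst, src, nblocks * 16); /* portable fallback */
#endif
}
```

Calling copy_blocks_sse2(dst, src, 16, 0) copies 256 bytes through the cache, while passing 1 as the last argument requests the non-temporal variant. Which of the two is faster for a given block size should be measured rather than assumed.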
For further advice on improving memory access, see the Intel Pentium 4 and Intel Xeon Processor Optimization Reference Manual.
19.7 Self-modifying code (All processors)
The penalty for executing a piece of code immediately after modifying it is approximately 19 clocks for P1, 31 for PMMX, and 150-300 for PPro, P2 and P3. The P4 will purge the entire trace cache after self-modifying code. The 80486 and earlier processors require a jump between the modifying and the modified code in order to flush the code cache.
To get permission to modify code in a protected operating system you need to call special system functions: in 16-bit Windows call ChangeSelector; in 32-bit Windows call VirtualProtect and FlushInstructionCache (or put the code in a data segment).
Self-modifying code is not considered good programming practice. It should only be used if the gain in speed is substantial and the modified code is executed so many times that the advantage outweighs the penalties for using self-modifying code.
20 Testing speed
The microprocessors in the Pentium family have an internal 64-bit clock counter which can be read into EDX:EAX using the instruction RDTSC (read time stamp counter). This is very useful for measuring exactly how many clock cycles a piece of code takes.
On the PPro, P2, P3 and P4 processors, you have to insert XOR EAX,EAX / CPUID before and after each RDTSC to prevent it from executing in parallel with anything else. CPUID is a serializing instruction, which means that it flushes the pipeline and waits for all pending operations to finish before proceeding. This is very useful for testing purposes.
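The CPUID / RDTSC / CPUID sequence described above can be wrapped in a small C function. The following is a minimal sketch that assumes a GCC-compatible compiler on an x86 target and uses inline assembly; the function name read_tsc_serialized is chosen for this example only. On other compilers or targets it falls back to the standard clock() function, which has far lower resolution.

```c
#include <stdint.h>
#include <time.h>

/* Hypothetical helper: read the time stamp counter with a serializing
   CPUID (leaf 0, i.e. XOR EAX,EAX / CPUID) before and after RDTSC. */
static uint64_t read_tsc_serialized(void)
{
#if defined(__GNUC__) && (defined(__x86_64__) || defined(__i386__))
    uint32_t eax, ebx, ecx, edx, lo, hi;

    eax = 0;                                    /* XOR EAX,EAX */
    __asm__ __volatile__("cpuid"
                         : "+a"(eax), "=b"(ebx), "=c"(ecx), "=d"(edx)
                         :
                         : "memory");
    __asm__ __volatile__("rdtsc"                /* result in EDX:EAX */
                         : "=a"(lo), "=d"(hi)
                         :
                         : "memory");
    eax = 0;                                    /* XOR EAX,EAX */
    __asm__ __volatile__("cpuid"
                         : "+a"(eax), "=b"(ebx), "=c"(ecx), "=d"(edx)
                         :
                         : "memory");
    (void)ebx; (void)ecx; (void)edx;            /* CPUID results are not needed */
    return ((uint64_t)hi << 32) | lo;
#else
    return (uint64_t)clock();                   /* coarse portable fallback */
#endif
}
```

A measurement is then the difference between two calls placed around the code under test. The serializing instructions themselves take time, so you may want to measure an empty pair of calls and subtract that overhead.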
The RDTSC instruction cannot execute in virtual mode on the P1 and PMMX, so if you are testing DOS programs on these processors you must run in real mode.
The biggest problem when counting clock ticks is to avoid interrupts. Protected operating systems may not allow you to clear the interrupt flag, so you cannot avoid interrupts and task switches during the test. There are several alternative ways to overcome this problem:
1. Run the test code with a high priority to minimize the risk of interrupts and task switches.
2. If the piece of code you are testing is relatively short then you may repeat the test several times and assume that the lowest of the clock counts measured represents a situation where no interrupt has occurred.
3. If the piece of code you are testing takes so long that interrupts are unavoidable then you may repeat the test many times and take the average of the clock count measurements.
4. Make a virtual device driver to clear the interrupt flag.
5. Use an operating system that allows clearing the interrupt flag (e.g. Windows 98 without network, in console mode).
6. Start the test program in real mode using the old DOS operating system.
My test programs use methods 1, 2, 5 and 6. These programs are available at www.agner.org/assem/testp.zip. The test programs that use method 6 set up a segment descriptor table and switch to 32-bit protected mode with the interrupt flag cleared. You can insert the code you want to test into these test programs. You need a bootable disk with Windows 98 or earlier to get access to run the test programs in real mode.
Remember when you are measuring clock ticks that a piece of code always takes longer the first few times it is executed, when it is not yet in the code cache or trace cache. Furthermore, it may take three iterations before the branch predictor has adapted to the code.
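Methods 2 and 3 in the list above, combined with a few unmeasured warm-up runs, can be sketched in C as follows. This is an illustration only and is not taken from the test programs mentioned above. The names measure_repeated, timing_result, counter_fn and code_fn are hypothetical, and the sim_now and sim_code functions merely simulate a clock counter so that the effect of cold caches and interrupts is visible. In a real test you would pass a function that reads the time stamp counter.

```c
#include <stdint.h>

typedef uint64_t (*counter_fn)(void);   /* returns the current clock count */
typedef void (*code_fn)(void);          /* the code under test */

typedef struct {
    uint64_t min;   /* lowest count: estimate for short code (method 2) */
    double   mean;  /* average count: for long-running code (method 3)  */
} timing_result;

/* Run the code 'warmup' times without measuring, so that caches and the
   branch predictor can adapt, then measure it 'repeats' times. */
static timing_result measure_repeated(counter_fn now, code_fn code,
                                      int warmup, int repeats)
{
    timing_result r = { UINT64_MAX, 0.0 };
    double sum = 0.0;
    int i;

    for (i = 0; i < warmup; i++)
        code();
    for (i = 0; i < repeats; i++) {
        uint64_t t0 = now();
        code();
        uint64_t elapsed = now() - t0;
        if (elapsed < r.min)
            r.min = elapsed;
        sum += (double)elapsed;
    }
    if (repeats > 0)
        r.mean = sum / repeats;
    return r;
}

/* Simulated counter and code, for demonstration only. */
static uint64_t sim_clock = 0;
static int sim_runs = 0;

static uint64_t sim_now(void) { return sim_clock; }

static void sim_code(void)
{
    sim_runs++;
    sim_clock += 100;               /* steady-state cost of the code       */
    if (sim_runs <= 3)
        sim_clock += 900;           /* first runs: cold cache, no history  */
    if (sim_runs % 4 == 0)
        sim_clock += 5000;          /* every 4th run is hit by an interrupt */
}
```

With three warm-up runs and eight measured runs, measure_repeated(sim_now, sim_code, 3, 8) reports a minimum of 100 simulated clocks, while the mean is inflated by the two simulated interrupts. This is why the minimum is the better estimate for short pieces of code.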
The alignment effects on the PPro, P2 and P3 processors make time measurements very difficult on these processors. Assume that you have a piece of code and you want to make a change which you expect to make the code a few clocks faster. The modified code does not have exactly the same size as the original. This means that the code below the modification will be aligned differently and the instruction fetch blocks will be different. If instruction fetch and decoding is a bottleneck, which is often the case on these processors, then the change in the alignment may make the code several clock cycles faster or slower. The change in the alignment may actually have a larger effect on the clock count than the modification you have made. So you may be unable to verify whether the modification in itself makes the code faster or slower. It can be quite difficult to predict where each instruction fetch block begins, as explained on page 62.
The P1, PMMX and P4 processors do not have these alignment problems. The P4 does, however, have a somewhat similar, though less severe, effect. This effect is caused by changes in the alignment of uops in the trace cache. The time it takes to jump to the least common (but predicted) branch after a conditional jump instruction may differ by up to two clock cycles on different alignments if trace cache delivery is the bottleneck. The alignment of uops in the trace cache lines is difficult to predict (see page 79).
The processors in the Pentium family have special performance monitor counters which can count events such as cache misses, misalignments, branch mispredictions, etc. You need privileged access to set up these counters. The performance monitor counters are model specific. This means that you must use a different test setup for each microprocessor model. Details about how to use the performance monitor counters can be found in Intel's Software Developer's Manuals.
The test programs at www.agner.org/assem/testp.zip give access to the performance monitor counters when run under real mode DOS.