- •Introduction
- •Assembly language syntax
- •Microprocessor versions covered by this manual
- •Getting started with optimization
- •Speed versus program clarity and security
- •Choice of programming language
- •Choice of algorithm
- •Memory model
- •Finding the hot spots
- •Literature
- •Optimizing in C++
- •Use optimization options
- •Identify the most critical parts of your code
- •Break dependence chains
- •Use local variables
- •Use array of structures rather than structure of arrays
- •Alignment of data
- •Division
- •Function calls
- •Conversion from floating-point numbers to integers
- •Character arrays versus string objects
- •Combining assembly and high level language
- •Inline assembly
- •Calling conventions
- •Data storage in C++
- •Register usage in 16 bit mode DOS or Windows
- •Register usage in 32 bit Windows
- •Register usage in Linux
- •Making compiler-independent code
- •Adding support for multiple compilers in .asm modules
- •Further compiler incompatibilities
- •Object file formats
- •Using MASM under Linux
- •Object oriented programming
- •Other high level languages
- •Debugging and verifying assembly code
- •Reducing code size
- •Detecting processor type
- •Checking for operating system support for XMM registers
- •Alignment
- •Cache
- •First time versus repeated execution
- •Out-of-order execution (PPro, P2, P3, P4)
- •Instructions are split into uops
- •Register renaming
- •Dependence chains
- •Branch prediction (all processors)
- •Prediction methods for conditional jumps
- •Branch prediction in P1
- •Branch prediction in PMMX, PPro, P2, and P3
- •Branch prediction in P4
- •Indirect jumps (all processors)
- •Returns (all processors except P1)
- •Static prediction
- •Close jumps
- •Avoiding jumps (all processors)
- •Optimizing for P1 and PMMX
- •Pairing integer instructions
- •Address generation interlock
- •Splitting complex instructions into simpler ones
- •Prefixes
- •Scheduling floating-point code
- •Optimizing for PPro, P2, and P3
- •The pipeline in PPro, P2 and P3
- •Register renaming
- •Register read stalls
- •Out of order execution
- •Retirement
- •Partial register stalls
- •Partial memory stalls
- •Bottlenecks in PPro, P2, P3
- •Optimizing for P4
- •Trace cache
- •Instruction decoding
- •Execution units
- •Do the floating-point and MMX units run at half speed?
- •Transfer of data between execution units
- •Retirement
- •Partial registers and partial flags
- •Partial memory access
- •Memory intermediates in dependencies
- •Breaking dependencies
- •Choosing the optimal instructions
- •Bottlenecks in P4
- •Loop optimization (all processors)
- •Loops in P1 and PMMX
- •Loops in PPro, P2, and P3
- •Loops in P4
- •Macro loops (all processors)
- •Single-Instruction-Multiple-Data programming
- •Problematic Instructions
- •XCHG (all processors)
- •Shifts and rotates (P4)
- •Rotates through carry (all processors)
- •String instructions (all processors)
- •Bit test (all processors)
- •Integer multiplication (all processors)
- •Division (all processors)
- •LEA instruction (all processors)
- •WAIT instruction (all processors)
- •FCOM + FSTSW AX (all processors)
- •FPREM (all processors)
- •FRNDINT (all processors)
- •FSCALE and exponential function (all processors)
- •FPTAN (all processors)
- •FSQRT (P3 and P4)
- •FLDCW (PPro, P2, P3, P4)
- •Bit scan (P1 and PMMX)
- •Special topics
- •Freeing floating-point registers (all processors)
- •Transitions between floating-point and MMX instructions (PMMX, P2, P3, P4)
- •Converting from floating-point to integer (All processors)
- •Using integer instructions for floating-point operations
- •Using floating-point instructions for integer operations
- •Moving blocks of data (All processors)
- •Self-modifying code (All processors)
- •Testing speed
- •List of instruction timings for P1 and PMMX
- •Integer instructions (P1 and PMMX)
- •Floating-point instructions (P1 and PMMX)
- •MMX instructions (PMMX)
- •List of instruction timings and uop breakdown for PPro, P2 and P3
- •Integer instructions (PPro, P2 and P3)
- •Floating-point instructions (PPro, P2 and P3)
- •MMX instructions (P2 and P3)
- •List of instruction timings and uop breakdown for P4
- •Integer instructions
- •Floating-point instructions
- •SIMD integer instructions
- •SIMD floating-point instructions
- •Comparison of the different microprocessors
of a critical dependence chain, especially on the P4 processor. There are several alternative ways to avoid this:
1. Keep the most critical dependence chain entirely inside one function.
2. Use inline functions. An inline function will be expanded like a macro without parameter transfer, if possible.
3. Use #define macros with parameters instead of functions. But beware that macro parameters are evaluated every time they are used.
Example:
#define max(a,b) (a > b ? a : b)
y = max(sin(x), cos(x));
In this example, sin(x) and cos(x) are each calculated twice because the macro references each parameter twice. This is certainly not optimal.
4. Declare functions __fastcall. The first two or three (depending on compiler) integer parameters will be transferred in registers rather than on the stack when the function is declared __fastcall. Floating-point parameters are always stored on the stack. The implicit 'this' pointer in member functions (methods) is also treated like a parameter, so there may be only one free register left for transferring your parameters. Therefore, make sure that the most critical integer parameter comes first when you are using __fastcall.
5. Declare functions static. Static functions have no external linkage. This enables the compiler to optimize across function calls.
Your compiler may ignore the optimization hints given by the inline and static modifiers, whereas __fastcall is certain to have an effect on the first two or three integer parameters. #define macros, which are always expanded in place, avoid parameter transfer altogether, so they have an effect on floating-point parameters as well.
If a large object is transferred to a function as a parameter, then the entire object is copied. The copy constructor is called if there is one. If copying the object is not necessary for the logic of your algorithm, then you can save time by transferring a pointer or reference to the object rather than a copy of the object, or by making the function a member of the object's class. Whether you choose to use a pointer, a reference, or a member function is a matter of programming style. All three methods will produce the same compiled code, which will be more efficient than copying the object. In general, pointers, references, and member functions are quite efficient. Feel free to use them whenever it is useful for the logic structure of your program. Function pointers and virtual functions are somewhat less efficient.
3.9 Conversion from floating-point numbers to integers
According to the standards for the C++ language, all conversions from floating-point numbers to integers use truncation towards zero, rather than rounding. This is unfortunate because truncation takes much longer than rounding on most microprocessors. It is beyond my comprehension why there is no round function in standard C++ libraries. If you cannot avoid conversions from float or double to int in the critical part of your code, then you may make your own round function using assembly language:
inline int round (double x)
{
    int n;
    __asm fld x;
    __asm fistp n;
    return n;
}