- Introduction
- Assembly language syntax
- Microprocessor versions covered by this manual
- Getting started with optimization
- Speed versus program clarity and security
- Choice of programming language
- Choice of algorithm
- Memory model
- Finding the hot spots
- Literature
- Optimizing in C++
- Use optimization options
- Identify the most critical parts of your code
- Break dependence chains
- Use local variables
- Use array of structures rather than structure of arrays
- Alignment of data
- Division
- Function calls
- Conversion from floating-point numbers to integers
- Character arrays versus string objects
- Combining assembly and high level language
- Inline assembly
- Calling conventions
- Data storage in C++
- Register usage in 16 bit mode DOS or Windows
- Register usage in 32 bit Windows
- Register usage in Linux
- Making compiler-independent code
- Adding support for multiple compilers in .asm modules
- Further compiler incompatibilities
- Object file formats
- Using MASM under Linux
- Object oriented programming
- Other high level languages
- Debugging and verifying assembly code
- Reducing code size
- Detecting processor type
- Checking for operating system support for XMM registers
- Alignment
- Cache
- First time versus repeated execution
- Out-of-order execution (PPro, P2, P3, P4)
- Instructions are split into uops
- Register renaming
- Dependence chains
- Branch prediction (all processors)
- Prediction methods for conditional jumps
- Branch prediction in P1
- Branch prediction in PMMX, PPro, P2, and P3
- Branch prediction in P4
- Indirect jumps (all processors)
- Returns (all processors except P1)
- Static prediction
- Close jumps
- Avoiding jumps (all processors)
- Optimizing for P1 and PMMX
- Pairing integer instructions
- Address generation interlock
- Splitting complex instructions into simpler ones
- Prefixes
- Scheduling floating-point code
- Optimizing for PPro, P2, and P3
- The pipeline in PPro, P2 and P3
- Register renaming
- Register read stalls
- Out of order execution
- Retirement
- Partial register stalls
- Partial memory stalls
- Bottlenecks in PPro, P2, P3
- Optimizing for P4
- Trace cache
- Instruction decoding
- Execution units
- Do the floating-point and MMX units run at half speed?
- Transfer of data between execution units
- Retirement
- Partial registers and partial flags
- Partial memory access
- Memory intermediates in dependencies
- Breaking dependencies
- Choosing the optimal instructions
- Bottlenecks in P4
- Loop optimization (all processors)
- Loops in P1 and PMMX
- Loops in PPro, P2, and P3
- Loops in P4
- Macro loops (all processors)
- Single-Instruction-Multiple-Data programming
- Problematic Instructions
- XCHG (all processors)
- Shifts and rotates (P4)
- Rotates through carry (all processors)
- String instructions (all processors)
- Bit test (all processors)
- Integer multiplication (all processors)
- Division (all processors)
- LEA instruction (all processors)
- WAIT instruction (all processors)
- FCOM + FSTSW AX (all processors)
- FPREM (all processors)
- FRNDINT (all processors)
- FSCALE and exponential function (all processors)
- FPTAN (all processors)
- FSQRT (P3 and P4)
- FLDCW (PPro, P2, P3, P4)
- Bit scan (P1 and PMMX)
- Special topics
- Freeing floating-point registers (all processors)
- Transitions between floating-point and MMX instructions (PMMX, P2, P3, P4)
- Converting from floating-point to integer (All processors)
- Using integer instructions for floating-point operations
- Using floating-point instructions for integer operations
- Moving blocks of data (All processors)
- Self-modifying code (All processors)
- Testing speed
- List of instruction timings for P1 and PMMX
- Integer instructions (P1 and PMMX)
- Floating-point instructions (P1 and PMMX)
- MMX instructions (PMMX)
- List of instruction timings and uop breakdown for PPro, P2 and P3
- Integer instructions (PPro, P2 and P3)
- Floating-point instructions (PPro, P2 and P3)
- MMX instructions (P2 and P3)
- List of instruction timings and uop breakdown for P4
- Integer instructions
- Floating-point instructions
- SIMD integer instructions
- SIMD floating-point instructions
- Comparison of the different microprocessors
An example of an assembly language function library that can be called from many different languages and platforms can be found at www.agner.org/random/randoma.zip.
5 Debugging and verifying assembly code
Debugging assembly code can be quite hard and frustrating, as you have probably already discovered. I would recommend that you start by writing the piece of code you want to optimize as a subroutine in C++. Next, write a test program that can test your subroutine thoroughly. Make sure the test program goes into all branches and boundary cases.
When your C++ subroutine works with your test program, you are ready to translate the code to assembly language. Most C++ compilers can translate C++ to assembly.
Now you can start to optimize. Each time you have made a modification, you should run it on the test program to see if it works correctly. Number all your versions and save them so that you can go back and test them again in case you discover an error that the test program didn't catch (such as writing to a wrong address).
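The workflow above can be sketched as a small C++ test harness. This is only an illustration under stated assumptions: the routine being optimized (a wrapping sum of 32-bit integers), the names sum_reference, sum_candidate and count_mismatches, and the chosen test sizes are all hypothetical and not taken from this manual. In practice the candidate would be the assembly version, declared extern "C" and linked from the .asm module; here it is an unrolled C++ loop so that the sketch compiles on its own.

```cpp
#include <cstddef>
#include <cstdint>
#include <limits>
#include <random>
#include <vector>

// Reference version: plain C++ that is easy to verify by inspection.
// (Hypothetical example routine: sum of 32-bit integers with wraparound.)
std::uint32_t sum_reference(const std::uint32_t* data, std::size_t n) {
    std::uint32_t total = 0;
    for (std::size_t i = 0; i < n; ++i) total += data[i];
    return total;
}

// Candidate version: stands in for the optimized routine under test.
// Assumption: in real use this would be the assembly implementation.
std::uint32_t sum_candidate(const std::uint32_t* data, std::size_t n) {
    std::uint32_t t0 = 0, t1 = 0, t2 = 0, t3 = 0;
    std::size_t i = 0;
    for (; i + 4 <= n; i += 4) {          // main loop, unrolled by 4
        t0 += data[i];
        t1 += data[i + 1];
        t2 += data[i + 2];
        t3 += data[i + 3];
    }
    for (; i < n; ++i) t0 += data[i];     // remainder loop
    return t0 + t1 + t2 + t3;
}

// Compares both versions and returns the number of mismatches.
// Sizes 0..17 cover the empty case, the remainder loop alone,
// and several combinations of main loop plus remainder.
int count_mismatches() {
    std::mt19937 rng(12345);              // fixed seed: failures are reproducible
    int mismatches = 0;
    for (std::size_t n = 0; n <= 17; ++n) {
        std::vector<std::uint32_t> buf(n);
        for (auto& x : buf) x = static_cast<std::uint32_t>(rng());
        if (n > 0) buf[0] = std::numeric_limits<std::uint32_t>::max(); // force wraparound
        if (sum_reference(buf.data(), n) != sum_candidate(buf.data(), n)) ++mismatches;
    }
    return mismatches;
}
```

Keeping the reference version in the test program means that every numbered version of the optimized code can be re-run against it later with the same seed.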
Test the speed of the most critical part of your program with the methods described in chapter 20 page 132. If the code is significantly slower than expected, then check the list of possible bottlenecks on page 75 for PPro, P2 and P3, and page 95 for P4.
Highly optimized code tends to be very difficult to read and understand for others, and even for yourself when you get back to it after some time. In order to make it possible to maintain the code, it is important that you organize it into small logical units (procedures or macros) with a well-defined interface and appropriate comments. The more complicated the code is to read, the more important good documentation becomes.
6 Reducing code size
As explained in chapter 9 page 29, the code cache is 8 or 16 kB on P1, PMMX, PPro, P2 and P3. If you have problems keeping the critical parts of your code within the code cache, then you may consider reducing the size of your code. You may also want to reduce the size of your code if speed is not important.
32-bit code is usually bigger than 16-bit code because addresses and data constants take 4 bytes in 32-bit code and only 2 bytes in 16-bit code. However, 16-bit code has other penalties, especially because of segment prefixes. Some other methods for reducing the size of your code are discussed below.
Jump addresses, data addresses, and data constants all take less space if they can be expressed as a sign-extended byte, i.e. if they are within the interval from -128 to +127.
For jump addresses, this means that short jumps take two bytes of code, whereas jumps beyond 127 bytes take 5 bytes if unconditional and 6 bytes if conditional.
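The rule for jump sizes can be written down as a small C++ sketch. The function names are hypothetical, and the sketch assumes 32-bit code and that the displacement is measured from the end of the jump instruction, which is how relative jumps are encoded.

```cpp
#include <cstdint>

// True if v can be stored as a sign-extended byte (-128..+127).
constexpr bool fits_in_signed_byte(std::int32_t v) {
    return v >= -128 && v <= 127;
}

// Size in bytes of a relative jump in 32-bit code.
// displacement: distance from the end of the jump instruction to the target.
constexpr int jump_size(std::int32_t displacement, bool conditional) {
    return fits_in_signed_byte(displacement) ? 2    // short form, conditional or not
         : conditional                       ? 6    // near conditional jump
                                             : 5;   // near unconditional jump
}

static_assert(jump_size(100, true) == 2, "short conditional jump");
static_assert(jump_size(1000, false) == 5, "near unconditional jump");
```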
Likewise, data addresses take less space if they can be expressed as a pointer and a displacement between -128 and +127. Example:
MOV EBX,DS:[100000] / ADD EBX,DS:[100004] ; 12 bytes
Reduce to:
MOV EAX,100000 / MOV EBX,[EAX] / ADD EBX,[EAX+4] ; 10 bytes
The advantage of using a pointer obviously increases if you use it many times. Storing data on the stack and using EBP or ESP as pointer will thus make your code smaller than if you use static memory locations and absolute addresses, provided of course that your data are within -128 to +127 bytes of the pointer. Using PUSH and POP to write and read temporary data is even shorter.
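To see when the pointer pays off, the byte counts from the example above can be tallied in C++. This is a rough sketch with hypothetical function names. It assumes that the base register is one of EAX, EBX, ECX, EDX, ESI or EDI (ESP needs an extra byte, and EBP always needs a displacement), and that the data are accessed as consecutive 4-byte items starting at the pointer.

```cpp
#include <cstdint>

// "op reg32,[address32]": opcode + ModRM byte + 4-byte absolute address.
constexpr int absolute_access_size() { return 6; }

// "op reg32,[base+disp]" with base assumed to be EAX, EBX, ECX, EDX, ESI or EDI.
constexpr int based_access_size(std::int32_t disp) {
    return disp == 0                     ? 2    // no displacement byte
         : (disp >= -128 && disp <= 127) ? 3    // sign-extended byte displacement
                                         : 6;   // 4-byte displacement
}

// Total bytes for n accesses using absolute addresses.
constexpr int total_absolute(int n) { return n * absolute_access_size(); }

// Total bytes for n accesses at offsets 0, 4, 8, ... from a pointer
// that is first loaded with a 5-byte MOV reg,imm32.
int total_via_pointer(int n) {
    int bytes = 5;
    for (int i = 0; i < n; ++i) bytes += based_access_size(4 * i);
    return bytes;
}
```

With these assumptions a single access costs 7 bytes through the pointer against 6 bytes with an absolute address, two accesses cost 10 against 12, and the gap widens with each further access.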
Data constants may also take less space if they are between -128 and +127. Most instructions with immediate operands have a short form where the operand is a sign-extended single byte. Examples:
PUSH 200        ; 5 bytes
PUSH 100        ; 2 bytes
ADD EBX,128     ; 6 bytes
SUB EBX,-128    ; 3 bytes
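The sizes in these examples follow one simple rule, sketched below in C++. The function and type names are hypothetical. The sketch assumes 32-bit code and a destination register other than EAX. It also shows why ADD EBX,128 is better written as SUB EBX,-128: the value 128 does not fit in a sign-extended byte, but -128 does. The two forms give the same result in the register, though the flags may differ.

```cpp
#include <cstdint>
#include <limits>

// True if v can be stored as a sign-extended byte (-128..+127).
constexpr bool is_imm8(std::int32_t v) { return v >= -128 && v <= 127; }

// PUSH with an immediate operand: 2 bytes in the short form, else 5.
constexpr int push_imm_size(std::int32_t imm) { return is_imm8(imm) ? 2 : 5; }

// ADD or SUB reg32,imm (register assumed not to be EAX): 3 bytes or 6 bytes.
constexpr int alu_imm_size(std::int32_t imm) { return is_imm8(imm) ? 3 : 6; }

// Result of choosing between ADD reg,value and SUB reg,-value.
struct AddChoice {
    bool use_sub;        // true: emit SUB with the negated constant
    std::int32_t imm;    // immediate operand to emit
    int size;            // instruction size in bytes
};

// Picks the shorter of the two forms for adding `value` to a register.
constexpr AddChoice shortest_add(std::int32_t value) {
    // The most negative value cannot be negated, so it is left as ADD.
    return (!is_imm8(value)
            && value != std::numeric_limits<std::int32_t>::min()
            && is_imm8(-value))
        ? AddChoice{true, -value, alu_imm_size(-value)}
        : AddChoice{false, value, alu_imm_size(value)};
}
```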
The most important instruction with an immediate operand that does not have such a short form is MOV. Examples:
MOV EAX,0                ; 5 bytes
May be changed to:
SUB EAX,EAX              ; 2 bytes
And
MOV EAX,1                ; 5 bytes
May be changed to:
SUB EAX,EAX / INC EAX    ; 3 bytes
or:
PUSH 1 / POP EAX         ; 3 bytes
And
MOV EAX,-1               ; 5 bytes
May be changed to:
OR EAX,-1                ; 3 bytes
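The replacements above can be summarized as a size lookup, sketched here in C++. The function name is hypothetical. The sketch assumes that the PUSH / POP EAX pair works for any constant that fits in a sign-extended byte, since PUSH has a 2-byte short form and POP EAX takes 1 byte. Note that SUB, INC and OR change the flags, whereas MOV does not, so the substitution is only safe where the flags are not needed afterwards.

```cpp
#include <cstdint>

// Size in bytes of the shortest sequence discussed in the text
// that loads `value` into EAX in 32-bit code.
constexpr int load_constant_size(std::int32_t value) {
    return value == 0                      ? 2    // SUB EAX,EAX
         : (value >= -128 && value <= 127) ? 3    // PUSH imm8 / POP EAX
                                           : 5;   // MOV EAX,imm32
}

static_assert(load_constant_size(0) == 2, "SUB EAX,EAX");
static_assert(load_constant_size(200) == 5, "MOV EAX,200");
```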
If the same address or constant is used more than once then you may load it into a register. A MOV with a 4-byte immediate operand may sometimes be replaced by an arithmetic instruction if the value of the register before the MOV is known. Example:
MOV [mem1],200    ; 10 bytes
MOV [mem2],200    ; 10 bytes
MOV [mem3],201    ; 10 bytes
MOV EAX,100       ; 5 bytes
MOV EBX,150       ; 5 bytes
Assuming that mem1 and mem3 are both within -128/+127 bytes of mem2, this may be changed to:
MOV EBX,OFFSET mem2        ; 5 bytes
MOV EAX,200                ; 5 bytes
MOV [EBX+mem1-mem2],EAX    ; 3 bytes
MOV [EBX],EAX              ; 2 bytes