- Introduction
- Assembly language syntax
- Microprocessor versions covered by this manual
- Getting started with optimization
- Speed versus program clarity and security
- Choice of programming language
- Choice of algorithm
- Memory model
- Finding the hot spots
- Literature
- Optimizing in C++
- Use optimization options
- Identify the most critical parts of your code
- Break dependence chains
- Use local variables
- Use array of structures rather than structure of arrays
- Alignment of data
- Division
- Function calls
- Conversion from floating-point numbers to integers
- Character arrays versus string objects
- Combining assembly and high level language
- Inline assembly
- Calling conventions
- Data storage in C++
- Register usage in 16 bit mode DOS or Windows
- Register usage in 32 bit Windows
- Register usage in Linux
- Making compiler-independent code
- Adding support for multiple compilers in .asm modules
- Further compiler incompatibilities
- Object file formats
- Using MASM under Linux
- Object oriented programming
- Other high level languages
- Debugging and verifying assembly code
- Reducing code size
- Detecting processor type
- Checking for operating system support for XMM registers
- Alignment
- Cache
- First time versus repeated execution
- Out-of-order execution (PPro, P2, P3, P4)
- Instructions are split into uops
- Register renaming
- Dependence chains
- Branch prediction (all processors)
- Prediction methods for conditional jumps
- Branch prediction in P1
- Branch prediction in PMMX, PPro, P2, and P3
- Branch prediction in P4
- Indirect jumps (all processors)
- Returns (all processors except P1)
- Static prediction
- Close jumps
- Avoiding jumps (all processors)
- Optimizing for P1 and PMMX
- Pairing integer instructions
- Address generation interlock
- Splitting complex instructions into simpler ones
- Prefixes
- Scheduling floating-point code
- Optimizing for PPro, P2, and P3
- The pipeline in PPro, P2 and P3
- Register renaming
- Register read stalls
- Out of order execution
- Retirement
- Partial register stalls
- Partial memory stalls
- Bottlenecks in PPro, P2, P3
- Optimizing for P4
- Trace cache
- Instruction decoding
- Execution units
- Do the floating-point and MMX units run at half speed?
- Transfer of data between execution units
- Retirement
- Partial registers and partial flags
- Partial memory access
- Memory intermediates in dependencies
- Breaking dependencies
- Choosing the optimal instructions
- Bottlenecks in P4
- Loop optimization (all processors)
- Loops in P1 and PMMX
- Loops in PPro, P2, and P3
- Loops in P4
- Macro loops (all processors)
- Single-Instruction-Multiple-Data programming
- Problematic Instructions
- XCHG (all processors)
- Shifts and rotates (P4)
- Rotates through carry (all processors)
- String instructions (all processors)
- Bit test (all processors)
- Integer multiplication (all processors)
- Division (all processors)
- LEA instruction (all processors)
- WAIT instruction (all processors)
- FCOM + FSTSW AX (all processors)
- FPREM (all processors)
- FRNDINT (all processors)
- FSCALE and exponential function (all processors)
- FPTAN (all processors)
- FSQRT (P3 and P4)
- FLDCW (PPro, P2, P3, P4)
- Bit scan (P1 and PMMX)
- Special topics
- Freeing floating-point registers (all processors)
- Transitions between floating-point and MMX instructions (PMMX, P2, P3, P4)
- Converting from floating-point to integer (All processors)
- Using integer instructions for floating-point operations
- Using floating-point instructions for integer operations
- Moving blocks of data (All processors)
- Self-modifying code (All processors)
- Testing speed
- List of instruction timings for P1 and PMMX
- Integer instructions (P1 and PMMX)
- Floating-point instructions (P1 and PMMX)
- MMX instructions (PMMX)
- List of instruction timings and uop breakdown for PPro, P2 and P3
- Integer instructions (PPro, P2 and P3)
- Floating-point instructions (PPro, P2 and P3)
- MMX instructions (P2 and P3)
- List of instruction timings and uop breakdown for P4
- Integer instructions
- Floating-point instructions
- SIMD integer instructions
- SIMD floating-point instructions
- Comparison of the different microprocessors
        AND     EAX,ECX
        MOV     DWORD PTR [TEMP],EAX
        FILD    QWORD PTR [TEMP]
        FSTP    QWORD PTR [TEMP]
        WAIT                            ; WAIT only needed for compatibility with old 80287
        MOV     ECX, DWORD PTR [TEMP+4]
        SHR     ECX,20
        SUB     ECX,3FFH
        TEST    EAX,EAX                 ; clear zero flag
BS2:
These emulation codes should not be used on later processors.
19 Special topics
19.1 Freeing floating-point registers (all processors)
You have to free all used floating-point registers before exiting a subroutine, except for any register used for the result.
The fastest way of freeing one register is FSTP ST. The fastest way of freeing two registers is FCOMPP on P1 and PMMX. On later processors you may use either FCOMPP or twice FSTP ST, whichever fits best into the decoding sequence (PPro, P2, P3) or port load (P4).
It is not recommended to use FFREE.
19.2 Transitions between floating-point and MMX instructions (PMMX, P2, P3, P4)
It is not possible to use 64-bit MMX registers and 80-bit floating-point registers in the same part of the code. You must issue an EMMS instruction after the last instruction that uses 64-bit MMX registers if there is a possibility that later code uses floating-point registers. You may avoid this problem by using 128-bit XMM registers instead.
On PMMX there is a high penalty for switching between floating-point and MMX instructions. The first floating-point instruction after an EMMS takes approximately 58 clocks extra, and the first MMX instruction after a floating-point instruction takes approximately 38 clocks extra.
On P2, P3 and P4 there is no such penalty. The delay after EMMS can be hidden by putting in integer instructions between EMMS and the first floating-point instruction.
19.3 Converting from floating-point to integer (All processors)
All conversions between floating-point registers and integer registers must go via a memory location:
FISTP DWORD PTR [TEMP]
MOV EAX, [TEMP]
On PPro, P2, P3 and especially P4, this code is likely to have a penalty for attempting to read from [TEMP] before the write to [TEMP] is finished. It doesn't help to put in a WAIT. It is recommended that you put in other instructions between the write to [TEMP] and the read from [TEMP] if possible in order to avoid this penalty. This applies to all the examples that follow.
The specifications for the C and C++ languages require that conversion from floating-point numbers to integers use truncation rather than rounding. The method used by most C libraries is to change the floating-point control word to indicate truncation before using an FISTP instruction, and to change it back again afterwards. This method is very slow on all processors. On PPro and later processors, the floating-point control word cannot be renamed, so all subsequent floating-point instructions must wait for the FLDCW instruction to retire. See the section on FLDCW (PPro, P2, P3, P4).
On the P3 and P4 you can avoid all these problems by using XMM registers instead of floating-point registers and using the CVT.. instructions to avoid the memory intermediate. (On the P3, these instructions are only available in single precision).
Whenever you have a conversion from a floating-point register to an integer register, you should consider whether you can use rounding to nearest integer instead of truncation.
If you need truncation inside a loop then you should change the control word only outside the loop if the rest of the floating-point instructions in the loop can work correctly in truncation mode.
You may use various tricks for truncating without changing the control word, as illustrated in the examples below. These examples presume that the control word is set to default, i.e. rounding to nearest or even.
;Rounding to nearest or even:
;extern "C" int round (double x);
_round  PROC    NEAR
PUBLIC  _round
        FLD     QWORD PTR [ESP+4]
        FISTP   DWORD PTR [ESP+4]
        MOV     EAX, DWORD PTR [ESP+4]
        RET
_round  ENDP
;Truncation towards zero:
;extern "C" int truncate (double x);
_truncate PROC  NEAR
PUBLIC  _truncate
        FLD     QWORD PTR [ESP+4]       ; x
        SUB     ESP, 12                 ; space for local variables
        FIST    DWORD PTR [ESP]         ; rounded value
        FST     DWORD PTR [ESP+4]       ; float value
        FISUB   DWORD PTR [ESP]         ; subtract rounded value
        FSTP    DWORD PTR [ESP+8]       ; difference
        POP     EAX                     ; rounded value
        POP     ECX                     ; float value
        POP     EDX                     ; difference (float)
        TEST    ECX, ECX                ; test sign of x
        JS      SHORT NEGATIVE
        ADD     EDX, 7FFFFFFFH          ; produce carry if difference < -0
        SBB     EAX, 0                  ; subtract 1 if x-round(x) < -0
        RET
NEGATIVE:
        XOR     ECX, ECX
        TEST    EDX, EDX
        SETG    CL                      ; 1 if difference > 0
        ADD     EAX, ECX                ; add 1 if x-round(x) > 0
        RET
_truncate ENDP
;Truncation towards minus infinity:
;extern "C" int ifloor (double x);
_ifloor PROC    NEAR
PUBLIC  _ifloor
        FLD     QWORD PTR [ESP+4]       ; x
        SUB     ESP, 8                  ; space for local variables
        FIST    DWORD PTR [ESP]         ; rounded value
        FISUB   DWORD PTR [ESP]         ; subtract rounded value
        FSTP    DWORD PTR [ESP+4]       ; difference
        POP     EAX                     ; rounded value
        POP     EDX                     ; difference (float)
        ADD     EDX, 7FFFFFFFH          ; produce carry if difference < -0
        SBB     EAX, 0                  ; subtract 1 if x-round(x) < -0
        RET
_ifloor ENDP
These procedures work for $-2^{31} < x < 2^{31}-1$. They do not check for overflow or NaNs.
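The adjustment logic of the truncation procedures can be restated in C as a rough sketch. All function names here are hypothetical and not part of the manual. The helper `round_nearest_even` merely stands in for FIST: it adds and subtracts 1.5·2^52, which is an assumption that only holds for IEEE-754 doubles evaluated in double precision, under the default rounding mode, and for |x| < 2^31.

```c
/* Round to nearest or even, standing in for FIST under the default
   rounding mode. Assumption: adding 1.5 * 2^52 pushes the fraction bits
   out of the mantissa, so the sum is rounded to an integer. */
int round_nearest_even(double x)
{
    const double magic = 6755399441055744.0;  /* 1.5 * 2^52 */
    volatile double t = x + magic;            /* volatile: force the sum to double precision */
    return (int)(t - magic);                  /* exact integer value, so the cast is safe */
}

/* Truncation towards zero, derived from the rounded value. */
int truncate_from_round(double x)
{
    int r = round_nearest_even(x);
    double d = x - (double)r;                 /* difference x - round(x) */
    if (x < 0.0) {
        if (d > 0.0) r++;                     /* add 1 if x - round(x) > 0 */
    } else {
        if (d < 0.0) r--;                     /* subtract 1 if x - round(x) < 0 */
    }
    return r;
}

/* Truncation towards minus infinity. */
int floor_from_round(double x)
{
    int r = round_nearest_even(x);
    if (x - (double)r < 0.0) r--;             /* subtract 1 if x - round(x) < 0 */
    return r;
}
```

Only the sign of the difference matters, which is why the assembly versions can inspect it with integer instructions after storing it as a single-precision float.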
19.4 Using integer instructions for floating-point operations
Integer instructions are generally faster than floating-point instructions, so it is often advantageous to use integer instructions for doing simple floating-point operations. The most obvious example is moving data. For example
FLD QWORD PTR [ESI] / FSTP QWORD PTR [EDI]
can be replaced by:
MOV EAX,[ESI] / MOV EBX,[ESI+4] / MOV [EDI],EAX / MOV [EDI+4],EBX
or:
MOVQ MM0,[ESI] / MOVQ [EDI],MM0
Many other manipulations are possible if you know how floating-point numbers are represented in binary format. The floating-point format used in registers as well as in memory is in accordance with the IEEE-754 standard. Future implementations are certain to use the same format. The floating-point format consists of three parts: the sign s, mantissa m, and exponent e:
$$x = s \cdot m \cdot 2^{e}.$$
The sign s is represented as one bit, where a zero means +1 and a one means -1. The mantissa is a value in the interval 1 ≤ m < 2. The binary representation of m always has a 1 before the radix point. This 1 is not stored, except in the long double (80 bits) format. Thus, the left-most stored bit of the mantissa represents ½, the next bit represents ¼, etc. The exponent e can be both positive and negative. It is not stored in the usual two's complement signed format, but in a biased format where 0 is represented by the value that has all but the most significant bit = 1. This format makes comparisons easier. The value x = 0.0 is represented by setting all bits of m and e to zero. The sign bit may be 0 or 1, so we can actually distinguish between +0.0 and -0.0, but comparisons must of course treat +0.0 and -0.0 as equal. The bit positions are shown in this table:
| precision | mantissa | always 1 | exponent | sign |
|---|---|---|---|---|
| single (32 bits) | bit 0 - 22 | | bit 23 - 30 | bit 31 |
| double (64 bits) | bit 0 - 51 | | bit 52 - 62 | bit 63 |
| long double (80 bits) | bit 0 - 62 | bit 63 | bit 64 - 78 | bit 79 |
From this table we can find that the value 1.0 is represented as 3F80,0000H in single precision format, 3FF0,0000,0000,0000H in double precision, and 3FFF,8000,0000,0000,0000H in long double precision.
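The bit layout can be checked with a small C sketch. The helper names are hypothetical, and the code assumes that `float` and `double` are the IEEE-754 single and double formats described here; `memcpy` is used to read the bit pattern without violating aliasing rules.

```c
#include <stdint.h>
#include <string.h>

/* Return the raw bit pattern of a single-precision float. */
uint32_t float_bits(float f)
{
    uint32_t u;
    memcpy(&u, &f, sizeof u);
    return u;
}

/* Return the raw bit pattern of a double-precision float. */
uint64_t double_bits(double d)
{
    uint64_t u;
    memcpy(&u, &d, sizeof u);
    return u;
}

/* Field extraction for single precision, following the table above. */
unsigned float_sign(float f)     { return float_bits(f) >> 31; }           /* bit 31      */
unsigned float_exponent(float f) { return (float_bits(f) >> 23) & 0xFFu; } /* bit 23 - 30 */
uint32_t float_mantissa(float f) { return float_bits(f) & 0x7FFFFFu; }     /* bit 0 - 22  */
```

For 1.0 the stored exponent is the bias 127 (7FH) and the stored mantissa is zero, which gives 3F80,0000H.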
Generating constants
It is possible to generate simple floating-point constants without using data in memory:
; generate four single-precision values = 1.0
        PCMPEQD XMM0,XMM0               ; generate all 1's
        PSRLD   XMM0,25                 ; seven 1's
        PSLLD   XMM0,23                 ; shift into exponent field
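The same shift sequence can be traced on a single 32-bit element in C. This is only a scalar sketch of what each DWORD lane goes through; the function name is hypothetical and `float` is assumed to be IEEE-754 single precision.

```c
#include <stdint.h>
#include <string.h>

/* Build the constant 1.0f from an all-ones pattern using shifts only. */
float one_from_shifts(void)
{
    uint32_t bits = 0xFFFFFFFFu;   /* like PCMPEQD: all 1's               */
    bits >>= 25;                   /* like PSRLD 25: seven 1's (7FH)      */
    bits <<= 23;                   /* like PSLLD 23: move to exponent field */
    float f;
    memcpy(&f, &bits, sizeof f);   /* reinterpret 3F800000H as a float    */
    return f;
}
```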
To generate the constant 0.0, it is better to use PXOR XMM0,XMM0 than XORPS, XORPD, SUBPS, etc., because the PXOR instruction is recognized by the P4 processor to be independent of the previous value of the register if source and destination are the same, while this is not the case for the other instructions.
Testing if a floating-point value is zero
To test if a floating-point number is zero, we have to test all bits except the sign bit, which may be either 0 or 1. For example:
FLD DWORD PTR [EBX] / FTST / FNSTSW AX / AND AH,40H / JNZ IsZero
can be replaced by
MOV EAX,[EBX] / ADD EAX,EAX / JZ IsZero
where the ADD EAX,EAX shifts out the sign bit. Double precision floats have 63 bits to test, but if denormal numbers can be ruled out, then you can be certain that the value is zero if the exponent bits are all zero. Example:
FLD QWORD PTR [EBX] / FTST / FNSTSW AX / AND AH,40H / JNZ IsZero
can be replaced by
MOV EAX,[EBX+4] / ADD EAX,EAX / JZ IsZero
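As a sketch of the same tests in C (hypothetical function names, IEEE-754 formats assumed), the sign bit is shifted out and the remaining bits are compared with zero. The double-precision version mirrors the assembly by looking only at the upper 32 bits, so it relies on the assumption that denormal numbers do not occur.

```c
#include <stdint.h>
#include <string.h>

/* Nonzero if f is +0.0 or -0.0. */
int is_zero_float(float f)
{
    uint32_t u;
    memcpy(&u, &f, sizeof u);
    return (uint32_t)(u << 1) == 0;          /* shift out the sign bit */
}

/* Nonzero if d is +0.0 or -0.0, assuming d is never denormal.
   Only the upper 32 bits (sign, exponent, top of mantissa) are tested. */
int is_zero_double_no_denormals(double d)
{
    uint64_t u;
    memcpy(&u, &d, sizeof u);
    uint32_t hi = (uint32_t)(u >> 32);       /* corresponds to [EBX+4] */
    return (uint32_t)(hi << 1) == 0;         /* shift out the sign bit */
}
```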
Manipulating the sign bit
A floating-point number is negative if the sign bit is set and at least one other bit is set. Example (single precision):
MOV EAX,[NumberToTest] / CMP EAX,80000000H / JA IsNegative
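The unsigned comparison can be written in C as follows. The function name is hypothetical and IEEE-754 single precision is assumed. Note that the test treats -0.0 as not negative, and that a NaN with the sign bit set would be reported as negative.

```c
#include <stdint.h>
#include <string.h>

/* Nonzero if the sign bit is set and at least one other bit is set. */
int is_negative_float(float f)
{
    uint32_t u;
    memcpy(&u, &f, sizeof u);
    return u > 0x80000000u;    /* unsigned compare, like CMP / JA */
}
```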
You can change the sign of a floating-point number simply by flipping the sign bit. This is useful when XMM registers are used, because there is no XMM change sign instruction. Example:
; change sign of four single-precision floats in XMM0
        PCMPEQD XMM1,XMM1               ; generate all 1's
        PSLLD   XMM1,31                 ; 1 in the leftmost bit of each DWORD only
        XORPS   XMM0,XMM1               ; change sign of XMM0
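For a single element, the sign flip amounts to an XOR with a mask that has only the leftmost bit set. A scalar C sketch, with a hypothetical function name and IEEE-754 single precision assumed:

```c
#include <stdint.h>
#include <string.h>

/* Change the sign of f by flipping bit 31. */
float negate_float(float f)
{
    uint32_t u;
    memcpy(&u, &f, sizeof u);
    u ^= 0x80000000u;          /* same mask as all 1's shifted left by 31 */
    memcpy(&f, &u, sizeof f);
    return f;
}
```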
You can get the absolute value of a floating-point number by AND'ing out the sign bit:
; absolute value of four single-precision floats in XMM0
        PCMPEQD XMM1,XMM1               ; generate all 1's
        PSRLD   XMM1,1                  ; 1 in all but the leftmost bit of each DWORD
        ANDPS   XMM0,XMM1               ; set sign bits to 0
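The corresponding scalar operation is an AND with a mask that has every bit set except the leftmost one. A minimal C sketch, with a hypothetical function name and IEEE-754 single precision assumed:

```c
#include <stdint.h>
#include <string.h>

/* Absolute value of f by clearing bit 31. */
float abs_float(float f)
{
    uint32_t u;
    memcpy(&u, &f, sizeof u);
    u &= 0x7FFFFFFFu;          /* same mask as all 1's shifted right by 1 */
    memcpy(&f, &u, sizeof f);
    return f;
}
```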
You can extract the sign bit of a floating-point number:
;generate a bit-mask if single-precision floats in XMM0 are..
;negative or -0.0