- •Introduction
- •Assembly language syntax
- •Microprocessor versions covered by this manual
- •Getting started with optimization
- •Speed versus program clarity and security
- •Choice of programming language
- •Choice of algorithm
- •Memory model
- •Finding the hot spots
- •Literature
- •Optimizing in C++
- •Use optimization options
- •Identify the most critical parts of your code
- •Break dependence chains
- •Use local variables
- •Use array of structures rather than structure of arrays
- •Alignment of data
- •Division
- •Function calls
- •Conversion from floating-point numbers to integers
- •Character arrays versus string objects
- •Combining assembly and high level language
- •Inline assembly
- •Calling conventions
- •Data storage in C++
- •Register usage in 16 bit mode DOS or Windows
- •Register usage in 32 bit Windows
- •Register usage in Linux
- •Making compiler-independent code
- •Adding support for multiple compilers in .asm modules
- •Further compiler incompatibilities
- •Object file formats
- •Using MASM under Linux
- •Object oriented programming
- •Other high level languages
- •Debugging and verifying assembly code
- •Reducing code size
- •Detecting processor type
- •Checking for operating system support for XMM registers
- •Alignment
- •Cache
- •First time versus repeated execution
- •Out-of-order execution (PPro, P2, P3, P4)
- •Instructions are split into uops
- •Register renaming
- •Dependence chains
- •Branch prediction (all processors)
- •Prediction methods for conditional jumps
- •Branch prediction in P1
- •Branch prediction in PMMX, PPro, P2, and P3
- •Branch prediction in P4
- •Indirect jumps (all processors)
- •Returns (all processors except P1)
- •Static prediction
- •Close jumps
- •Avoiding jumps (all processors)
- •Optimizing for P1 and PMMX
- •Pairing integer instructions
- •Address generation interlock
- •Splitting complex instructions into simpler ones
- •Prefixes
- •Scheduling floating-point code
- •Optimizing for PPro, P2, and P3
- •The pipeline in PPro, P2 and P3
- •Register renaming
- •Register read stalls
- •Out-of-order execution
- •Retirement
- •Partial register stalls
- •Partial memory stalls
- •Bottlenecks in PPro, P2, P3
- •Optimizing for P4
- •Trace cache
- •Instruction decoding
- •Execution units
- •Do the floating-point and MMX units run at half speed?
- •Transfer of data between execution units
- •Retirement
- •Partial registers and partial flags
- •Partial memory access
- •Memory intermediates in dependencies
- •Breaking dependencies
- •Choosing the optimal instructions
- •Bottlenecks in P4
- •Loop optimization (all processors)
- •Loops in P1 and PMMX
- •Loops in PPro, P2, and P3
- •Loops in P4
- •Macro loops (all processors)
- •Single-Instruction-Multiple-Data programming
- •Problematic Instructions
- •XCHG (all processors)
- •Shifts and rotates (P4)
- •Rotates through carry (all processors)
- •String instructions (all processors)
- •Bit test (all processors)
- •Integer multiplication (all processors)
- •Division (all processors)
- •LEA instruction (all processors)
- •WAIT instruction (all processors)
- •FCOM + FSTSW AX (all processors)
- •FPREM (all processors)
- •FRNDINT (all processors)
- •FSCALE and exponential function (all processors)
- •FPTAN (all processors)
- •FSQRT (P3 and P4)
- •FLDCW (PPro, P2, P3, P4)
- •Bit scan (P1 and PMMX)
- •Special topics
- •Freeing floating-point registers (all processors)
- •Transitions between floating-point and MMX instructions (PMMX, P2, P3, P4)
- •Converting from floating-point to integer (All processors)
- •Using integer instructions for floating-point operations
- •Using floating-point instructions for integer operations
- •Moving blocks of data (All processors)
- •Self-modifying code (All processors)
- •Testing speed
- •List of instruction timings for P1 and PMMX
- •Integer instructions (P1 and PMMX)
- •Floating-point instructions (P1 and PMMX)
- •MMX instructions (PMMX)
- •List of instruction timings and uop breakdown for PPro, P2 and P3
- •Integer instructions (PPro, P2 and P3)
- •Floating-point instructions (PPro, P2 and P3)
- •MMX instructions (P2 and P3)
- •List of instruction timings and uop breakdown for P4
- •Integer instructions
- •Floating-point instructions
- •SIMD integer instructions
- •SIMD floating-point instructions
- •Comparison of the different microprocessors
10 First time versus repeated execution
A piece of code usually takes much more time the first time it is executed than when it is repeated. The reasons are the following:
1. Loading the code from RAM into the cache takes longer than executing it.
2. Any data accessed by the code have to be loaded into the data cache, which may take much more time than executing the instructions. The data are more likely to be in the cache when the code is repeated.
3. Jump instructions will not be in the branch target buffer the first time they execute, and are therefore less likely to be predicted correctly. See chapter 12 page 35.
4. In many processors, decoding the length of each instruction is a bottleneck. The P1 solves this problem by remembering the length of any instruction that has remained in the cache since the last time it was executed. As a consequence, two instructions will not pair in the P1 the first time they are executed, unless the first of the two instructions is only one byte long. The PMMX, PPro, P2 and P3 have no penalty on first-time decoding. On the P4, instructions go directly from the decoder to the execution unit the first time they are executed. On subsequent executions they are submitted from the trace cache at a rate of 3 uops per clock.
For these four reasons, a piece of code inside a loop will generally take much more time the first time it executes than the subsequent times.
If you have a big loop which doesn't fit into the trace cache or code cache then you will get penalties all the time because it doesn't run from the cache. You should therefore try to reorganize the loop to make it fit into the cache.
If you have very many jumps, calls, and branches inside a loop, then you may get the penalty of branch target buffer misses repeatedly.
Likewise, if a loop repeatedly accesses a data structure too big for the data cache, then you will get the penalty of data cache misses all the time.
11 Out-of-order execution (PPro, P2, P3, P4)
The most important improvement that came with the sixth generation of microprocessors is out-of-order execution. The idea is that if the execution of a particular instruction is delayed because its input data are not available yet, then the microprocessor will try to find later instructions that it can execute first, provided that the input data for those later instructions are ready. Obviously, the microprocessor has to check whether the later instructions need the output of the earlier instruction. If each instruction depends on the result of the preceding instruction, then we have no opportunities for out-of-order execution. The logic for determining input dependences and the mechanisms for executing instructions as soon as the necessary inputs are ready give us the further advantage that the microprocessor can do several things at the same time. If we need to do an addition and a multiplication, and neither instruction depends on the output of the other, then we can do both at the same time, because they use two different execution units. But we cannot do two multiplications at the same time, because both need the same execution unit.
Typically, everything in these microprocessors is highly pipelined in order to improve the throughput. If, for example, a floating-point addition takes 5 clock cycles, and the execution unit is fully pipelined, then we can start one addition at time T, which will be finished at time T+5, and start another addition at time T+1, which will be finished at time T+6. The advantage of this technology is therefore highest if we can organize our code so that there are as few dependencies as possible between successive instructions.
11.1 Instructions are split into uops
The microprocessors with out-of-order execution are translating all instructions into microoperations - abbreviated uops. A simple instruction such as ADD EAX,EBX generates only one uop, while an instruction like ADD EAX,[MEM1] generates two: one for reading from memory into a temporary (unnamed) register, and one for adding the contents of the temporary register to EAX. The instruction ADD [MEM1],EAX generates three uops: one for reading from memory, one for adding, and one for writing the result back to memory. The advantage of this is that the uops can be executed out of order. Example:
MOV EAX, [MEM1]
IMUL EAX, 5
ADD EAX, [MEM2]
MOV [MEM3], EAX
Here, the ADD EAX,[MEM2] instruction is split into two uops. The advantage of this is that the microprocessor can fetch the value of [MEM2] at the same time as it is doing the multiplication. If none of the data are in the cache, then the microprocessor will start to fetch [MEM2] immediately after starting to fetch [MEM1], and long before the multiplication can start. This principle also makes the stack work more efficiently. Consider the example:
PUSH EAX
CALL FUNC
The PUSH EAX instruction is split into two uops which can be represented as SUB ESP,4 and MOV [ESP],EAX. The advantage of this is that the SUB ESP,4 uop can be executed even if the value of EAX is not ready yet. The CALL operation needs the new value of ESP, so the CALL would have to wait for the value of EAX if the PUSH instruction was not split into uops. Thanks to the use of uops, the value of the stack pointer almost never causes delays in normal programs.
11.2 Register renaming
Consider the example:
MOV EAX, [MEM1]
IMUL EAX, 6
MOV [MEM2], EAX
MOV EAX, [MEM3]
ADD EAX, 2
MOV [MEM4], EAX
This piece of code is doing two things that have nothing to do with each other: multiplying [MEM1] by 6 and adding 2 to [MEM3]. If we were using a different register in the last three instructions, then the independence would be obvious. And, in fact, the microprocessor is smart enough to do just that: it uses a different temporary register in the last three instructions so that it can do the multiplication and the addition in parallel. The instruction set gives us only seven general-purpose 32-bit registers, and often we are using them all, so we cannot afford the luxury of using a new register for every calculation. But the microprocessor has more temporary registers to use. The PPro, P2 and P3 have 40 universal temporary registers, and the P4 has 128. The microprocessor can rename any of these temporary registers to represent a logical register such as EAX.
Register renaming works fully automatically and in a very simple way. Every time an instruction writes to or modifies a logical register, the microprocessor assigns a new temporary register to that logical register. The first instruction in the above example will assign one temporary register to EAX. The second instruction is putting a new value into EAX, so a new temporary register will be assigned here. In other words, the multiplication instruction will use two different registers, one for input and another one for output. The next example illustrates the advantage of this:
MOV EAX, [MEM1]
MOV EBX, [MEM2]
ADD EBX, EAX
IMUL EAX, 6
MOV [MEM3], EAX
MOV [MEM4], EBX
Assume, now, that [MEM1] is in the cache, while [MEM2] is not. This means that the multiplication can start before the addition. The advantage of using a new temporary register for the result of the multiplication is that we still have the old value of EAX, which has to be kept until EBX is ready for the addition. If we had used the same register for the input and output of the multiplication, then the multiplication would have to wait until the loading of EBX and the addition was finished.
After all the operations are finished, the value in the temporary register that represents the last value of EAX in the code sequence is written to a permanent EAX register. This process is called retirement (see page 71 and 88).
All general-purpose registers, the stack pointer, the flags, floating-point registers, MMX registers, XMM registers and segment registers can be renamed. The control word and the floating-point status word cannot be renamed, and this is the reason why the use of these registers is slow.
11.3 Dependence chains
A series of instructions where each instruction depends on the result of the preceding one is called a dependence chain. Long dependence chains should be avoided, if possible, because they prevent out-of-order and parallel execution. Example:
MOV EAX, [MEM1]
IMUL EAX, [MEM2]
IMUL EAX, [MEM3]
IMUL EAX, [MEM4]
MOV [MEM5], EAX
In this example, the IMUL instructions generate several uops each: one for reading from memory, and one or more for multiplying. The read uops can execute out of order, while the multiplication uops must wait for the previous uops to finish. This dependence chain takes quite a long time, because multiplication is slow. You can improve this code by breaking up the dependence chain. The trick is to use multiple accumulators:
MOV EAX, [MEM1]    ; start first chain
MOV EBX, [MEM2]    ; start other chain in different accumulator
IMUL EAX, [MEM3]
IMUL EBX, [MEM4]
IMUL EAX, EBX      ; join chains in the end
MOV [MEM5], EAX
Here, the second IMUL instruction can start before the first one is finished.
Floating-point instructions often have a longer latency than integer instructions, so you should definitely break up long dependence chains with floating-point instructions:
FLD [MEM1]         ; start first chain