- Introduction
- Assembly language syntax
- Microprocessor versions covered by this manual
- Getting started with optimization
- Speed versus program clarity and security
- Choice of programming language
- Choice of algorithm
- Memory model
- Finding the hot spots
- Literature
- Optimizing in C++
- Use optimization options
- Identify the most critical parts of your code
- Break dependence chains
- Use local variables
- Use array of structures rather than structure of arrays
- Alignment of data
- Division
- Function calls
- Conversion from floating-point numbers to integers
- Character arrays versus string objects
- Combining assembly and high level language
- Inline assembly
- Calling conventions
- Data storage in C++
- Register usage in 16 bit mode DOS or Windows
- Register usage in 32 bit Windows
- Register usage in Linux
- Making compiler-independent code
- Adding support for multiple compilers in .asm modules
- Further compiler incompatibilities
- Object file formats
- Using MASM under Linux
- Object oriented programming
- Other high level languages
- Debugging and verifying assembly code
- Reducing code size
- Detecting processor type
- Checking for operating system support for XMM registers
- Alignment
- Cache
- First time versus repeated execution
- Out-of-order execution (PPro, P2, P3, P4)
- Instructions are split into uops
- Register renaming
- Dependence chains
- Branch prediction (all processors)
- Prediction methods for conditional jumps
- Branch prediction in P1
- Branch prediction in PMMX, PPro, P2, and P3
- Branch prediction in P4
- Indirect jumps (all processors)
- Returns (all processors except P1)
- Static prediction
- Close jumps
- Avoiding jumps (all processors)
- Optimizing for P1 and PMMX
- Pairing integer instructions
- Address generation interlock
- Splitting complex instructions into simpler ones
- Prefixes
- Scheduling floating-point code
- Optimizing for PPro, P2, and P3
- The pipeline in PPro, P2 and P3
- Register renaming
- Register read stalls
- Out of order execution
- Retirement
- Partial register stalls
- Partial memory stalls
- Bottlenecks in PPro, P2, P3
- Optimizing for P4
- Trace cache
- Instruction decoding
- Execution units
- Do the floating-point and MMX units run at half speed?
- Transfer of data between execution units
- Retirement
- Partial registers and partial flags
- Partial memory access
- Memory intermediates in dependencies
- Breaking dependencies
- Choosing the optimal instructions
- Bottlenecks in P4
- Loop optimization (all processors)
- Loops in P1 and PMMX
- Loops in PPro, P2, and P3
- Loops in P4
- Macro loops (all processors)
- Single-Instruction-Multiple-Data programming
- Problematic Instructions
- XCHG (all processors)
- Shifts and rotates (P4)
- Rotates through carry (all processors)
- String instructions (all processors)
- Bit test (all processors)
- Integer multiplication (all processors)
- Division (all processors)
- LEA instruction (all processors)
- WAIT instruction (all processors)
- FCOM + FSTSW AX (all processors)
- FPREM (all processors)
- FRNDINT (all processors)
- FSCALE and exponential function (all processors)
- FPTAN (all processors)
- FSQRT (P3 and P4)
- FLDCW (PPro, P2, P3, P4)
- Bit scan (P1 and PMMX)
- Special topics
- Freeing floating-point registers (all processors)
- Transitions between floating-point and MMX instructions (PMMX, P2, P3, P4)
- Converting from floating-point to integer (All processors)
- Using integer instructions for floating-point operations
- Using floating-point instructions for integer operations
- Moving blocks of data (All processors)
- Self-modifying code (All processors)
- Testing speed
- List of instruction timings for P1 and PMMX
- Integer instructions (P1 and PMMX)
- Floating-point instructions (P1 and PMMX)
- MMX instructions (PMMX)
- List of instruction timings and uop breakdown for PPro, P2 and P3
- Integer instructions (PPro, P2 and P3)
- Floating-point instructions (PPro, P2 and P3)
- MMX instructions (P2 and P3)
- List of instruction timings and uop breakdown for P4
- Integer instructions
- Floating-point instructions
- SIMD integer instructions
- SIMD floating-point instructions
- Comparison of the different microprocessors
; define data members of class MyList:
MyList  STRUC
LENGTH_ DD      ?
BUFFER  DD      100 DUP (?)
MyList  ENDS

; define member function MyListAddItem with __cdecl calling method:
MyListAddItem PROC NEAR                    ; extern "C" friend (UNIX)
_MyListAddItem LABEL NEAR                  ; extern "C" friend (Windows)
@MyList@AddItem$qie LABEL NEAR             ; Borland
?AddItem@MyList@@QAAXHZZ LABEL NEAR        ; Microsoft
_AddItem__6MyListie LABEL NEAR             ; Gnu (Windows)
AddItem__6MyListie LABEL NEAR              ; Gnu (Redhat, Debian, BSD)
_ZN6MyList7AddItemEiz LABEL NEAR           ; Gnu (Mandrake, UNIX)
PUBLIC MyListAddItem, _MyListAddItem, @MyList@AddItem$qie
PUBLIC ?AddItem@MyList@@QAAXHZZ, _AddItem__6MyListie
PUBLIC AddItem__6MyListie, _ZN6MyList7AddItemEiz
        MOV     ECX, [ESP+4]               ; 'this'
        MOV     EAX, [ESP+8]               ; item
        MOV     EDX, [ECX].MyList.LENGTH_  ; length
        CMP     EDX, 100
        JNB     ADDITEM9
        MOV     [ECX+4*EDX].MyList.BUFFER, EAX
        ADD     EDX, 1
        MOV     [ECX].MyList.LENGTH_, EDX
ADDITEM9: RET
MyListAddItem ENDP
This method works for member functions and constructors. Destructors and overloaded operators cannot have a variable number of parameters. It is possible to explicitly specify the calling convention for member functions and overloaded operators on most compilers. The only compiler I have come across that doesn't allow this specification is Gnu for Mandrake, which uses the __cdecl convention for member functions and overloaded operators. Thus, you can make assembly code for overloaded operators compatible by specifying the __cdecl convention on all other compilers. Note that the mangled names are changed if you change the specified calling convention, even if the generated code is identical.
Member functions that return an object bigger than 8 bytes are not binary compatible for any calling method because the Microsoft compiler places the this pointer first, while other compilers place the return pointer first. In this case you may return the object through an explicit pointer parameter.
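The explicit pointer parameter can be sketched in C++ as follows. This is only an illustration under my own assumptions: the Stats structure and the GetStats function are hypothetical names that do not occur in the MyList example, and the member names length and buffer are chosen to mirror LENGTH_ and BUFFER. The point is that the caller supplies the destination, so there is no hidden return pointer whose position could differ between compilers.

```cpp
#include <cassert>

// Hypothetical result type of 16 bytes. Returning it by value would
// require a hidden return pointer, which is where compilers disagree.
struct Stats {
    int count;
    int sum;
    int min;
    int max;
};

class MyList {
public:
    MyList() : length(0) {}

    void AddItem(int item) {
        if (length < 100) buffer[length++] = item;
    }

    // Instead of "Stats GetStats();" the result is written through an
    // explicit pointer parameter that the caller provides.
    void GetStats(Stats* out) const {
        out->count = length;
        out->sum = 0;
        out->min = 0;
        out->max = 0;
        for (int i = 0; i < length; i++) {
            int v = buffer[i];
            out->sum += v;
            if (i == 0 || v < out->min) out->min = v;
            if (i == 0 || v > out->max) out->max = v;
        }
    }

private:
    int length;
    int buffer[100];
};

// Usage: the caller owns the storage for the result.
inline void demo_get_stats() {
    MyList list;
    list.AddItem(3);
    list.AddItem(7);
    Stats s;
    list.GetStats(&s);
    assert(s.count == 2 && s.sum == 10 && s.min == 3 && s.max == 7);
}
```

An assembly implementation of such a function sees the output pointer as an ordinary parameter in the position written in the declaration.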
You may want to use the friend function method described on page 18, or inline assembly, to avoid these problems and intricacies for member functions, constructors, destructors, and overloaded operators.
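For orientation, the C++ side of the friend function method might look roughly like the sketch below. It is not the declaration from page 18. The accessors Length and Item are hypothetical additions for testing, and the body of MyListAddItem is a C++ stand-in for the assembly code so that the example can be compiled on its own. The offsets in the comments assume 32-bit int and a class without virtual functions.

```cpp
#include <cassert>

class MyList;

// Plain function with C linkage, so its name is not mangled by the C++
// rules. In a real project only this declaration would appear in the C++
// source and the body would live in the .asm module.
extern "C" void MyListAddItem(MyList* p, int item);

class MyList {
public:
    MyList() : length(0) {}

    // Thin inline wrapper, so callers can still write list.AddItem(x).
    void AddItem(int item) { MyListAddItem(this, item); }

    // Hypothetical accessors, used here only to inspect the result.
    int Length() const { return length; }
    int Item(int i) const { return buffer[i]; }

    // friend gives the external function access to the private members.
    friend void MyListAddItem(MyList* p, int item);

private:
    int length;       // assumed offset 0, corresponds to LENGTH_
    int buffer[100];  // assumed offset 4, corresponds to BUFFER
};

// C++ stand-in for the assembly body. The unsigned comparison mirrors
// the CMP / JNB pair in the assembly listing.
extern "C" void MyListAddItem(MyList* p, int item) {
    if (static_cast<unsigned>(p->length) < 100u) {
        p->buffer[p->length] = item;
        p->length += 1;
    }
}

// Usage: items beyond the capacity of 100 are silently ignored.
inline void demo_friend_wrapper() {
    MyList list;
    list.AddItem(42);
    assert(list.Length() == 1 && list.Item(0) == 42);
}
```

A stand-in like this is also handy for verifying the assembly version: run both on the same input and compare the resulting objects.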
4.9 Further compiler incompatibilities
There are still incompatibilities that cannot be handled with the methods described in the preceding chapters. Do not expect your assembly code to be compatible with multiple compilers if it contains __fastcall functions, new, delete, global objects with constructors or destructors, or exception handling. The ways of name-mangling standard library functions and system functions may also differ.
4.10 Object file formats
Another compatibility problem stems from differences in the formats of object files. Borland, Symantec and 16-bit Microsoft compilers use the OMF format for object files. Microsoft and Gnu compilers for 32-bit Windows use MS-COFF format, also called PE. The Gnu compiler under Linux, BSD, and similar systems prefers ELF format.
The MASM assembler can produce both OMF and MS-COFF format object files, but not ELF format. It is often possible to translate object files from one format to another. The linker and library manager that come with Microsoft compilers can translate object files from OMF to MS-COFF format. A freeware utility called EMXAOUT can translate object files from OMF format to the old a.out format that many Gnu linkers accept. The Gnu objcopy utility is a more versatile tool for converting object formats. Which object formats it can handle depends on the build options. The version of objcopy that comes with the MingW32 package can convert between MS-COFF format and ELF format (Download binutils.xxx.tar.gz from www.mingw.org). With this utility, it is possible to use the same assembly module with several different compilers under several different operating systems.
More details about object file formats can be found in the book "Linkers and Loaders" by J. R. Levine (Morgan Kaufmann Publ. 2000).
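When object files from several tools end up in the same directory, it can be useful to check which format a given file has. The following sketch guesses the format from the first bytes of the file. It is a heuristic of my own and not part of any of the tools mentioned above. The function name GuessObjFormat and the enum are hypothetical. The byte values are the ELF signature, the machine code for 32-bit x86 in the COFF header, and the record type of the THEADR record that normally begins an OMF module.

```cpp
#include <cassert>
#include <cstddef>

enum ObjFormat { FMT_UNKNOWN, FMT_OMF, FMT_COFF_I386, FMT_ELF };

// Guess the object file format from the first bytes of a file image.
// Heuristic only: a short or unusual file may be classified wrongly.
inline ObjFormat GuessObjFormat(const unsigned char* data, std::size_t size) {
    // ELF files start with the bytes 0x7F 'E' 'L' 'F'.
    if (size >= 4 && data[0] == 0x7F && data[1] == 'E' &&
        data[2] == 'L' && data[3] == 'F') {
        return FMT_ELF;
    }
    // COFF for 32-bit x86 starts with the 16-bit machine field 0x014C,
    // stored in little-endian byte order.
    if (size >= 2 && data[0] == 0x4C && data[1] == 0x01) {
        return FMT_COFF_I386;
    }
    // An OMF object module normally starts with a THEADR record, type 0x80.
    if (size >= 1 && data[0] == 0x80) {
        return FMT_OMF;
    }
    return FMT_UNKNOWN;
}

// Usage: classify a buffer holding the beginning of a file.
inline void demo_guess_format() {
    const unsigned char header[] = { 0x7F, 'E', 'L', 'F', 1, 1, 1, 0 };
    assert(GuessObjFormat(header, sizeof header) == FMT_ELF);
}
```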
4.11 Using MASM under Linux
The Gnu assembler that comes with Linux, BSD and similar operating systems uses the terrible AT&T syntax. As I am in favor of standardizing assembly syntax, I will recommend using the MASM assembler under Linux.
Available tools for converting assembly files in MASM syntax to AT&T syntax are not reliable, but tools for converting object files seem to work well. You may assemble your code with MASM under Windows or under Linux using Wine, the Windows emulator. Let MASM generate object files in MS-COFF format and convert them to ELF format using the objcopy utility from MingW32 mentioned above. The object files in ELF format can then be linked into a C++ project using g++ or ld. Under Linux, the process goes like this:
wine -- ml.exe /c /Cx /coff myfile.asm
wine -- objcopy.exe -Oelf32-i386 myfile.obj myfile.o
g++ somefile.cpp myfile.o
You may include this sequence in a make script or shell script. It may seem awkward to use the MingW32 version of objcopy, which needs Wine to run under Linux. It is probably possible to rebuild the native objcopy utility to add support for the MS-COFF format (called pe-i386 in this context), but I haven't figured out how.
If you have leading underscores on your function names then add the option --remove-leading-char to the objcopy command line.
If you want to build a function library that can be used under several different operating systems, then make a .lib file under Windows using the lib.exe utility that comes with Microsoft compilers and convert the .lib file to ELF format with the command
objcopy -Oelf32-i386 --remove-leading-char myfile.lib myfile.a
The library myfile.a can then be used under Linux, BSD, UNIX, etc. Make sure your library doesn't call any system functions.
4.12 Object oriented programming
As explained on page 5, object oriented programming principles may be required for the sake of clarity and maintainability of a software project. Object oriented programming means classes containing member data (properties) and member functions (methods). You may expect this extra complexity to slow down program performance, but this is not necessarily the case.
Well-designed C++ member functions are expected to access no other data than their member data and parameters. The function parameters are stored on the stack, and so are the member data if the object is declared locally (automatic) inside some other function. This means that all the data can be kept together within a small area of memory so that data caching will be very efficient. A further advantage is that you avoid using 32-bit addresses for global data. This makes the code more compact so that it takes less space in the code cache or trace cache.
The disadvantage of using member functions, in terms of performance, is that the this pointer has to be transferred to the member function as an extra parameter. The register that is used for the this pointer might otherwise have been used for other purposes. These disadvantages may outweigh the advantages for small member functions, but not for bigger member functions.
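The role of the this pointer can be made visible by writing the same operation twice: once as a member function and once as an ordinary function that takes the object address as an explicit first parameter. The class Counter and the function Counter_Add are hypothetical names used only for this sketch. The second form is roughly what an assembly implementation of a member function sees.

```cpp
#include <cassert>

class Counter {
public:
    Counter() : total(0) {}

    // Member function: 'this' is passed implicitly as an extra parameter.
    void Add(int x) { total += x; }

    int total;
};

// The same operation as an ordinary function. The object address is now
// an explicit first parameter, which is what 'this' amounts to.
inline void Counter_Add(Counter* self, int x) { self->total += x; }

// Usage: both objects are automatic (local), so their member data live on
// the stack next to the other local variables of this function.
inline int demo_this_pointer() {
    Counter a;
    Counter b;
    a.Add(5);             // implicit this == &a
    Counter_Add(&b, 5);   // explicit pointer
    assert(a.total == b.total);
    return a.total;
}
```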
It is not necessary to translate all the member functions of a C++ class to assembly language. It is possible to make the most critical member function in assembly and leave the rest of the member functions, including constructor and destructor, in C++.
Different compilers for the x86 platform are not compatible on object-oriented code. As explained on page 17ff, you have the choice between several different strategies for overcoming this problem:
1. use inline assembly inside C++ code
2. make assembly modules containing simple functions rather than member functions
3. make compiler-independent code using friend functions, etc.
4. make compiler-specific code and add support for several compilers in the assembly module if necessary
5. make the entire program in assembly (not recommended except for extremely simple programs)
4.13 Other high level languages
If you are using a high-level language other than C++, and the compiler manual has no information on how to link with assembly, then see if the manual has any information on how to link with C or C++ modules. You can probably find out how to link with assembly from this information.
In general, it is preferred to use simple functions without name mangling, compatible with the extern "C" and __cdecl or __stdcall conventions in C++. This will work with most compiled languages. Arrays and strings may be implemented differently in different languages.
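As a sketch of what such a simple function interface looks like from the C++ side, the two functions below use only plain types. The names SumInts and CountChar are hypothetical. The bodies are written in C++ here so that the example can be compiled on its own; in practice they would be the assembly code. The calling convention is left at the compiler default, which I assume to be __cdecl, because the keyword for specifying it explicitly differs between compilers.

```cpp
#include <cassert>

// extern "C" suppresses C++ name mangling, so the function can be found
// by linkers and other languages under a simple name.
// An array is passed as a pointer to the first element plus an explicit
// count, because other languages may store array lengths differently.
extern "C" int SumInts(const int* data, int count) {
    int sum = 0;
    for (int i = 0; i < count; i++) sum += data[i];
    return sum;
}

// A string is passed as a pointer to a zero-terminated character array
// rather than as a language-specific string object.
extern "C" int CountChar(const char* text, char c) {
    int n = 0;
    for (; *text != 0; text++) {
        if (*text == c) n++;
    }
    return n;
}

// Usage from C++.
inline void demo_simple_interface() {
    int values[4] = { 1, 2, 3, 4 };
    assert(SumInts(values, 4) == 10);
    assert(CountChar("assembly", 's') == 2);
}
```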
Many modern programming languages such as C# and Visual Basic .NET cannot link to .obj and .lib files. You have to encapsulate your assembly code in a dynamic link library (DLL) in order to be able to call it from these languages. Instructions on how to make a DLL from assembly code can be found in Iczelion's tutorial.
To call assembly code from Java, you have to compile the code to a DLL and use the Java Native Interface (JNI).