- •Introduction
- •Assembly language syntax
- •Microprocessor versions covered by this manual
- •Getting started with optimization
- •Speed versus program clarity and security
- •Choice of programming language
- •Choice of algorithm
- •Memory model
- •Finding the hot spots
- •Literature
- •Optimizing in C++
- •Use optimization options
- •Identify the most critical parts of your code
- •Break dependence chains
- •Use local variables
- •Use array of structures rather than structure of arrays
- •Alignment of data
- •Division
- •Function calls
- •Conversion from floating-point numbers to integers
- •Character arrays versus string objects
- •Combining assembly and high level language
- •Inline assembly
- •Calling conventions
- •Data storage in C++
- •Register usage in 16 bit mode DOS or Windows
- •Register usage in 32 bit Windows
- •Register usage in Linux
- •Making compiler-independent code
- •Adding support for multiple compilers in .asm modules
- •Further compiler incompatibilities
- •Object file formats
- •Using MASM under Linux
- •Object oriented programming
- •Other high level languages
- •Debugging and verifying assembly code
- •Reducing code size
- •Detecting processor type
- •Checking for operating system support for XMM registers
- •Alignment
- •Cache
- •First time versus repeated execution
- •Out-of-order execution (PPro, P2, P3, P4)
- •Instructions are split into uops
- •Register renaming
- •Dependence chains
- •Branch prediction (all processors)
- •Prediction methods for conditional jumps
- •Branch prediction in P1
- •Branch prediction in PMMX, PPro, P2, and P3
- •Branch prediction in P4
- •Indirect jumps (all processors)
- •Returns (all processors except P1)
- •Static prediction
- •Close jumps
- •Avoiding jumps (all processors)
- •Optimizing for P1 and PMMX
- •Pairing integer instructions
- •Address generation interlock
- •Splitting complex instructions into simpler ones
- •Prefixes
- •Scheduling floating-point code
- •Optimizing for PPro, P2, and P3
- •The pipeline in PPro, P2 and P3
- •Register renaming
- •Register read stalls
- •Out of order execution
- •Retirement
- •Partial register stalls
- •Partial memory stalls
- •Bottlenecks in PPro, P2, P3
- •Optimizing for P4
- •Trace cache
- •Instruction decoding
- •Execution units
- •Do the floating-point and MMX units run at half speed?
- •Transfer of data between execution units
- •Retirement
- •Partial registers and partial flags
- •Partial memory access
- •Memory intermediates in dependencies
- •Breaking dependencies
- •Choosing the optimal instructions
- •Bottlenecks in P4
- •Loop optimization (all processors)
- •Loops in P1 and PMMX
- •Loops in PPro, P2, and P3
- •Loops in P4
- •Macro loops (all processors)
- •Single-Instruction-Multiple-Data programming
- •Problematic Instructions
- •XCHG (all processors)
- •Shifts and rotates (P4)
- •Rotates through carry (all processors)
- •String instructions (all processors)
- •Bit test (all processors)
- •Integer multiplication (all processors)
- •Division (all processors)
- •LEA instruction (all processors)
- •WAIT instruction (all processors)
- •FCOM + FSTSW AX (all processors)
- •FPREM (all processors)
- •FRNDINT (all processors)
- •FSCALE and exponential function (all processors)
- •FPTAN (all processors)
- •FSQRT (P3 and P4)
- •FLDCW (PPro, P2, P3, P4)
- •Bit scan (P1 and PMMX)
- •Special topics
- •Freeing floating-point registers (all processors)
- •Transitions between floating-point and MMX instructions (PMMX, P2, P3, P4)
- •Converting from floating-point to integer (All processors)
- •Using integer instructions for floating-point operations
- •Using floating-point instructions for integer operations
- •Moving blocks of data (All processors)
- •Self-modifying code (All processors)
- •Testing speed
- •List of instruction timings for P1 and PMMX
- •Integer instructions (P1 and PMMX)
- •Floating-point instructions (P1 and PMMX)
- •MMX instructions (PMMX)
- •List of instruction timings and uop breakdown for PPro, P2 and P3
- •Integer instructions (PPro, P2 and P3)
- •Floating-point instructions (PPro, P2 and P3)
- •MMX instructions (P2 and P3)
- •List of instruction timings and uop breakdown for P4
- •Integer instructions
- •Floating-point instructions
- •SIMD integer instructions
- •SIMD floating-point instructions
- •Comparison of the different microprocessors
The name Celeron applies to Pentium II and later models with less cache than the standard versions. The name Xeon applies to Pentium II and later models with more cache than the standard versions.
The P1 and PMMX processors represent the fifth generation in the Intel x86 series of microprocessors, and their processor kernels are very similar. PPro, P2 and P3 all have the sixth generation kernel. These three processors are almost identical except that new instructions are added to each new model. P4 is the first processor in the seventh generation which, for obscure reasons, is not called seventh generation in Intel documents. Quite unexpectedly, the generation number returned by the CPUID instruction in the P4 is not 7 but 15. The reader should be aware that the 5th, 6th and 7th generation microprocessors behave very differently. What is optimal for one generation may not be optimal for the others.
2 Getting started with optimization
2.1 Speed versus program clarity and security
Current trends in software technology go in the direction of ever more abstract and high-level programming techniques and languages. The motivations behind this trend are: faster development, easier maintenance, and safer code. A typical programmer spends more time finding errors and making additions and modifications than on writing new code. Therefore, most software is written in high-level languages that are easier to document and maintain. The flip side of the coin is that the code gets slower and the demands on hardware performance get bigger and bigger, as the ever more complex intermediate layers separate the programmer's code from the hardware. Large runtime modules, emulators and virtual machines consume large amounts of hard disk space and RAM. The result is that the programs take a long time to install, a long time to load, and a long time to execute.
At the opposite extreme, we have assembly language, which produces very compact and fast code, but is very difficult to debug and maintain and is very vulnerable to programming errors.
A good compromise is provided by the C++ programming language. C++ has all the advanced features of a high-level language, but it has also inherited the low-level features of the old C language. You can use the most advanced high-level programming techniques in most of your software project for reasons of maintainability and security, and still have access to low-level techniques in the innermost loop where speed is critical.
The security problems of low-level programming should not be ignored, however. Many of the software crashes and security holes that plague contemporary software are due to the unsafe features that C++ has inherited from C, such as absence of array bounds checking, uninitialized pointers, pointer arithmetic, pointer type casting, and memory leaks. Some programmers prefer to use other programming languages to avoid these problems, but most of the security problems in C++ can be avoided by using safer programming techniques. Some good advice for safe C++ programming:
•use references rather than pointers
•use string objects rather than character arrays
•use container classes rather than arrays (useful container classes are provided in the standard template library)
•avoid dynamic memory allocation (new, delete) except in well-tested container classes
•avoid functions that write to parameters through void pointers or variable argument lists, such as memcpy and scanf
•encapsulate everything in classes with well-defined interfaces and responsibilities
•use systematic testing methods
You may deviate from this advice in critical parts of the code where speed is important, but make sure the unsafe code is limited to well-tested functions or modules with a well-defined interface to the rest of the program.
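A few of the points in the list above can be illustrated with a minimal sketch. The function names (average, make_label, checked_get) are hypothetical and chosen only for illustration, and the code assumes a C++11 or later compiler.

```cpp
#include <cstddef>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical helpers, not taken from the manual.

// Reference rather than pointer: the caller cannot hand over a null or
// uninitialized pointer, and the function cannot accidentally free the data.
double average(const std::vector<double>& samples) {
    if (samples.empty()) {
        throw std::invalid_argument("average of empty sequence");
    }
    double sum = 0.0;
    for (double s : samples) {
        sum += s;
    }
    return sum / static_cast<double>(samples.size());
}

// String object rather than character array: the string grows as needed,
// so there is no fixed-size buffer that could be overrun.
std::string make_label(const std::string& name, int index) {
    return name + "[" + std::to_string(index) + "]";
}

// Container class rather than raw array: at() checks the index and throws
// std::out_of_range instead of reading past the end of the storage.
double checked_get(const std::vector<double>& samples, std::size_t i) {
    return samples.at(i);
}
```

The bounds check in at() costs a compare and a branch on each access, which is the kind of overhead one might choose to remove in a well-tested inner loop by using operator[] instead.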
Assembly language is, of course, even more unsafe and difficult to maintain. Assembly language should therefore be used only in the most critical part of your program, and only if it provides a significant improvement in speed. The assembly code should be confined to a well-tested function, module or library with a well-defined interface to the calling program.
2.2 Choice of programming language
Before starting a new software project, you have to decide which programming language to use. Low-level languages are good for optimizing execution speed or program size, while high-level languages are good for making clear and well-structured code.
Today, most universities teach Java as the first programming language for pedagogical reasons. The advantages of Java are that it is consistent, well structured, and portable. But it is not fast, because in most cases it runs on a Java virtual machine that interprets code rather than executing it directly. If execution speed is important, then the best choice will be C++. This language has the best of both worlds. The C++ language has more features and options than most other programming languages. Advanced features like inheritance, polymorphism, macros, template libraries and exception handling enable you to make well-structured and reusable code at a high level of abstraction. On the other hand, the C++ language is a superset of the old C language, which gives you access to fiddle with every bit and byte and to use low-level programming techniques.
C++ is definitely the language of choice if you want to make part of your project in assembly language. C++ has excellent features for integrating with assembly:
•C++ links easily with assembly modules
•C++ uses simple data structures that are also available in assembly
•most C++ compilers can translate from C++ to assembly
•most C++ compilers support inline assembly and direct access to registers and flags
•some C++ compilers have "intrinsic functions" that translate directly to XMM instructions
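As an illustration of the last point in the list above, the following sketch adds two float arrays using the SSE intrinsics declared in xmmintrin.h, which operate on the XMM registers. The function name add_arrays and the macro HAVE_SSE_SKETCH are hypothetical. The preprocessor test for SSE support is an assumption about common compilers, and the code falls back to a plain scalar loop where that test fails.

```cpp
#include <cstddef>

// Assumed feature test: __SSE__ is set by GCC and Clang, the _M_ macros by
// Microsoft compilers. HAVE_SSE_SKETCH is a hypothetical name.
#if defined(__SSE__) || defined(_M_X64) || (defined(_M_IX86_FP) && _M_IX86_FP >= 1)
#include <xmmintrin.h>   // SSE intrinsics for the 128-bit XMM registers
#define HAVE_SSE_SKETCH 1
#else
#define HAVE_SSE_SKETCH 0
#endif

// Computes out[i] = a[i] + b[i] for i in [0, n).
void add_arrays(const float* a, const float* b, float* out, std::size_t n) {
    std::size_t i = 0;
#if HAVE_SSE_SKETCH
    // Four floats per iteration. The unaligned load and store variants are
    // used so that the arrays need no particular alignment.
    for (; i + 4 <= n; i += 4) {
        __m128 va = _mm_loadu_ps(a + i);
        __m128 vb = _mm_loadu_ps(b + i);
        _mm_storeu_ps(out + i, _mm_add_ps(va, vb));  // packed single-precision add
    }
#endif
    // Scalar loop for the remaining elements, and for targets without SSE.
    for (; i < n; ++i) {
        out[i] = a[i] + b[i];
    }
}
```

Each intrinsic call normally maps to one instruction, so the compiler still handles register allocation and scheduling while the programmer chooses the instructions.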
It is possible to call assembly language modules from other compiled languages such as Pascal, Fortran, Basic and C#, but this is usually more complicated than with C++. Strings and arrays may have to be translated to the appropriate format, and it may be necessary to encapsulate the assembly module into a dynamic link library (DLL).
Combining assembly code with Java is even more difficult because Java is usually not compiled into executable code but into an intermediate code that runs on a Java virtual machine.
See page 23 for details on how to call assembly language modules from various high-level languages.
2.3 Choice of algorithm
The first thing to do when you want to optimize a piece of software is to find the best algorithm. Optimizing a poor algorithm is a waste of time. So don't even think of converting your code to assembly before you have explored all possibilities for optimizing your algorithm and the implementation of your algorithm.
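A small sketch may make the point concrete. Both functions below answer the same question, whether a list contains a repeated value, but the first compares every pair of elements while the second sorts a copy and compares neighbors. The task and the function names are hypothetical, chosen only for illustration.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Quadratic version: about n*n/2 comparisons in the worst case.
bool has_duplicate_quadratic(const std::vector<int>& v) {
    for (std::size_t i = 0; i < v.size(); ++i) {
        for (std::size_t j = i + 1; j < v.size(); ++j) {
            if (v[i] == v[j]) {
                return true;
            }
        }
    }
    return false;
}

// Sort-based version: on the order of n*log(n) comparisons. The vector is
// taken by value so that the caller's data is left unsorted.
bool has_duplicate_sorted(std::vector<int> v) {
    std::sort(v.begin(), v.end());
    // After sorting, equal values are adjacent.
    return std::adjacent_find(v.begin(), v.end()) != v.end();
}
```

For a million elements the first version needs on the order of 10^11 comparisons and the second on the order of 10^7. No amount of hand-tuning the inner loop of the first version is likely to close a gap of that size.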
2.4 Memory model
The Pentiums are designed primarily for 32-bit code, and performance is inferior on 16-bit code. Segmenting your code and data also degrades performance significantly, so you should generally prefer 32-bit flat mode, and an operating system that supports this mode. The code examples shown in this manual assume a 32-bit flat memory model, unless otherwise specified.
2.5 Finding the hot spots
Before you try to optimize anything, you have to identify the critical parts of your program. Often, more than 99% of the CPU time is spent in the innermost loop of a program. If this is the case then you should isolate this hot spot in a separate subroutine that you can optimize for speed, while the rest of your program can be optimized for clarity and maintainability.
You may translate the critical subroutine to assembly and leave everything else in a high-level language. Many assembly programmers waste a lot of energy optimizing the wrong parts of their programs. There are even people who make entire Windows programs in assembly.
Most of the code in a typical program goes to the user interface and to calling system routines. A user interface with menus and dialog boxes is certainly not something that is being executed a thousand times per second. People who try to optimize something like this in assembly may be spending hours - or more likely months - making the program respond ten nanoseconds faster to a mouse click on a system where the screen is refreshed 60 times per second. There are certainly better ways of investing your programming skills! The same applies to program sections that consist mainly of calls to system routines. Such calls are usually well optimized by C++ compilers and there is no reason to use assembly language here.
Assembly language should be used only for loops that are executed so many times that it really matters in terms of CPU time, and that is very many. A 2 GHz Pentium 4 can do 6·10⁹ integer additions per second. So it is probably not worth the effort to optimize a loop that makes "only" one million integer operations. It will suffice to change from Java to C++.
Typical applications where assembly language can be useful for optimizing speed include processing of sound and images, compression and encryption of large amounts of data, simulation of complex systems, and mathematical calculations that involve iteration. The speed of such applications can sometimes be increased manyfold by conversion to assembly.
Assembly language is also useful when optimizing code for size. This is typically used in embedded systems where a piece of code has to fit into a ROM or flash RAM. Using assembly language for optimizing an application program for size is not worth the effort because data storage is so cheap.
If it is not obvious where the critical parts of your program are, then you may use a profiler to find them. If it turns out that the bottleneck is disk access, then you may modify your program to make disk access sequential in order to improve disk caching, rather than turning to assembly programming. If the bottleneck is graphics output, then you may look for a way of reducing the number of calls to graphics procedures or for a better graphics library.
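Where no profiler is at hand, a rough first measurement can be made by timing a suspected hot spot directly. The following is a minimal sketch, assuming a C++11 compiler with the standard chrono library. The names elapsed_ms and sum_to are hypothetical, and sum_to merely stands in for the code under test.

```cpp
#include <chrono>
#include <cstdint>

// Runs work() once and returns the elapsed wall-clock time in milliseconds.
template <typename Work>
double elapsed_ms(Work work) {
    const auto start = std::chrono::steady_clock::now();
    work();
    const auto stop = std::chrono::steady_clock::now();
    return std::chrono::duration<double, std::milli>(stop - start).count();
}

// Stand-in for a candidate hot spot: adds the integers 1..n.
std::uint64_t sum_to(std::uint64_t n) {
    std::uint64_t total = 0;
    for (std::uint64_t i = 1; i <= n; ++i) {
        total += i;
    }
    return total;
}
```

A call such as elapsed_ms([&] { result = sum_to(1000000); }) times one run of the loop. The result should be stored and used afterwards, since an optimizing compiler may otherwise remove the loop entirely, and a single run is noisy, so repeating the measurement and taking the smallest value usually gives a more stable figure.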
Some high-level language compilers offer relatively good optimization for specific processors, but in most cases further optimization by hand can make the code much better. When the possibilities for optimizing in C++ have been exhausted, you can make your C++ compiler translate the critical subroutine to assembly, and do further optimizations by hand.