Fog A.How to optimize for the Pentium family of microprocessors.2004.pdf

    INC  EAX                     ; 1 byte
    MOV  [EBX+mem3-mem2],EAX     ; 3 bytes
    SUB  EAX,101                 ; 3 bytes
    LEA  EBX,[EAX+50]            ; 3 bytes

You should also consider that different instructions have different lengths. The following instructions take only one byte and are therefore very attractive: PUSH reg, POP reg, INC reg32, DEC reg32. INC and DEC with 8-bit registers take 2 bytes, so INC EAX is shorter than INC AL.

XCHG EAX,reg is also a single-byte instruction and thus takes less space than MOV EAX,reg, but it is slower.

Some instructions take one byte less when they use the accumulator than when they use any other register. Examples:

MOV EAX,DS:[100000]   is smaller than   MOV EBX,DS:[100000]

ADD EAX,1000   is smaller than   ADD EBX,1000

Instructions with pointers take one byte less when they have only a base pointer (not ESP) and a displacement than when they have a scaled index register, or both base pointer and index register, or ESP as base pointer. Examples:

MOV EAX,[array][EBX]   is smaller than   MOV EAX,[array][EBX*4]

MOV EAX,[EBP+12]   is smaller than   MOV EAX,[ESP+12]

Instructions with EBP as base pointer and no displacement and no index take one byte more than with other registers:

MOV EAX,[EBX]   is smaller than   MOV EAX,[EBP], but

MOV EAX,[EBX+4]   is the same size as   MOV EAX,[EBP+4].

Instructions with a scaled index pointer and no base pointer must have a four-byte displacement, even when it is 0:

LEA EAX,[EBX+EBX] is shorter than LEA EAX,[2*EBX].
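The addressing rules above can be summarized in a few lines of C. This is only a sketch under stated assumptions: the names address_bytes, R_EAX ... R_EDI and R_NONE are hypothetical and do not come from any library, only 32-bit addressing is modeled, and only the address part of the instruction (ModRM byte, optional SIB byte and displacement) is counted. Opcode bytes, prefixes, immediate data and the short accumulator forms are left out. The scale factor is not a parameter because it does not change the length.

```c
#include <assert.h>

/* Register numbers as used in the ModRM/SIB encoding.
   R_NONE is a hypothetical marker meaning "register not present". */
enum { R_EAX, R_ECX, R_EDX, R_EBX, R_ESP, R_EBP, R_ESI, R_EDI, R_NONE = -1 };

/* Number of bytes used by the address part of an instruction
   (ModRM + optional SIB + displacement) in 32-bit addressing mode. */
int address_bytes(int base, int index, long disp)
{
    int bytes = 1;                        /* ModRM byte is always present    */

    assert(index != R_ESP);               /* ESP cannot be an index register */

    if (index != R_NONE || base == R_ESP)
        bytes += 1;                       /* SIB byte                        */

    if (base == R_NONE)
        bytes += 4;                       /* no base: always a 4-byte disp   */
    else if (disp == 0 && base != R_EBP)
        bytes += 0;                       /* no displacement needed          */
    else if (disp >= -128 && disp <= 127)
        bytes += 1;                       /* 1-byte disp, also [EBP] with 0  */
    else
        bytes += 4;                       /* 4-byte displacement             */

    return bytes;
}
```

With this model, address_bytes(R_EBP, R_NONE, 12) gives 2 and address_bytes(R_ESP, R_NONE, 12) gives 3, while [EBX+EBX] needs 2 bytes and [2*EBX] needs 6.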

7 Detecting processor type

What is optimal for one microprocessor may not be optimal for another. You may therefore write several versions of the most critical part of your program, each optimized for a specific microprocessor, and select the appropriate version at run time after detecting which microprocessor the program is running on. The CPUID instruction tells which instructions the microprocessor supports. If you are using instructions that are not supported by all microprocessors, then you must first check that the program is running on a microprocessor that supports these instructions. If your program can benefit significantly from Single-Instruction-Multiple-Data (SIMD) instructions, then you may make one version of the critical part that uses these instructions, and another version that does not and that is compatible with old microprocessors.
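As a rough illustration of how the CPUID result can be interpreted, the following C sketch maps the feature flags that CPUID function 1 returns in EDX to a simple instruction set level. The bit positions (23 = MMX, 24 = FXSAVE/FXRSTOR, 25 = SSE, 26 = SSE2) are those documented for CPUID. The names isa_level, isa_level_from_edx and isa_level_name are hypothetical, and the EDX value is passed in as a parameter because the way of executing CPUID (inline assembly, a compiler intrinsic or a library call) depends on the compiler. The sketch assumes that the levels are strictly ordered, which is a simplification.

```c
#include <assert.h>
#include <stdint.h>

/* Feature bits returned in EDX by CPUID function 1 */
#define CPUID_EDX_MMX   (1u << 23)
#define CPUID_EDX_FXSR  (1u << 24)    /* FXSAVE and FXRSTOR supported */
#define CPUID_EDX_SSE   (1u << 25)
#define CPUID_EDX_SSE2  (1u << 26)

/* Hypothetical ordering of instruction set levels, lowest first */
typedef enum { ISA_GENERIC = 0, ISA_MMX, ISA_SSE, ISA_SSE2 } isa_level;

/* Highest level the processor reports. This says nothing about
   whether the operating system has enabled the XMM registers. */
isa_level isa_level_from_edx(uint32_t edx)
{
    if (!(edx & CPUID_EDX_MMX))
        return ISA_GENERIC;
    if (!(edx & CPUID_EDX_FXSR) || !(edx & CPUID_EDX_SSE))
        return ISA_MMX;
    if (!(edx & CPUID_EDX_SSE2))
        return ISA_SSE;
    return ISA_SSE2;
}

/* Printable name of a level, e.g. for a diagnostic message */
const char *isa_level_name(isa_level level)
{
    static const char *const names[] = { "generic", "MMX", "SSE", "SSE2" };
    assert((unsigned)level < sizeof names / sizeof names[0]);
    return names[level];
}
```

The returned level describes only what the processor implements. A program that wants to use the SSE or SSE2 versions must also make sure that the operating system supports the XMM registers.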

I have provided a library of subroutines that check the processor type and determine which instructions are supported. This can be downloaded from www.agner.org/assem/asmlib.zip. These subroutines can be called from assembly as well as from high-level languages.

Obviously, it is recommended to store the output from such a subroutine rather than calling it again each time the information is needed.
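One possible way of storing the result is to keep a function pointer that is set on the first call and reused afterwards. The C sketch below is only an illustration: add_arrays, add_arrays_generic, add_arrays_simd, detect_simd_support and detect_calls are hypothetical names, the detection function is a stand-in that always answers "not supported", and the "SIMD" version is ordinary C so that the example runs on any machine. In a real program the stand-in would be replaced by a proper check of both the processor and the operating system.

```c
#include <assert.h>
#include <stddef.h>

typedef void (*add_arrays_fn)(float *dst, const float *a,
                              const float *b, size_t n);

/* Version that runs on any processor */
void add_arrays_generic(float *dst, const float *a, const float *b, size_t n)
{
    for (size_t i = 0; i < n; i++)
        dst[i] = a[i] + b[i];
}

/* Placeholder for a version written with SIMD instructions */
void add_arrays_simd(float *dst, const float *a, const float *b, size_t n)
{
    add_arrays_generic(dst, a, b, n);
}

/* Hypothetical stand-in for the real detection routine.
   detect_calls only counts how often the check is performed. */
int detect_calls = 0;
int detect_simd_support(void)
{
    detect_calls++;
    return 0;                       /* assumption: pretend SIMD is absent */
}

/* Entry point: detects once, then reuses the stored choice */
void add_arrays(float *dst, const float *a, const float *b, size_t n)
{
    static add_arrays_fn selected = NULL;

    assert(dst != NULL && a != NULL && b != NULL);
    if (selected == NULL)           /* true on the first call only */
        selected = detect_simd_support() ? add_arrays_simd
                                         : add_arrays_generic;
    selected(dst, a, b, n);
}
```

The callers use add_arrays only and never see which version was chosen. The sketch is not thread-safe; a multithreaded program would perform the selection during initialization.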

For assemblers that don't support the newest instruction set, you may use the macros at www.agner.org/assem/macros.zip.

7.1 Checking for operating system support for XMM registers

Unfortunately, the information that can be obtained from the CPUID instruction is not sufficient for determining whether it is possible to use the SSE and SSE2 instructions, which use the 128-bit XMM registers. The operating system has to save these registers during a task switch and restore them when the task is resumed. The microprocessor can disable the use of the XMM registers in order to prevent their use under old operating systems that do not save these registers. Operating systems that support the use of XMM registers must set bit 9 of the control register CR4 (the OSFXSR bit) to enable the use of XMM registers and to indicate their ability to save and restore these registers during task switches. (Saving and restoring registers is actually faster when XMM registers are enabled.)

Unfortunately, the CR4 register can only be read in privileged mode. Application programs therefore have a serious problem determining whether they are allowed to use the XMM registers or not. According to official Intel documents, the only way for an application program to determine whether the operating system supports the use of XMM registers is to try to execute an XMM instruction and see if you get an invalid opcode exception. This is ridiculous, because not all operating systems, compilers and programming languages provide facilities for application programs to catch invalid opcode exceptions. The advantage of using the XMM registers evaporates completely if you have no way of knowing whether you can use these registers without crashing your software.

These serious problems led me to search for an alternative way of checking if the operating system supports the use of XMM registers, and fortunately I have found a way that works reliably. If XMM registers are enabled, then the FXSAVE and FXRSTOR instructions can read and modify the XMM registers. If XMM registers are disabled, then FXSAVE and FXRSTOR cannot access these registers. It is therefore possible to check if XMM registers are enabled, by trying to read and write these registers with FXSAVE and FXRSTOR. The subroutines in www.agner.org/assem/asmlib.zip use this method. These subroutines can be called from assembly as well as from high-level languages, and provide an easy way of detecting whether XMM registers can be used.

In order to verify that this detection method works correctly with all microprocessors, I first checked various manuals. The 1999 version of Intel's software developer's manual says about the FXRSTOR instruction: "The Streaming SIMD Extension fields in the save image (XMM0-XMM7 and MXCSR) will not be loaded into the processor if the CR4.OSFXSR bit is not set." AMD's Programmer’s Manual says effectively the same. However, the 2003 version of Intel's manual says that this behavior is implementation dependent. In order to clarify this, I contacted Intel Technical Support and got the reply, "If the OSFXSR bit in CR4 in not set, then XMMx registers are not restored when FXRSTOR is executed". They further confirmed that this is true for all versions of Intel microprocessors and all microcode updates. I regard this as a guarantee from Intel that my detection method will work on all Intel microprocessors. We can rely on the method working correctly on AMD processors as well since the AMD manual is unambiguous on this question. It appears to be safe to rely on this method working correctly on future microprocessors as well, because any microprocessor that deviates from the above specification would introduce a security problem as well as failing to run existing programs. Compatibility with existing programs is of great concern to microprocessor producers.

The subroutines in www.agner.org/assem/asmlib.zip are constructed so that the detection will give a correct answer unless FXSAVE and FXRSTOR are both buggy. My detection method has been further verified by testing on many different versions of Intel and AMD processors and on different operating systems (a test program is available at www.agner.org/assem/xmmtest.zip).