Fog A.How to optimize for the Pentium family of microprocessors.2004.pdf

This code is not portable, and will work only on Intel-compatible microprocessors. The round function is also available in the function library at www.agner.org/assem/asmlib.zip.

The P3 and P4 processors have fast truncation instructions, but these instructions are not compatible with previous microprocessors and can therefore only be used in code that is written exclusively for these microprocessors.

Conversion of unsigned integers to floating-point numbers is also slow. Use signed integers for efficient conversion to float.
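As a minimal sketch of this advice, an unsigned value that is known to fit in 31 bits can be cast to a signed integer before it is converted. The helper name `to_double_via_signed` and its precondition are assumptions made for illustration; whether the cast actually saves time depends on the compiler and the target processor.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper (not from the manual): converts an unsigned value to
// floating point by going through a signed integer.
// Only valid when the value is known to be below 2^31.
inline double to_double_via_signed(std::uint32_t u) {
    assert(u <= 0x7FFFFFFFu);  // precondition of this sketch
    return static_cast<double>(static_cast<std::int32_t>(u));
}
```

For example, `to_double_via_signed(123456u)` yields `123456.0`. Values of 2^31 or above would violate the precondition and must take the ordinary unsigned conversion.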

3.10 Character arrays versus string objects

Modern C++ libraries define a class named string or CString which facilitates the manipulation of text strings. These classes use dynamic memory allocation and are much less efficient than the old method of using character arrays for text strings. If you don't know how to do this, then find the explanation in an old C++ textbook, or study the documentation for the functions strcpy, strncpy, strcat, strlen, strcmp, sprintf. But character arrays are not protected against overflow. If security is important then use string objects. If speed is important then use character arrays and remember that it is your own responsibility that the length of a string never exceeds the length of the array minus 1.
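The length rule above can be sketched as follows. The helper `copy_bounded` is a hypothetical name introduced here for illustration; it uses only the standard `strncpy` function and keeps the stored string no longer than the array length minus 1.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>

// Hypothetical helper (not from the manual): copies src into a fixed-size
// character array and always leaves room for the terminating zero.
template <std::size_t N>
void copy_bounded(char (&dest)[N], const char* src) {
    static_assert(N > 0, "destination must hold at least the terminator");
    assert(src != nullptr);
    std::strncpy(dest, src, N - 1);  // copies at most N-1 characters
    dest[N - 1] = '\0';              // strncpy does not terminate on truncation
}
```

With `char buf[8];`, calling `copy_bounded(buf, "0123456789")` stores the first seven characters followed by the terminating zero. Silent truncation is a design choice of this sketch; a real program may prefer to report the overflow.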

4 Combining assembly and high level language

Before you start to code a function in assembly language, you should code it in C++, using the optimization guidelines given in the previous chapter (page 8). Only the most critical part of your program needs to be optimized using assembly language.

4.1 Inline assembly

The simplest way to combine C++ and assembly language is to insert inline assembly in the C++ code. See the compiler manual for syntax details.

Note that not all registers can be used freely in inline assembly. To be safe, avoid modifying EBP, EBX and ESP. It is recommended to let the C++ compiler make an assembly file so you can check if the inline assembly code interfaces correctly with the surrounding C++ code and that no reserved register is modified without saving. See the chapters below on register usage.

The C++ compiler may interpret the most common assembly instructions using a built-in assembler. But in many cases the compiler needs to translate all the surrounding C++ code to assembly and run everything through an assembler. It may be possible to specify which assembler to use for inline assembly, but the assembler must be compatible with the assembly generated by the compiler as well as with the inline assembly.

If you are using the Gnu compiler, you have to use the primitive AT&T syntax for inline assembly or move the assembly code to a separate module.
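To give an impression of the AT&T syntax, here is a minimal sketch of GNU-style extended inline assembly. The function name `add_asm` is hypothetical, and the plain C++ fallback for other compilers and processors is an addition of this sketch. Note that AT&T syntax writes the source operand first and the destination last.

```cpp
#include <cassert>

// Hypothetical example (not from the manual): adds two integers with one
// inline ADD instruction when built by a GNU-compatible compiler for x86.
inline int add_asm(int a, int b) {
#if defined(__GNUC__) && (defined(__i386__) || defined(__x86_64__))
    __asm__("addl %1, %0"   // AT&T order: source, destination  (a += b)
            : "+r"(a)       // operand 0: read and written, any register
            : "r"(b)        // operand 1: input, any register
            : "cc");        // the arithmetic flags are modified
    return a;
#else
    return a + b;           // portable fallback, an assumption of this sketch
#endif
}
```

The operand constraints let the compiler choose the registers, so the code does not have to name EBX, EBP or ESP explicitly.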

An alternative to inline assembly is to make entire functions in separate assembly language modules. This gives better control of register use and function prolog and epilog code. The following chapters give more details on how to make assembly language modules that can be linked into high level language programs.

4.2 Calling conventions

An application binary interface (ABI) is a set of standards for programs running under a particular system. When linking assembly language modules together with modules written in other programming languages, it is essential that your assembly code conform to all standards. It is possible to use your own standards for assembly procedures that are called only from other assembly procedures, but it is highly recommended to follow as many of the existing standards as possible. On the 32-bit Intel x86-compatible platform, there are several different conventions for transferring parameters to procedures:

calling convention   parameter order on stack     parameters removed by
__cdecl              first par. at low address    caller
__stdcall            first par. at low address    subroutine
__fastcall           compiler specific            subroutine
_pascal              first par. at high address   subroutine
member function      compiler specific            compiler specific

The __cdecl calling convention is the default for C and C++ functions, while __stdcall is the default for system functions. Statically linked modules in .obj and .lib files should preferably use __cdecl, while dynamic link libraries in .dll files should use __stdcall.

Remember that the stack pointer is decreased when a value is pushed on the stack. This means that the parameter pushed first will be at the highest address, in accordance with the _pascal convention. You must push parameters in reverse order to satisfy the __cdecl and __stdcall conventions.
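The effect of the push order can be illustrated with a toy model of a downward-growing stack. This is a sketch under stated assumptions, not real calling code: `ToyStack` and `push_cdecl_args` are hypothetical names, all parameters are taken to be 32 bits wide, and the return address that CALL would push is not modeled.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Toy model (not from the manual) of a stack that grows downward.
// sp is a byte offset that decreases by 4 on every push, as ESP does.
struct ToyStack {
    std::vector<std::uint32_t> slots;
    std::uint32_t sp;

    explicit ToyStack(std::uint32_t n) : slots(n), sp(n * 4) {}

    void push(std::uint32_t value) {
        assert(sp >= 4);        // no room left otherwise
        sp -= 4;
        slots[sp / 4] = value;
    }
    // Reads the DWORD at [sp + offset].
    std::uint32_t at(std::uint32_t offset) const {
        return slots[(sp + offset) / 4];
    }
};

// Pushes the arguments of a hypothetical f(p1, p2, p3) in the order that a
// __cdecl or __stdcall caller must use: last parameter first.
inline ToyStack push_cdecl_args(std::uint32_t p1, std::uint32_t p2,
                                std::uint32_t p3) {
    ToyStack s(16);
    s.push(p3);
    s.push(p2);
    s.push(p1);
    return s;
}
```

After `push_cdecl_args(1, 2, 3)`, the first parameter is at `[sp]`, the second at `[sp+4]` and the third at `[sp+8]`, so the first parameter ends up at the lowest address.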

The __fastcall convention allows parameters to be transferred in registers. This is considerably faster, especially on the P4. Unfortunately, the __fastcall convention is different for different compilers. You may improve execution speed by using registers for parameter transfer on assembly procedures that are called only from other assembly language procedures.

4.3 Data storage in C++

Variables and objects that are declared inside a function in C++ will be stored on the stack and addressed by ESP or EBP. This is the most efficient way of storing data, for two reasons. Firstly, the stack space used for local storage is released when the function returns and may be reused by the next function that is called. Using the same memory area repeatedly improves data caching. The second reason is that data stored on the stack can often be addressed with an 8-bit offset relative to a pointer rather than the 32 bits required for addressing data in the data segment. This makes the code more compact so that it takes less space in the code cache or trace cache.

Global and static data in C++ are stored in the data segment and addressed with 32-bit absolute addresses. A third way of storing data in C++ is to allocate space with new or malloc. This method should be avoided if speed is critical.

The following example shows a simple C++ function with local data stored on the stack, and the same function translated to assembly. (Calling details will be explained on page 17 below).

; Example 4.1

    #include <math.h>

    extern "C" double SinPlusPow (double a, double b, double c) {
       double x, y;
       x = sin(a);
       y = pow(b,c);
       return x + y;}

Same in assembly, with __cdecl calling convention:

    _SinPlusPow PROC NEAR
    SMAP        STRUC                       ; make a map of data on stack
    CALLPARM1   DQ  ?                       ; parameter 1 for call to sin and pow
    CALLPARM2   DQ  ?                       ; parameter 2 for call to pow
    X           DQ  ?                       ; local variable X
    Y           DQ  ?                       ; local variable Y
    RETURNADDR  DD  ?                       ; return address for _SinPlusPow
    A           DQ  ?                       ; parameters for _SinPlusPow
    B           DQ  ?
    C_          DQ  ?                       ; (C is a reserved word in MASM 6)
    SMAP        ENDS
    ; compute space required for data not already on stack:
    LOCALDATASPACE = SMAP.RETURNADDR - SMAP.CALLPARM1
            SUB     ESP, LOCALDATASPACE     ; make space for local data
            FLD     [ESP].SMAP.A            ; load a
            FSTP    [ESP].SMAP.CALLPARM1    ; store a on top of stack
            CALL    _sin                    ; _sin reads parameter CALLPARM1
            FSTP    [ESP].SMAP.X            ; store x
            FLD     [ESP].SMAP.B            ; load b
            FSTP    [ESP].SMAP.CALLPARM1    ; store b on top of stack
            FLD     [ESP].SMAP.C_           ; load c
            FSTP    [ESP].SMAP.CALLPARM2    ; store c next on stack
            CALL    _pow                    ; _pow reads CALLPARM1 and CALLPARM2
            FST     [ESP].SMAP.Y            ; store y
            FADD    [ESP].SMAP.X            ; x + y
            ADD     ESP, LOCALDATASPACE     ; release local data space
            RET                             ; return value in ST(0)
    _SinPlusPow ENDP

    PUBLIC  _SinPlusPow                     ; public function
    EXTRN   _sin:near, _pow:near            ; external functions

In this function, we are allocating space for local data by subtracting the size of CALLPARM1, CALLPARM2, X and Y from the stack pointer. The stack pointer must be restored to its original value before the RET. Before calling the functions _sin and _pow, we must place the function parameters for these calls at the right place relative to the current value of ESP. Therefore, we have placed CALLPARM1 and CALLPARM2 at the beginning of SMAP. It is more common to push the parameters before a function call and pop the stack after the call, but this method is faster. We are assuming here that the _sin and _pow functions use the __cdecl calling convention so that ESP still points to CALLPARM1 after the call. Therefore, we don't need to adjust the stack pointer between the two calls.
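The arithmetic behind the stack map can be checked with a small C++ sketch that mirrors it as a packed structure. The names `StackMap` and `kLocalDataSpace` are hypothetical, and the sketch assumes that the assembler lays out the structure without padding, that the compiler supports `#pragma pack`, and that `double` is 8 bytes.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

#pragma pack(push, 1)          // no padding, assumed to match the assembly map
struct StackMap {
    double        callparm1;   // parameter 1 for calls to sin and pow
    double        callparm2;   // parameter 2 for call to pow
    double        x;           // local variable x
    double        y;           // local variable y
    std::uint32_t returnaddr;  // 32-bit return address
    double        a;           // parameters of SinPlusPow
    double        b;
    double        c;
};
#pragma pack(pop)

// Same computation as LOCALDATASPACE in the assembly listing.
constexpr std::size_t kLocalDataSpace =
    offsetof(StackMap, returnaddr) - offsetof(StackMap, callparm1);

static_assert(kLocalDataSpace == 32, "four 8-byte slots are allocated");
static_assert(sizeof(StackMap) % 4 == 0, "map size must be a multiple of 4");
```

Under these assumptions the function subtracts 32 from ESP, after which parameter a is found at offset 36 from ESP: 32 bytes of local data plus the 4-byte return address.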

Remember, when using ESP as a pointer, that the value of ESP is changed every time you have a PUSH or POP. If you are using simplified function directives (MASM 6.x syntax), such as:

SinPlusPow PROC NEAR C, a:REAL8, b:REAL8, c:REAL8

then you have an implicit PUSH EBP in the prolog code which you must include in your stack map. You may use .LISTALL to see the prolog code. Remember, also, that the size of the stack map must be a multiple of 4.

This function could be further optimized. We might use integer registers for moving A, B and C; we don't need to store Y; and we might use the FSIN instruction rather than calling the external function _sin. The purpose of the above example is just to show how data are stored and transferred on the stack.

If your assembly code contains many calls to high-level language functions or system functions, then you are in all likelihood optimizing the wrong part of your program. The critical innermost loop where most of the CPU time is used should be placed in a separate function that does not call any other functions.

4.4 Register usage in 16 bit mode DOS or Windows

Function parameters are passed on the stack according to the calling conventions listed on page 13. Parameters of 8 or 16 bits size use one word of stack space. Parameters bigger than 16 bits are stored in little-endian form, i.e. with the least significant word at the lowest address.

Function return values are passed in registers in most cases. 8-bit integers are returned in AL, 16-bit integers and near pointers in AX, 32-bit integers and far pointers in DX:AX, Booleans in AX, and floating-point values in ST(0).

Registers AX, BX, CX, DX, ES and arithmetic flags may be changed by the procedure. All other registers must be saved and restored. A procedure can rely on SI, DI, BP, DS and SS being unchanged across a call to another procedure. The high word of ESP cannot be used because it is modified by interrupts and task switches.

4.5 Register usage in 32 bit Windows

Function parameters are passed on the stack according to the calling conventions listed on page 13. Parameters of 32 bits size or less use one DWORD of stack space. Parameters bigger than 32 bits are stored in little-endian form, i.e. with the least significant DWORD at the lowest address, and DWORD aligned.

Function return values are passed in registers in most cases. 8-bit integers are returned in AL, 16-bit integers in AX, 32-bit integers, pointers, and Booleans in EAX, 64-bit integers in EDX:EAX, and floating-point values in ST(0). Structures and class objects not exceeding 64 bits size are returned in the same way as integers, even if the structure contains floating point values. Structures and class objects bigger than 64 bits are returned through a pointer passed to the function as the first parameter and returned in EAX. Compilers that don't support 64-bit integers may return structures bigger than 32 bits through a pointer. The Borland compiler also returns structures through a pointer if the size is not a power of 2.
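The EDX:EAX convention for 64-bit integers can be pictured with the following sketch, which only models how the value is divided between the two registers. `RegisterPair` and `split_return_value` are hypothetical names introduced for illustration; no registers are actually accessed.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical model (not from the manual) of a 64-bit return value as it
// would be divided between EAX (low half) and EDX (high half).
struct RegisterPair {
    std::uint32_t eax;  // least significant 32 bits
    std::uint32_t edx;  // most significant 32 bits
};

inline RegisterPair split_return_value(std::int64_t value) {
    const std::uint64_t bits = static_cast<std::uint64_t>(value);
    return { static_cast<std::uint32_t>(bits),          // low DWORD
             static_cast<std::uint32_t>(bits >> 32) };  // high DWORD
}
```

For the value 0x0123456789ABCDEF, the model gives EAX = 0x89ABCDEF and EDX = 0x01234567.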

Registers EAX, ECX and EDX may be changed by a procedure. All other general-purpose registers (EBX, ESI, EDI, EBP) must be saved and restored if they are used. The value of ESP must be divisible by 4 at all times, so don't push 16-bit data on the stack.

Segment registers cannot be changed, not even temporarily. CS, DS, ES, and SS all point to the flat segment group. FS is used for a thread environment block. GS is unused, but reserved.

Flags may be changed by a procedure with the following restrictions: The direction flag is 0 by default. The direction flag may be set temporarily, but must be cleared before any call or return. The interrupt flag cannot be cleared.

The floating-point register stack is empty at the entry of a procedure and must be empty at return, except for ST(0) if it is used for the return value. MMX registers may be changed by the procedure, but if they are, they must be cleared by EMMS before returning and before calling any other procedure that may use floating-point registers. All XMM registers can be modified by procedures. Rules for passing parameters and return values in XMM registers are described in Intel's application note AP 589 "Software Conventions for Streaming SIMD Extensions".

A procedure can rely on EBX, ESI, EDI, EBP and all segment registers being unchanged across a call to another procedure.

4.6 Register usage in Linux

The rules for register usage in Linux appear to be almost the same as for 32-bit Windows. Registers EAX, ECX, and EDX may be changed by a procedure. All other general-purpose registers must be saved. There appears to be no rule for the direction flag. Function return values are transferred in the same way as under Windows. Calling conventions are the same, except that no underscore is prefixed to public names. I have no information about the use of FS and GS in Linux. It is not difficult to make an assembly