Fog A.How to optimize for the Pentium family of microprocessors.2004.pdf
example 16.11. FLD ST(0) plays the same role in example 16.13 as ORPD XMM3,XMM1 in example 16.11.

The repetition count for this loop is the number of significant bits in n. If this value often changes, then you may repeat the loop the maximum number of times in order to make the loop control branch predictable. This requires, of course, that there is no risk of overflow in the multiplications.

Changing the code of example 16.13 to use XMM registers is no advantage, unless you can handle data in parallel, because conditional moves in XMM registers are complicated to implement (see page 110).
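The idea of always running the loop the maximum number of times can be sketched in C++ as follows. This is only an illustration, not the code of example 16.13: the function name ipow_fixed, the constant kMaxBits and the assumed limit n < 256 are hypothetical choices made for this sketch.

```cpp
#include <cassert>

// Hypothetical upper bound on the number of significant bits in n
// (an assumption made for this sketch).
constexpr int kMaxBits = 8;

// Raises x to the power n by repeated squaring.
// The loop always runs kMaxBits times, whatever the value of n,
// so the loop-control branch behaves the same way on every call.
double ipow_fixed(double x, unsigned n) {
    assert(n < (1u << kMaxBits));   // precondition of this sketch
    double y = 1.0;
    for (int bit = 0; bit < kMaxBits; ++bit) {
        // Multiply by x if the current bit of n is set, otherwise by 1.0.
        // A compiler may turn this select into a conditional move,
        // but that is not guaranteed.
        const double factor = (n & 1u) ? x : 1.0;
        y *= factor;
        x *= x;    // x keeps being squared after the last set bit: overflow risk
        n >>= 1;
    }
    return y;
}
```

For example, ipow_fixed(2.0, 10) returns 1024.0. Note that x is squared kMaxBits times even when n is small, which is why the values must be known not to overflow.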

16.4 Macro loops (all processors)

If the repetition count for a loop is small and constant, then it is possible to unroll the loop completely. The advantage of this is that calculations that depend only on the loop counter can be done at assembly time rather than at execution time. The disadvantage is, of course, that it takes up more space in the trace cache or code cache.

The MASM language includes a powerful macro language that is useful for this purpose. If, for example, we need a list of square numbers, then the C++ code may look like this:

int squares[10];
for (int i = 0; i < 10; i++) squares[i] = i * i;

The same list can be generated by a macro loop in MASM language:

; Example 16.14
.DATA
squares LABEL DWORD        ; label at start of array
I = 0                      ; temporary counter
REPT 10                    ; repeat 10 times
   DD I * I                ; define one array element
   I = I + 1               ; increment counter
ENDM                       ; end of REPT loop

Here, I is a preprocessing variable. The I loop is run at assembly time, not at execution time. The variable I and the statement I = I + 1 never make it into the final code, and hence take no time to execute. In fact, example 16.14 generates no executable code, only data. The macro preprocessor will translate the above code to:

squares LABEL DWORD        ; label at start of array
   DD 0
   DD 1
   DD 4
   DD 9
   DD 16
   DD 25
   DD 36
   DD 49
   DD 64
   DD 81
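A rough C++ analogue of this assembly-time loop is a constexpr function, which the compiler evaluates during compilation. The sketch below assumes C++14 or later; the names SquareTable, make_squares and squares_table are hypothetical and do not come from the original text.

```cpp
// Wrapper type so that the whole table can be returned from a constexpr function.
struct SquareTable {
    int v[10];
};

// Runs at compile time when used to initialize a constexpr object,
// much like the REPT loop runs at assembly time.
constexpr SquareTable make_squares() {
    SquareTable t{};
    for (int i = 0; i < 10; ++i) {
        t.v[i] = i * i;    // corresponds to DD I * I
    }
    return t;
}

// The table is stored as constant data; no loop is executed at run time.
constexpr SquareTable squares_table = make_squares();

// Checked by the compiler, which shows that the values exist at compile time.
static_assert(squares_table.v[9] == 81, "table is computed at compile time");
```

As with the macro loop, the loop counter never appears in the executable code; only the ten resulting values do.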

 

Now, let's return to the power example (example 16.12). If n is known at assembly time, then the power function can be implemented using the following macro:

; This macro will raise two packed double-precision floats in X
; to the power of N, where N is a positive integer constant.
; The result is returned in Y. X and Y must be two different
; XMM registers. X is not preserved.