Fog A.How to optimize for the Pentium family of microprocessors.2004.pdf

On P2 and P3 it is advantageous to use MMX registers for moving 8 bytes at a time if the above conditions are not met and the destination is likely to be in the level 1 cache. The loop may be unrolled by two.

On the P3, the fastest way of moving data is to use the MOVAPS instruction if the conditions on page 114 are not met or if the destination is in the level 1 or level 2 cache:

 

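        SUB     EDI, ESI            ; EDI = destination - source (assumes ESI = source, EDI = destination)
TOP:    MOVAPS  XMM0, [ESI]         ; load 16 bytes from the source (must be aligned by 16)
        MOVAPS  [ESI+EDI], XMM0     ; ESI+EDI addresses the destination
        ADD     ESI, 16             ; advance to the next 16-byte block
        DEC     ECX                 ; ECX = number of 16-byte blocks left
        JNZ     TOP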

On the P3 you also have the option of writing directly to RAM without involving the cache by using the MOVNTQ or MOVNTPS instruction. This can be useful if you don't want the destination to go into a cache. MOVNTPS is only slightly faster than MOVNTQ.

On the P4, the fastest way of moving blocks of data is to use MOVDQA. You may use MOVNTDQ if you don't want the destination to be cached, but MOVDQA is often faster. REP MOVSD may still be the best choice for small blocks of data if the block size is varying and a loop would suffer a branch misprediction.
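
As an illustration only, the 16-byte block move can be sketched in C with SSE2 intrinsics instead of assembly. The names copy_aligned16 and non_temporal are hypothetical. The sketch assumes both buffers are aligned by 16 and the size is a multiple of 16. The aligned load and store intrinsics typically compile to MOVDQA or an equivalent aligned move, and the streaming store to MOVNTDQ. If the compiler does not target SSE2, the sketch falls back to memcpy.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>
#if defined(__SSE2__)
#include <emmintrin.h>              /* SSE2 intrinsics */
#endif

/* Copy `bytes` bytes between buffers that are both aligned by 16.
   `bytes` must be a multiple of 16. If non_temporal is nonzero,
   streaming stores are used so the destination bypasses the cache. */
void copy_aligned16(void *dst, const void *src, size_t bytes, int non_temporal)
{
    assert(bytes % 16 == 0);
    assert((uintptr_t)dst % 16 == 0 && (uintptr_t)src % 16 == 0);
#if defined(__SSE2__)
    __m128i *d = (__m128i *)dst;
    const __m128i *s = (const __m128i *)src;
    size_t blocks = bytes / 16;
    for (size_t i = 0; i < blocks; i++) {
        __m128i x = _mm_load_si128(s + i);      /* aligned 16-byte load */
        if (non_temporal)
            _mm_stream_si128(d + i, x);         /* non-temporal store */
        else
            _mm_store_si128(d + i, x);          /* aligned 16-byte store */
    }
    if (non_temporal)
        _mm_sfence();                           /* order the streaming stores */
#else
    (void)non_temporal;                         /* assumption: no SSE2 available */
    memcpy(dst, src, bytes);
#endif
}
```

Calling copy_aligned16(dst, src, n, 0) corresponds to the cached MOVDQA loop, and passing 1 as the last argument corresponds to the MOVNTDQ variant. Which one is faster should be measured for the block sizes in question.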

For further advice on improving memory access see the Intel Pentium 4 and Intel Xeon Processor Optimization Reference Manual.

19.7 Self-modifying code (All processors)

The penalty for executing a piece of code immediately after modifying it is approximately 19 clocks for P1, 31 for PMMX, and 150-300 for PPro, P2 and P3. The P4 will purge the entire trace cache after self-modifying code. The 80486 and earlier processors require a jump between the modifying and the modified code in order to flush the code cache.

To get permission to modify code in a protected operating system you need to call special system functions: In 16-bit Windows call ChangeSelector; in 32-bit Windows call VirtualProtect and FlushInstructionCache (or put the code in a data segment).

Self-modifying code is not considered good programming practice. It should only be used if the gain in speed is substantial and the modified code is executed so many times that the advantage outweighs the penalties for using self-modifying code.

20 Testing speed

The microprocessors in the Pentium family have an internal 64-bit clock counter which can be read into EDX:EAX using the instruction RDTSC (read time stamp counter). This is very useful for measuring exactly how many clock cycles a piece of code takes.

On the PPro, P2, P3 and P4 processors, you have to insert XOR EAX,EAX / CPUID before and after each RDTSC to prevent it from executing in parallel with anything else. CPUID is a serializing instruction, which means that it flushes the pipeline and waits for all pending operations to finish before proceeding. This is very useful for testing purposes.
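
The serialized read described above might look as follows when written as GCC/Clang inline assembly in C (AT&T operand order) rather than MASM. This is a sketch for x86 processors only. The names read_tsc_serialized and tsc_read_overhead are hypothetical, and the second function is an addition for illustration: it times an empty interval, which gives an estimate of the cost of the measurement itself.

```c
#include <assert.h>
#include <stdint.h>

/* Read the time stamp counter with XOR EAX,EAX / CPUID before and
   after RDTSC, so that the read does not overlap other instructions. */
uint64_t read_tsc_serialized(void)
{
    uint32_t lo, hi;
    __asm__ __volatile__(
        "xorl %%eax, %%eax\n\t"
        "cpuid\n\t"                 /* serialize: wait for pending operations */
        "rdtsc\n\t"                 /* EDX:EAX = clock counter */
        "movl %%eax, %0\n\t"        /* save the count before CPUID overwrites it */
        "movl %%edx, %1\n\t"
        "xorl %%eax, %%eax\n\t"
        "cpuid\n\t"                 /* serialize again */
        : "=&r"(lo), "=&r"(hi)
        :
        : "eax", "ebx", "ecx", "edx", "memory");
    return ((uint64_t)hi << 32) | lo;
}

/* Clock count of an empty interval between two serialized reads. */
uint64_t tsc_read_overhead(void)
{
    uint64_t start = read_tsc_serialized();
    uint64_t end = read_tsc_serialized();
    assert(end >= start);   /* assumes both reads ran on the same processor */
    return end - start;
}
```

To time a piece of code, take one reading before it and one after it and subtract. The value from tsc_read_overhead can then be subtracted from the result, assuming the overhead is reasonably stable from run to run.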

The RDTSC instruction cannot execute in virtual mode on the P1 and PMMX, so if you are testing DOS programs on these processors you must run in real mode.

The biggest problem when counting clock ticks is to avoid interrupts. Protected operating systems may not allow you to clear the interrupt flag, so you cannot avoid interrupts and task switches during the test. There are several alternative ways to overcome this problem:

1. Run the test code with a high priority to minimize the risk of interrupts and task switches.

2. If the piece of code you are testing is relatively short then you may repeat the test several times and assume that the lowest of the clock counts measured represents a situation where no interrupt has occurred.

3. If the piece of code you are testing takes so long that interrupts are unavoidable then you may repeat the test many times and take the average of the clock count measurements.

4. Make a virtual device driver to clear the interrupt flag.

5. Use an operating system that allows clearing the interrupt flag (e.g. Windows 98 without network, in console mode).

6. Start the test program in real mode using the old DOS operating system.
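
Methods 2 and 3 in the list above amount to repeating the measurement and reducing the recorded clock counts to their minimum or their mean. The following C sketch shows that bookkeeping only. All names in it (counter_fn, code_fn, clock_summary, collect_counts, summarize_counts) are hypothetical, and read_counter stands for whatever routine reads the clock counter, for example a wrapper around RDTSC.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef uint64_t (*counter_fn)(void);   /* reads the clock counter */
typedef void (*code_fn)(void);          /* the code under test */

typedef struct {
    uint64_t min;   /* lowest count: probably a run without interrupts (method 2) */
    double mean;    /* average count: for code too long to avoid interrupts (method 3) */
} clock_summary;

/* Run the code n times and store the clock count of each run in out[0..n-1]. */
void collect_counts(counter_fn read_counter, code_fn code,
                    uint64_t *out, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        uint64_t start = read_counter();
        code();
        out[i] = read_counter() - start;
    }
}

/* Reduce the recorded counts to their minimum and their mean. */
clock_summary summarize_counts(const uint64_t *counts, size_t n)
{
    assert(n > 0);
    clock_summary s = { counts[0], 0.0 };
    double sum = 0.0;
    for (size_t i = 0; i < n; i++) {
        if (counts[i] < s.min)
            s.min = counts[i];
        sum += (double)counts[i];
    }
    s.mean = sum / (double)n;
    return s;
}
```

For counts of 120, 100, 4100 and 101 clocks, the minimum is 100 and the mean is 1105.25. The single run of 4100 clocks, which would typically be an interrupted run, dominates the mean but does not affect the minimum. This is why the minimum is preferred for short pieces of code.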

My test programs use methods 1, 2, 5 and 6. These programs are available at www.agner.org/assem/testp.zip. The test programs that use method 6 set up a segment descriptor table and switch to 32-bit protected mode with the interrupt flag cleared. You can insert the code you want to test into these test programs. You need a bootable disk with Windows 98 or earlier to get access to run the test programs in real mode.

Remember when you are measuring clock ticks that a piece of code always takes longer the first few times it is executed, when it is not yet in the code cache or trace cache. Furthermore, it may take three iterations before the branch predictor has adapted to the code.

The alignment effects on the PPro, P2 and P3 processors make time measurements very difficult on these processors. Assume that you have a piece of code and you want to make a change which you expect to make the code a few clocks faster. The modified code does not have exactly the same size as the original. This means that the code below the modification will be aligned differently and the instruction fetch blocks will be different. If instruction fetch and decoding are a bottleneck, which is often the case on these processors, then the change in the alignment may make the code several clock cycles faster or slower. The change in the alignment may actually have a larger effect on the clock count than the modification you have made. So you may be unable to verify whether the modification in itself makes the code faster or slower. It can be quite difficult to predict where each instruction fetch block begins, as explained on page 62.

The P1, PMMX and P4 processors do not have these alignment problems. The P4 does, however, have a somewhat similar, though less severe, effect. This effect is caused by changes in the alignment of uops in the trace cache. The time it takes to jump to the least common (but predicted) branch after a conditional jump instruction may differ by up to two clock cycles on different alignments if trace cache delivery is the bottleneck. The alignment of uops in the trace cache lines is difficult to predict (see page 79).

The processors in the Pentium family have special performance monitor counters which can count events such as cache misses, misalignments, branch mispredictions, etc. You need privileged access to set up these counters. The performance monitor counters are model specific. This means that you must use a different test setup for each microprocessor model. Details about how to use the performance monitor counters can be found in Intel's Software Developer's Manuals.

The test programs at www.agner.org/assem/testp.zip give access to the performance monitor counters when run under real mode DOS.