Fog A.How to optimize for the Pentium family of microprocessors.2004.pdf

These methods all have lower latencies than the register-to-register moves. However, a drawback of these tricks is that they use port 1 which is also used for all calculations on these registers. If port 1 is saturated, then it may be better to use the slow moves, which go to port 0.

15.12 Bottlenecks in P4

It is important, when optimizing a piece of code, to find the limiting factor that controls execution speed. Tuning the wrong factor is unlikely to have any beneficial effect. In the following paragraphs, I will explain each of the possible limiting factors. You have to consider each factor in order to determine which one is the narrowest bottleneck, and then concentrate your optimization effort on that factor until it is no longer the narrowest bottleneck. As explained before, you have to concentrate on only the most critical part of your program - usually the innermost loop.

Memory access

If you are accessing large amounts of data, or if your data are scattered around everywhere in the memory, then you will have many data cache misses. Accessing uncached data is so time consuming that all other optimization considerations are unimportant. The caches are organized as aligned lines of 64 bytes each. If one byte within an aligned 64-byte block has been accessed, then you can be certain that all 64 bytes will be loaded into the level-1 data cache and can be accessed at no extra cost. To improve caching, it is recommended that data that are used in the same part of the program be stored together. You may align large arrays and structures by 64. Store local variables on the stack if you don't have enough registers.
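The alignment advice above can be sketched in C (the manual's own examples are in assembly; this fragment and the helper name are illustrative assumptions). aligned_alloc is C11 and requires the allocation size to be a multiple of the alignment:

```c
#include <stdint.h>
#include <stdlib.h>

/* Allocate an array of n doubles aligned to the 64-byte cache line size,
   so the array starts on a cache line boundary.  aligned_alloc is C11 and
   requires the total size to be a multiple of the alignment, so round up. */
static double *alloc_cacheline_aligned(size_t n) {
    size_t bytes = n * sizeof(double);
    size_t rounded = (bytes + 63) & ~(size_t)63;   /* round up to 64 */
    return aligned_alloc(64, rounded);
}
```

The caller frees the pointer with free() as usual; the rounding only pads the request, it does not change how many elements are usable.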

The level-1 data cache is only 8 kB on the P4. This may not be enough to hold all your data, but the level-2 cache is more efficient on the P4 than on previous processors. Fetching data from the level-2 cache will cost you only a few clock cycles extra.

Data that are unlikely to be cached may be prefetched before they are used. If memory addresses are accessed consecutively, then they will be prefetched automatically. You should therefore preferably organize your data in a linear fashion so that they can be accessed consecutively, and access no more than four large arrays, preferably less, in the critical part of your program.

The PREFETCH instructions can improve performance in situations where you access uncached data and cannot rely on automatic prefetching. However, excessive use of the PREFETCH instructions can slow down program throughput. If you are in doubt whether a PREFETCH instruction will benefit your program, then you may simply load the data needed into a spare register rather than using a PREFETCH instruction. If you have no spare register then use an instruction which reads the memory operand without changing any register, such as CMP or TEST. As the stack pointer is unlikely to be part of a critical dependence chain, a useful way to prefetch data is CMP ESP,[MEM], which will change only the flags.
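At the C level, the same idea can be expressed with __builtin_prefetch, a GCC/Clang builtin that emits a PREFETCH instruction (the builtin and the prefetch distance chosen here are assumptions, not from the manual, which works at the assembly level with PREETCH and the CMP ESP,[MEM] trick):

```c
#include <stddef.h>

/* Sum an array, prefetching one cache line ahead of the current element.
   __builtin_prefetch(addr, rw, locality) is a GCC/Clang builtin; rw = 0
   means the data will be read, locality = 1 means low temporal locality. */
static double sum_with_prefetch(const double *a, size_t n) {
    const size_t dist = 8;              /* 8 doubles = one 64-byte line */
    double s = 0.0;
    for (size_t i = 0; i < n; i++) {
        if (i + dist < n)
            __builtin_prefetch(&a[i + dist], 0, 1);
        s += a[i];
    }
    return s;
}
```

The best prefetch distance depends on memory latency and the work done per element; it is something to measure, not to guess.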

When writing to a memory location that is unlikely to be accessed again soon, you may use the non-temporal write instructions MOVNTI, etc., but excessive use of non-temporal moves will slow down performance.
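A minimal sketch of a non-temporal fill, assuming an x86 target with SSE2 (the intrinsic _mm_stream_si32 compiles to MOVNTI; the function name here is illustrative):

```c
#include <emmintrin.h>   /* SSE2 intrinsics; x86 only */
#include <stddef.h>

/* Fill a buffer with non-temporal stores, bypassing the caches.  This is
   useful when the data will not be read again soon.  _mm_sfence makes the
   streaming stores globally visible before any subsequent loads. */
static void fill_nontemporal(int *dst, size_t n, int value) {
    for (size_t i = 0; i < n; i++)
        _mm_stream_si32(&dst[i], value);   /* compiles to MOVNTI */
    _mm_sfence();
}
```

Non-temporal stores are still coherent, so a later read returns the stored values; the point is only that the written lines do not evict useful data from the caches.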

Not only data, but also code, should be arranged for optimal caching. Subroutines that are used in the same part of the program should preferably be stored together in the same memory address range, rather than be scattered around at different addresses. Put seldom-used branches at the bottom of your code. Make sure the critical part of your program is not too big for the trace cache.

Further guidelines regarding memory access can be found in "Intel Pentium 4 and Intel Xeon Processor Optimization Reference Manual".

Execution latency

The execution time for a dependence chain can be calculated from the latencies listed in the tables starting on page 147. Most instructions have an additional latency of ½ or 1 clock cycle when the subsequent instruction goes to a different execution unit. See page 86 for further explanation.

The longest dependence chains occur in loops where each iteration depends on the result of the preceding one. Such loops can often be improved by handling data in parallel. Use multiple accumulators or SIMD instructions to handle data in parallel (see page 34).
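The multiple-accumulator technique can be sketched in C (the function name and the choice of four accumulators are illustrative assumptions; the manual's examples are in assembly):

```c
#include <stddef.h>

/* Sum an array with four independent accumulators.  Each accumulator forms
   its own dependence chain, so the four additions per round can overlap in
   the pipeline instead of serializing on a single add-to-add chain. */
static double sum4(const double *a, size_t n) {
    double s0 = 0.0, s1 = 0.0, s2 = 0.0, s3 = 0.0;
    size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        s0 += a[i];
        s1 += a[i + 1];
        s2 += a[i + 2];
        s3 += a[i + 3];
    }
    for (; i < n; i++)       /* remaining elements */
        s0 += a[i];
    return (s0 + s1) + (s2 + s3);
}
```

Note that reassociating floating-point additions this way can change rounding slightly; for integer sums the transformation is exact.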

If long dependence chains limit the performance of your program then you may improve performance by choosing instructions with low latency, minimizing the number of transitions between execution units, breaking up dependence chains, and utilizing all opportunities for calculating subexpressions in parallel.

Always avoid memory intermediates in dependence chains, as explained on page 90.
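A simple C illustration of the point (the helper name is an assumption): keep the running value in a local variable that the compiler can hold in a register, and touch memory only once at the end, rather than updating a memory location on every iteration of the chain.

```c
#include <stddef.h>

/* Accumulate in a local variable and write the result to memory once.
   Updating *out inside the loop would put a store-plus-reload (a memory
   intermediate) into the dependence chain on every iteration. */
static double sum_local(const double *a, size_t n, double *out) {
    double s = 0.0;            /* lives in a register inside the loop */
    for (size_t i = 0; i < n; i++)
        s += a[i];
    *out = s;                  /* single store, outside the chain */
    return s;
}
```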

Execution unit throughput

If your dependence chains are short, or if you are working on several dependence chains in parallel, then your program may be limited by throughput rather than latency. Different execution units have different throughputs. Alu0 and alu1, which handle simple integer instructions and other common uops, both have a throughput of 2 instructions per clock cycle. Most other execution units have a throughput of one instruction per clock cycle. When working with 128-bit registers, the throughput is usually one instruction per two clock cycles. Division and square roots have the lowest throughputs. The instruction list on page 147 indicates reciprocal throughputs for all instructions on the P4 processor. Each throughput measure applies to all uops executing in the same execution subunit (see page 83).

If execution throughput limits your code then try to move some calculations to other execution subunits.

Port throughput

Each of the execution ports can receive one uop per clock cycle. Port 0 and port 1 can receive an additional uop at each half-clock tick if these uops go to the double-speed units alu0 and alu1. If all uops in the critical part of your code go to the single-speed and half-speed units under port 1, then the throughput will be limited to 1 uop per clock cycle. If the uops are optimally distributed between the four ports, then the throughput may be as high as 6 uops per clock cycle. Such a high throughput can only be achieved in short bursts, because the trace cache and the retirement station limit the average throughput to 3 uops per clock cycle.

If port throughput limits your code then try to move some uops to other ports.

Trace cache delivery

The trace cache can deliver a maximum of 3 uops per clock cycle. Some uops require more than one trace cache entry, as explained on page 76. The delivery rate can be less than 3 uops per clock cycle for code that contains many branches and for tiny loops with branches inside (see page 79).

If none of the abovementioned factors limit program performance, then you may aim at a throughput of 3 uops per clock cycle.

Choose the instructions that generate the smallest number of uops. Avoid uops that require more than one trace cache entry, and avoid having an odd number of single-size uops between any two double-size uops (see page 76).