
2.6 Literature

A lot of useful literature can be downloaded for free from Intel's web site or acquired in print or on CD-ROM. It is recommended that you study this literature in order to get acquainted with the microprocessor instruction set. However, the documents from Intel are not always accurate. Especially the first Pentium tutorials had many errors.

The most important manuals are "Intel Pentium 4 and Intel Xeon Processor Optimization Reference Manual" and "IA-32 Intel Architecture Software Developer's Manual, Volume 2: Instruction Set Reference". I will not give the URLs here because the file locations change very often. You can find the documents you need by using the search facilities at developer.intel.com, or follow the links from www.agner.org/assem.

VTUNE is a software tool from Intel for optimizing code. I have not tested it and can therefore not give any evaluation of it here.

Many sources other than Intel also provide useful information. These sources are listed in the FAQ for the newsgroup comp.lang.asm.x86. For other internet resources, follow the links from www.agner.org/assem.

Some useful textbooks are "Introduction to 80x86 Assembly Language and Computer Architecture" by R. C. Detmer, 2001; and "Computer Architecture: A Quantitative Approach" by J. L. Hennessy and D. A. Patterson, 3rd ed., 2002.

3 Optimizing in C++

Here, I will give you some general advice for improving the speed of C++ code. Most of this advice applies to other compiled programming languages as well.

3.1 Use optimization options

Study the optimization options provided by your compiler and use the ones that are applicable to your project. Turn off all debugging options when you are finished using them and want to make the final distributable code.
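
For example, with the GNU compiler a release build might be made with a command line like the following (the file name is just a placeholder, and the option names differ between compilers; -O2 turns on general optimizations and -DNDEBUG disables assert checks):

g++ -O2 -DNDEBUG myprogram.cpp -o myprogram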

3.2 Identify the most critical parts of your code

In computation-intensive software programs, you will often find that 99% of the CPU time is used in the innermost loop. Identifying the most critical part of your software is therefore necessary if you want to improve the speed of computation. Optimizing less critical parts of your code is not only a waste of time, it also makes your code less clear and harder to debug and maintain. If it is not obvious which part of your code is most critical, then you may use a profiler. If you don't have a profiler, then set up a number of counter variables that are incremented at different places in your code to see which part is executed most often.
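
If you use counters, a minimal sketch might look like this (the function and counter names are only for illustration):

// Temporary counters used to find out which part of the code is hot:
static long counterA = 0, counterB = 0;

void Calculate(double data[], int n) {
   for (int i = 0; i < n; i++) {
      counterA++;                    // counts iterations of this loop
      if (data[i] < 0) {
         counterB++;                 // counts how often this branch is taken
         data[i] = -data[i];
      }
   }
}

// Print counterA and counterB before the program exits to see which
// part is executed most often, then remove the counters again.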

Study the algorithm used in the critical part of your code and see if it can be improved. Often you can gain more speed simply by optimizing the algorithm than by any other optimization method.

3.3 Break dependence chains

Modern microprocessors can do out-of-order execution. This means that if a piece of software specifies the calculation of A and then B, and the calculation of A is slow, then the microprocessor can begin the calculation of B before the calculation of A is finished. Obviously, this is only possible if the value of A is not needed for the calculation of B.

In order to take advantage of out-of-order execution, you have to avoid long dependence chains. A dependence chain is a series of calculations, where each calculation depends on the result of the preceding one. Consider the following example, which calculates the sum of 100 numbers:

double list[100], sum = 0.;

for (int i = 0; i < 100; i++) sum += list[i];

This is a long dependence chain. If a floating-point addition takes 5 clock cycles, then this loop will take approximately 500 clock cycles. You can improve the performance dramatically by splitting the dependence chain in two:

double list[100], sum1 = 0., sum2 = 0.;
for (int i = 0; i < 100; i += 2) {
   sum1 += list[i];
   sum2 += list[i+1];
}
sum1 += sum2;

If the microprocessor is doing an addition to sum1 from time T to T+5, then it can do another addition to sum2 from time T+1 to T+6, and the whole loop will take only 256 clock cycles.

3.4 Use local variables

Variables and objects that are declared inside a function, and not static, will be stored on the stack. The same applies to function parameters. The memory space occupied by these variables is released when the function returns, and can be reused by the next function. Using the same memory space again and again makes the caching of memory more efficient. Unless you have very big arrays and objects on your stack, you can be almost certain that your local variables are in the level-1 cache inside the microprocessor, from where they can be accessed many times faster than other parts of the memory. Static and global variables and objects are stored at a fixed place in memory, and are less likely to be cached.

If you need global variables, then you may make these variables part of a class and access them through member functions. This may save space in the level-2 cache and the trace cache because addresses relative to the 'this' pointer can be expressed with an 8-bit or 16-bit offset, while absolute addresses require 32 bits. The drawback is that extra code is required for transferring the 'this' pointer to the member functions, and that the 'this' pointer uses one register which might otherwise be used for other purposes.
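
A small sketch of this technique might look like the following (the class and member names are of course only examples):

class GlobalData {
public:
   int    errorCount;                // replaces a global variable
   double table[100];                // replaces a global array
   void   AddEntry(int i, double x) {
      errorCount++;                  // addressed with a small offset from 'this'
      table[i] = x;                  // likewise addressed relative to 'this'
   }
};

GlobalData theGlobals;               // one object holds all the former globals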

Allocating objects with new or malloc is inefficient, and should be avoided where possible. For example, a first-in-first-out queue can be implemented in an array with wrap-around, rather than a linked list. If you are using container class templates, then make your own templates that do not use dynamic memory allocation. Dynamic arrays can be made more efficiently with alloca than with new or malloc.
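
As an illustration, such a wrap-around queue might be sketched like this (the size and the names are my own choices for the example):

const int QSIZE = 256;               // capacity, chosen as a power of 2
struct FIFOQueue {
   double buffer[QSIZE];             // fixed array, no new or malloc
   int head, tail;                   // read and write positions
   FIFOQueue() {head = 0; tail = 0;}
   void Put(double x) {              // assumes the queue is not full
      buffer[tail] = x;
      tail = (tail + 1) & (QSIZE - 1);    // wrap around
   }
   double Get() {                    // assumes the queue is not empty
      double x = buffer[head];
      head = (head + 1) & (QSIZE - 1);    // wrap around
      return x;
   }
};

Because QSIZE is a power of 2, the wrap-around is done with a cheap AND instead of a division.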

3.5 Use array of structures rather than structure of arrays

Variables that are used together should preferably be stored near each other in order to improve caching. If you have two arrays, a and b, and the elements are accessed in the order a[0], b[0], a[1], b[1], ... then you may improve the performance by making an array of a structure which contains one a and one b.
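
For example (a small sketch; the array size is arbitrary):

// structure of arrays: a[i] and b[i] may lie far apart in memory
double a[1000], b[1000];

// array of structures: the a and b that are used together share a cache line
struct pair {double a; double b;};
pair list[1000];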

3.6 Alignment of data

A variable is accessed most efficiently if it is stored at a memory address which is divisible by the size of the variable. For example, a double takes 8 bytes of storage space. It should therefore preferably be stored at an address divisible by 8. The size should always be a power of 2. Objects bigger than 16 bytes should be stored at an address divisible by 16.

Not all compilers let you control data alignment directly. But you can control the alignment of structure and class members. Consider this structure:

struct abc {
   unsigned char a;    // takes 1 byte storage
   int b;              // 4 bytes storage
   double c;           // 8 bytes storage
} x;

Assume that x is stored at address N, which is divisible by 8. Then x.a will be at address N, x.b at address N+1, and x.c at address N+5. So x.b and x.c will not be properly aligned. You may change the structure definition to:

struct abc {
   double c;           // 8 bytes storage
   int b;              // 4 bytes storage
   unsigned char a;    // 1 byte storage
   char unused[3];     // fill up to 16 bytes
} x;

Now all elements are properly aligned, provided that x is aligned by 8. The 3 extra unused characters make sure that if you have an array of structures, then all elements in the array will be properly aligned.

3.7 Division

Division takes much longer than addition, subtraction, and multiplication. You should therefore minimize the number of divisions.

You can divide an integer by 2^n by shifting the binary number n places to the right. All modern compilers will use this trick if the divisor is a power of 2 and known as a constant at compile time. Likewise, multiplication by a power of 2 will be done using a left shift. The method is simpler if the dividend is unsigned. Example:

int n = 1000;
int divisor = 8;
int fraction = n / divisor;

Change this to:

int n = 1000;
const int divisor = 8;
int fraction = (unsigned)n / divisor;   // (will do n >> 3)

Making divisor a const (or simply writing 8 instead of divisor) makes sure that the compiler can use this optimization. Casting n to unsigned improves the code even further (assuming that you are certain that n can never be negative).

Floating-point division by a constant should be done by multiplying with the reciprocal:

double n;
double divisor = 1.2345;
double fraction = n / divisor;

Change this to:

double n;
const double factor = (1. / 1.2345);
double fraction = n * factor;

The compiler will calculate 1./1.2345 at compile time and insert the reciprocal in the code, so you will never spend time doing the division. In fact, any expression that contains only constants and does not involve function calls will be evaluated by the compiler and replaced by the result.

Divisions can sometimes be eliminated completely. For example:

if (a > b / c)

can often be replaced by

if (a * c > b)

Pitfalls: The inequality sign must be reversed if c < 0. The division is inexact if b and c are integers, while the multiplication is exact. The multiplication may cause overflow.

Multiple divisions can be combined. For example, a1/b1 + a2/b2 should be replaced by (a1*b2 + a2*b1) / (b1*b2) which has one division instead of two. The trick of using a common denominator can even be used on completely independent divisions. Example:

double a1, a2, b1, b2, y1, y2;
y1 = a1 / b1;
y2 = a2 / b2;

This can be changed to:

double a1, a2, b1, b2, y1, y2, reciprocal_divisor;
reciprocal_divisor = 1. / (b1 * b2);
y1 = a1 * b2 * reciprocal_divisor;
y2 = a2 * b1 * reciprocal_divisor;

You can even do four divisions in one:

double a1, a2, a3, a4, b1, b2, b3, b4, y1, y2, y3, y4;
y1 = a1 / b1;
y2 = a2 / b2;
y3 = a3 / b3;
y4 = a4 / b4;

can be replaced by:

double a1, a2, a3, a4, b1, b2, b3, b4, y1, y2, y3, y4;
double b12, b34, reciprocal_divisor;
b12 = b1 * b2;
b34 = b3 * b4;
reciprocal_divisor = 1. / (b12 * b34);
y1 = a1 * b2 * b34 * reciprocal_divisor;
y2 = a2 * b1 * b34 * reciprocal_divisor;
y3 = a3 * b4 * b12 * reciprocal_divisor;
y4 = a4 * b3 * b12 * reciprocal_divisor;

It is not recommended to combine more than four divisions, because the time saved by having only one division would be spent on the increased number of multiplications.

3.8 Function calls

When a function with parameters is called, the parameters are stored on the stack by the caller and read again by the called function. This causes some delay if a parameter is part