CUBLAS and MAGMA by example.pdf


1. HPL.out is used as the output file if the number in the next line is not equal to 6 or 7.

2. The number 6 means that the output goes to stdout. If it is replaced by 5 (for example), then the output goes to HPL.out.

3. The number 1 in the third line means that we want to solve exactly one system.

4. The number 100000 denotes the size of the system. Large systems can give better performance but need more memory.

5. The number 1 in the fifth line means that we shall try only one data block size.

6. The number 768 denotes the block size. The number should be a multiple of 128. It can be selected experimentally.

7. The 0 in the next line denotes row-major process mapping (not changed in the sample HPL.dat file).

8. The next 1 denotes the number of grids used (in our example only one). Since we test four cards (two nodes with two GPUs in each), we choose one PxQ=2x2 grid. The product P*Q should be equal to the total number of tested GPUs.

9. The number 2 means that the first dimension of the grid is P=2.

10. The number 2 in the next line means that the second dimension of the grid is Q=2.
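The constraints in the list above can be checked before a long run. The following Python sketch is not part of HPL; the helper name `check_hpl_params` is hypothetical, and the memory estimate assumes 8-byte double-precision matrix entries and ignores workspace.

```python
def check_hpl_params(n, nb, p, q, num_gpus):
    """Check the constraints from the list above and estimate matrix memory.

    Hypothetical helper; assumes 8-byte (double precision) matrix entries.
    """
    problems = []
    if nb % 128 != 0:
        problems.append(f"NB={nb} is not a multiple of 128")
    if p * q != num_gpus:
        problems.append(f"P*Q={p * q} differs from the number of GPUs ({num_gpus})")
    matrix_gb = n * n * 8 / 1e9  # the whole N x N matrix, in units of 10^9 bytes
    return problems, matrix_gb


# Values from the sample HPL.dat described above: four GPUs on a 2x2 grid.
problems, matrix_gb = check_hpl_params(n=100000, nb=768, p=2, q=2, num_gpus=4)
print(problems)                               # empty list if both constraints hold
print(f"{matrix_gb:.1f} GB for the matrix")   # 80.0 GB for N=100000
print(f"{matrix_gb / 4:.1f} GB per process")  # matrix share of each of the 4 processes
```

Doubling N quadruples the memory estimate, which is why the system size is limited by the available RAM.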

2.2 Test results

At our disposal we had two nodes running Red Hat 6.3 with CUDA 5.5 and the following hardware:

two-socket Xeon E5-2650 CPU, 2.00 GHz,

two Tesla S2050 GPUs,

256 GB RAM,

Gigabit Ethernet.


2.2.1 One Tesla S2050 GPU (428.9 GFlop/s)

For one GPU we have used P=1, Q=1 parameters in HPL.dat and have obtained the following results.

$ mpirun -np 1 ./run_linpack
=======================================================================
T/V              N     NB    P    Q          Time           Gflops
-----------------------------------------------------------------------
WR10L2L2    100000    768    1    1       1554.29        4.289e+02
-----------------------------------------------------------------------
||Ax-b||_oo/(eps*(||A||_oo*||x||_oo+||b||_oo)*N)=0.0039050 ...PASSED
=======================================================================
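The Gflops column can be reproduced from N and Time. The sketch below assumes the standard operation count used by HPL for an LU solve, 2/3*N^3 + 3/2*N^2 floating-point operations; the function name `hpl_gflops` is a hypothetical helper, not part of HPL.

```python
def hpl_gflops(n, seconds):
    """Gflop/s for a solve of size n, assuming 2/3*n^3 + 3/2*n^2 operations."""
    flops = (2.0 / 3.0) * n**3 + 1.5 * n**2
    return flops / seconds / 1e9


# One GPU: N=100000, Time=1554.29 s (values from the output above).
print(f"{hpl_gflops(100000, 1554.29):.1f}")  # 428.9
```

The same formula can be applied to the timings reported for the other configurations in this section.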

2.2.2 Two Tesla S2050 GPUs (679.0 GFlop/s)

For two GPUs we have used P=1, Q=2 parameters in HPL.dat and have obtained the following results.

$ mpirun -np 2 ./run_linpack
=======================================================================
T/V              N     NB    P    Q          Time           Gflops
-----------------------------------------------------------------------
WR10L2L2    100000    768    1    2        981.87        6.790e+02
-----------------------------------------------------------------------
||Ax-b||_oo/(eps*(||A||_oo*||x||_oo+||b||_oo)*N)=0.0035832 ...PASSED
=======================================================================

Remark. For two CPUs, using the CPU Linpack we have obtained 273.8 GFlop/s for N=100000.

2.2.3 Four Tesla S2050 GPUs (1363 GFlop/s)

For four GPUs we have used P=2, Q=2 parameters in HPL.dat and have obtained the following results.

# For N=100000
$ mpirun -np 4 -host node1,node2 ./run_linpack
=======================================================================
T/V              N     NB    P    Q          Time           Gflops
-----------------------------------------------------------------------
WR10L2L2    100000    768    2    2        561.98        1.186e+03
-----------------------------------------------------------------------
||Ax-b||_oo/(eps*(||A||_oo*||x||_oo+||b||_oo)*N)=0.0037021 ...PASSED
=======================================================================

# For N=200000
$ mpirun -np 4 -host node1,node2 ./run_linpack
=======================================================================
T/V              N     NB    P    Q          Time           Gflops
-----------------------------------------------------------------------
WR10L2L2    200000   1024    2    2       3912.98        1.363e+03
-----------------------------------------------------------------------
||Ax-b||_oo/(eps*(||A||_oo*||x||_oo+||b||_oo)*N)=0.0038225 ...PASSED
=======================================================================

Remark. Setting the number of solved systems to 20 and their size to 200000, we have checked that the system is able to sustain a performance of 1300 GFlop/s over 30 hours.
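To put the S2050 results side by side, the following sketch computes the speedup and parallel efficiency relative to the one-GPU run. It is only arithmetic on the figures reported above; the helper name `scaling` is hypothetical, and the N=200000 row is compared against the N=100000 one-GPU baseline, so that row is not a like-for-like comparison.

```python
ONE_GPU = 428.9  # GFlop/s, one Tesla S2050, N=100000 (section 2.2.1)

# (number of GPUs, measured GFlop/s), taken from the outputs above
runs = {
    "2 GPUs, N=100000": (2, 679.0),
    "4 GPUs, N=100000": (4, 1186.0),
    "4 GPUs, N=200000": (4, 1363.0),
}


def scaling(gflops, gpus, baseline=ONE_GPU):
    """Return (speedup over one GPU, efficiency = speedup / number of GPUs)."""
    speedup = gflops / baseline
    return speedup, speedup / gpus


for label, (gpus, gflops) in runs.items():
    speedup, efficiency = scaling(gflops, gpus)
    print(f"{label}: speedup {speedup:.2f}, efficiency {efficiency:.0%}")
```

At N=100000 this gives a speedup of about 1.58 on two GPUs and about 2.77 on four GPUs across two nodes.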

2.2.4 Two Tesla K20m GPUs (1789 GFlop/s)

For two Kepler GPUs and a two-socket node with Xeon E5-2665 CPUs we have used P=1, Q=2 parameters in HPL.dat and have obtained the following results.

$ mpirun -np 2 ./run_linpack
=======================================================================
T/V              N     NB    P    Q          Time           Gflops
-----------------------------------------------------------------------
WR10L2L2    100000    768    1    2        372.74        1.789e+03
-----------------------------------------------------------------------
||Ax-b||_oo/(eps*(||A||_oo*||x||_oo+||b||_oo)*N)=0.0030869 ...PASSED
=======================================================================

Remark. For two E5-2665 CPUs, using the CPU Linpack we have obtained 307.16 GFlop/s for N=100000.
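The two remarks on the CPU Linpack allow a rough comparison of each GPU-accelerated run with its CPU-only counterpart at N=100000. The sketch below is plain arithmetic on the reported figures; the dictionary layout and names are illustrative assumptions.

```python
# (GPU-accelerated GFlop/s, CPU-only GFlop/s), both for N=100000, from the text
results = {
    "2x S2050 vs 2x E5-2650": (679.0, 273.8),
    "2x K20m vs 2x E5-2665": (1789.0, 307.16),
}

# Ratio of the GPU-accelerated result to the CPU-only result on the same node
ratios = {label: gpu / cpu for label, (gpu, cpu) in results.items()}

for label, ratio in ratios.items():
    print(f"{label}: {ratio:.2f}x")
```

With these figures the ratio is about 2.48 for the S2050 node and about 5.82 for the K20m node.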
