- Foreword
- CUDA installation
  - Installing CUDA environment
- Measuring GPUs performance
  - Linpack benchmark for CUDA
  - Tests results
    - One Tesla S2050 GPU (428.9 GFlop/s)
    - Two Tesla S2050 GPUs (679.0 GFlop/s)
    - Four Tesla S2050 GPUs (1363 GFlop/s)
    - Two Tesla K20m GPUs (1789 GFlop/s)
- CUBLAS by example
  - General remarks on the examples
  - CUBLAS Level-1. Scalar and vector based operations
    - cublasIsamax, cublasIsamin - maximal, minimal elements
    - cublasSasum - sum of absolute values
    - cublasScopy - copy vector into vector
    - cublasSdot - dot product
    - cublasSnrm2 - Euclidean norm
    - cublasSrot - apply the Givens rotation
    - cublasSrotg - construct the Givens rotation matrix
    - cublasSscal - scale the vector
    - cublasSswap - swap two vectors
  - CUBLAS Level-2. Matrix-vector operations
    - cublasSger - rank one update
    - cublasStbsv - solve the triangular banded linear system
    - cublasStpsv - solve the packed triangular linear system
    - cublasStrsv - solve the triangular linear system
  - CUBLAS Level-3. Matrix-matrix operations
    - cublasStrsm - solving the triangular linear system
- MAGMA by example
  - General remarks on Magma
  - Remarks on installation and compilation
  - Remarks on hardware used in examples
  - Magma BLAS
  - LU decomposition and solving general linear systems
  - QR decomposition and the least squares solution of general systems
  - Eigenvalues and eigenvectors for general matrices
  - Eigenvalues and eigenvectors for symmetric matrices
  - Singular value decomposition
1. HPL.out is used as the output file if the number in the next line is not equal to 6 or 7.
2. The number 6 means that the output goes to stdout. If it is replaced by 5 (for example), then the output goes to HPL.out.
3. The number 1 in the third line means that we want to solve exactly one system.
4. The number 100000 denotes the size of the system. Large systems can give better performance but need more memory.
5. The number 1 in the fifth line means that we shall try only one data block size.
6. The number 768 denotes the block size. It should be a multiple of 128 and can be selected experimentally.
7. The 0 in the next line denotes row-major process mapping (not changed in the sample HPL.dat file).
8. The next 1 denotes the number of process grids used (in our example only one). Since we have two nodes with two GPUs in each, to test four cards we choose one PxQ=2x2 grid. PxQ should be equal to the total number of tested GPUs.
9. The number 2 means that the first dimension of the grid is P=2.
10. The number 2 in the next line means that the second dimension of the grid is Q=2.
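Putting the items above together, the relevant fragment of HPL.dat looks roughly as follows (a sketch based on the stock HPL.dat layout; the trailing comments come from the standard file):

```
HPLinpack benchmark input file
Innovative Computing Laboratory, University of Tennessee
HPL.out      output file name (if any)
6            device out (6=stdout,7=stderr,file)
1            # of problems sizes (N)
100000       Ns
1            # of NBs
768          NBs
0            PMAP process mapping (0=Row-,1=Column-major)
1            # of process grids (P x Q)
2            Ps
2            Qs
```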
2.2 Tests results
At our disposal we had two nodes running Red Hat 6.3 and CUDA 5.5, with the following hardware:
- two-socket Xeon E5-2650 CPUs, 2.00 GHz,
- two Tesla S2050 GPUs,
- 256 GB RAM,
- Gigabit Ethernet.
2.2.1 One Tesla S2050 GPU (428.9 GFlop/s)
For one GPU we have used the parameters P=1, Q=1 in HPL.dat and have obtained the following results.
$ mpirun -np 1 ./run_linpack
=======================================================================
T/V N NB P Q Time Gflops
-----------------------------------------------------------------------
WR10L2L2     100000    768     1     1     1554.29    4.289e+02
-----------------------------------------------------------------------
||Ax-b||_oo/(eps*(||A||_oo*||x||_oo+||b||_oo)*N)=0.0039050 ...PASSED
=======================================================================
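The reported Gflops figure can be reproduced from the N and Time columns: HPL counts (2/3)N^3 + 2N^2 floating-point operations for one solve. A minimal sketch (the helper name hpl_gflops is ours, not part of HPL):

```python
# Reproduce HPL's reported performance from the problem size and wall time.
# HPL counts (2/3)*N^3 + 2*N^2 floating-point operations for one solve.
def hpl_gflops(n, seconds):
    flops = (2.0 / 3.0) * n**3 + 2.0 * n**2
    return flops / seconds / 1e9  # operations per second, in GFlop/s

print(hpl_gflops(100000, 1554.29))  # about 428.9, the value reported above
```

The same formula recovers the other runs in this section, e.g. N=100000 in 981.87 s gives about 679 GFlop/s.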
2.2.2 Two Tesla S2050 GPUs (679.0 GFlop/s)
For two GPUs we have used the parameters P=1, Q=2 in HPL.dat and have obtained the following results.
$ mpirun -np 2 ./run_linpack
=======================================================================
T/V N NB P Q Time Gflops
-----------------------------------------------------------------------
WR10L2L2     100000    768     1     2      981.87    6.790e+02
-----------------------------------------------------------------------
||Ax-b||_oo/(eps*(||A||_oo*||x||_oo+||b||_oo)*N)=0.0035832 ...PASSED
=======================================================================
Remark. For two CPUs, using the CPU Linpack we have obtained 273.8 GFlop/s for N=100000.
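Each run ends with the scaled residual check printed above; HPL declares PASSED when this quantity stays below a threshold (16.0 in the stock HPL.dat). A tiny pure-Python sketch of the same formula on a hypothetical 2x2 system:

```python
import sys

# Scaled residual that HPL reports:
#   ||Ax - b||_oo / (eps * (||A||_oo * ||x||_oo + ||b||_oo) * N)
# illustrated on a hypothetical 2x2 system solved by Cramer's rule.
A = [[4.0, 1.0], [1.0, 3.0]]
b = [1.0, 2.0]
det = A[0][0] * A[1][1] - A[0][1] * A[1][0]
x = [(b[0] * A[1][1] - A[0][1] * b[1]) / det,
     (A[0][0] * b[1] - b[0] * A[1][0]) / det]

inf = lambda v: max(abs(t) for t in v)                 # vector oo-norm
r = [sum(A[i][j] * x[j] for j in range(2)) - b[i] for i in range(2)]
norm_A = max(sum(abs(a) for a in row) for row in A)    # matrix oo-norm
eps = sys.float_info.epsilon
scaled = inf(r) / (eps * (norm_A * inf(x) + inf(b)) * len(b))
print(scaled < 16.0)  # True: far below the PASS threshold
```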
2.2.3 Four Tesla S2050 GPUs (1363 GFlop/s)
For four GPUs we have used the parameters P=2, Q=2 in HPL.dat and have obtained the following results.
# For N=100000
$ mpirun -np 4 -host node1,node2 ./run_linpack
=======================================================================
T/V N NB P Q Time Gflops
-----------------------------------------------------------------------
WR10L2L2     100000    768     2     2      561.98    1.186e+03
-----------------------------------------------------------------------
||Ax-b||_oo/(eps*(||A||_oo*||x||_oo+||b||_oo)*N)=0.0037021 ...PASSED
=======================================================================
# For N=200000
$ mpirun -np 4 -host node1,node2 ./run_linpack
=======================================================================
T/V N NB P Q Time Gflops
-----------------------------------------------------------------------
WR10L2L2     200000   1024     2     2     3912.98    1.363e+03
-----------------------------------------------------------------------
||Ax-b||_oo/(eps*(||A||_oo*||x||_oo+||b||_oo)*N)=0.0038225 ...PASSED
=======================================================================
Remark. Setting the number of solved systems to 20 and their size to 200000, we have checked that the system is able to sustain about 1300 GFlop/s for over 30 hours.
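As a rough check on scaling, the figures above give a parallel efficiency of about 0.79 relative to the single-GPU run (a back-of-the-envelope sketch; note that the 4-GPU figure was obtained for N=200000, so the comparison is only indicative):

```python
# Parallel efficiency of the multi-GPU runs relative to one S2050 (428.9 GFlop/s).
single = 428.9                        # GFlop/s, one GPU, N=100000
runs = {2: 679.0, 4: 1363.0}          # number of GPUs -> measured GFlop/s
for gpus, gflops in runs.items():
    efficiency = gflops / (gpus * single)
    print(f"{gpus} GPUs: speedup {gflops / single:.2f}x, efficiency {efficiency:.2f}")
```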
2.2.4 Two Tesla K20m GPUs (1789 GFlop/s)
For two Kepler GPUs and two-socket Xeon E5-2665 CPUs we have used the parameters P=1, Q=2 in HPL.dat and have obtained the following results.
$ mpirun -np 2 ./run_linpack
=======================================================================
T/V N NB P Q Time Gflops
-----------------------------------------------------------------------
WR10L2L2     100000    768     1     2      372.74    1.789e+03
-----------------------------------------------------------------------
||Ax-b||_oo/(eps*(||A||_oo*||x||_oo+||b||_oo)*N)=0.0030869 ...PASSED
=======================================================================
Remark. For two E5-2665 CPUs, using the CPU Linpack we have obtained 307.16 GFlop/s for N=100000.
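For comparison with the CPU-only Linpack figures quoted in the remarks, a quick speedup calculation (labels are ours; all runs use N=100000):

```python
# GPU vs CPU-only Linpack speedups from the remarks above (N=100000 in each case).
pairs = {
    "2x S2050 vs 2x E5-2650": (679.0, 273.8),
    "2x K20m  vs 2x E5-2665": (1789.0, 307.16),
}
for label, (gpu, cpu) in pairs.items():
    print(f"{label}: {gpu / cpu:.1f}x")
```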