- Foreword
- CUDA installation
- Installing CUDA environment
- Measuring GPUs performance
- Linpack benchmark for CUDA
- Tests results
- One Tesla S2050 GPU (428.9 GFlop/s)
- Two Tesla S2050 GPUs (679.0 GFlop/s)
- Four Tesla S2050 GPUs (1363 GFlop/s)
- Two Tesla K20m GPUs (1789 GFlop/s)
- CUBLAS by example
- General remarks on the examples
- CUBLAS Level-1. Scalar and vector based operations
- cublasIsamax, cublasIsamin - maximal, minimal elements
- cublasSasum - sum of absolute values
- cublasScopy - copy vector into vector
- cublasSdot - dot product
- cublasSnrm2 - Euclidean norm
- cublasSrot - apply the Givens rotation
- cublasSrotg - construct the Givens rotation matrix
- cublasSscal - scale the vector
- cublasSswap - swap two vectors
- CUBLAS Level-2. Matrix-vector operations
- cublasSger - rank one update
- cublasStbsv - solve the triangular banded linear system
- cublasStpsv - solve the packed triangular linear system
- cublasStrsv - solve the triangular linear system
- CUBLAS Level-3. Matrix-matrix operations
- cublasStrsm - solving the triangular linear system
- MAGMA by example
- General remarks on Magma
- Remarks on installation and compilation
- Remarks on hardware used in examples
- Magma BLAS
- LU decomposition and solving general linear systems
- QR decomposition and the least squares solution of general systems
- Eigenvalues and eigenvectors for general matrices
- Eigenvalues and eigenvectors for symmetric matrices
- Singular value decomposition
Chapter 2
Measuring GPUs performance
2.1 Linpack benchmark for CUDA
Registered developers can download a version of the Linpack benchmark prepared specifically for CUDA from https://developer.nvidia.com/. In August 2013 the current version for Tesla cards was hpl-2.0_FERMI_v15.tgz.
After uncompressing the archive one obtains the directory hpl-2.0_FERMI_v15. We enter the directory
$ cd hpl-2.0_FERMI_v15
The file INSTALL contains installation instructions. The example file Make.CUDA should be edited. On our system we edited (only) the following lines:
TOPdir = $(HOME)/hpl-2.0_FERMI_v15
MPdir  = /usr/lib64/openmpi               # Redhat/Centos default
MPinc  = -I/usr/include/openmpi-x86_64    # for OpenMPI
MPlib  = -L/usr/lib64/openmpi/lib
LAdir  = /opt/intel/mkl/lib/intel64       # MKL presence assumed !!!
LAinc  = -I/opt/intel/mkl/include
LAlib  = -L$(TOPdir)/src/cuda -ldgemm -L/usr/local/cuda/lib64 -lcuda -lcudart -lcublas -L$(LAdir) -lmkl_intel_lp64 -lmkl_intel_thread -lmkl_core -liomp5
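Before compiling, it can be useful to confirm that the directories named in Make.CUDA actually exist on the machine. The following Python sketch is one possible way to do that; it is not part of the HPL package. The function names (read_make_vars, expand, missing_directories) are hypothetical, and the parser only understands simple `NAME = value` lines with `$(NAME)` references, which is an assumption about how the edited file is written.

```python
import os
import re


def read_make_vars(text):
    """Collect simple `NAME = value` assignments, dropping trailing # comments."""
    variables = {}
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()
        match = re.match(r"^(\w+)\s*=\s*(.*)$", line)
        if match:
            variables[match.group(1)] = match.group(2).strip()
    return variables


def expand(value, variables):
    """Expand $(NAME) from the file, falling back to the environment."""
    for _ in range(10):  # bounded, so a self-referencing variable cannot loop forever
        expanded = re.sub(
            r"\$\((\w+)\)",
            lambda m: variables.get(m.group(1), os.environ.get(m.group(1), "")),
            value,
        )
        if expanded == value:
            break
        value = expanded
    return value


def missing_directories(text):
    """Return (variable, path) pairs whose directory does not exist."""
    variables = read_make_vars(text)
    missing = []
    for name, raw in variables.items():
        for token in expand(raw, variables).split():
            if token.startswith(("-L", "-I")):
                path = token[2:]          # directory given to the linker or compiler
            elif name.endswith("dir"):
                path = token              # variables such as TOPdir, MPdir, LAdir
            else:
                continue                  # e.g. -lmkl_core: a library, not a directory
            if not os.path.isdir(path):
                missing.append((name, path))
    return missing


if __name__ == "__main__":
    with open("Make.CUDA") as handle:
        for name, path in missing_directories(handle.read()):
            print(f"{name}: directory not found: {path}")
```

Run from the top-level directory of the package, the script prints one line per missing directory and nothing if all of them are present.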
In this directory we can run the compilation
$ make
which creates a new executable xhpl in hpl-2.0_FERMI_v15/bin/CUDA. We can enter that directory
$ cd bin/CUDA
and edit two files, run_linpack and HPL.dat. For example, in the run_linpack script file we edited (only) the two lines
HPL_DIR=$HOME/hpl-2.0_FERMI_v15
CPU_CORES_PER_GPU=8
(two eight-core CPUs + two S2050 GPUs in each of two nodes). The file HPL.dat contains the description of the problem to be solved. Linpack solves dense N x N systems of linear equations in double precision. Users can specify in HPL.dat the number of problems, their sizes and some other parameters. A detailed description of this file can be found in hpl-2.0_FERMI_v15/TUNING.
For our benchmarks we edited the sample HPL.dat file:
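To get a feeling for what a given problem size means, one can estimate the storage of the dense N x N matrix and the number of floating point operations credited to the solve. The Python sketch below is only an illustration under two assumptions: each matrix entry is an 8-byte double, and the operation count is the conventional LU-based figure 2/3 N^3 + 3/2 N^2 used when reporting Linpack rates. The function names are hypothetical, and N = 100000 is the problem size used in this chapter's HPL.dat.

```python
def hpl_matrix_bytes(n):
    """Bytes needed to store a dense n x n matrix of 8-byte doubles."""
    return 8 * n * n


def hpl_flop_count(n):
    """Operation count conventionally credited for solving an n x n system."""
    return (2.0 / 3.0) * n**3 + 1.5 * n**2


def gflops(n, seconds):
    """Rate in GFlop/s for a solve of size n that took `seconds` of wall time."""
    return hpl_flop_count(n) / seconds / 1e9


if __name__ == "__main__":
    n = 100_000
    print(f"matrix storage: {hpl_matrix_bytes(n) / 1e9:.1f} GB")
    print(f"operations:     {hpl_flop_count(n):.3e}")
```

The matrix is distributed over all MPI processes, so the storage figure is a total over the participating nodes rather than a per-node requirement.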
HPLinpack benchmark input file
Innovative Computing Laboratory, University of Tennessee
HPL.out      output file name (if any)
6            device out (6=stdout,7=stderr,file)
1            # of problems sizes (N)
100000       Ns
1            # of NBs
768          NBs
0            PMAP process mapping (0=Row-,1=Column-major)
1            # of process grids (P x Q)
2            Ps
2            Qs
16.0         threshold
1            # of panel fact
0 1 2        PFACTs (0=left, 1=Crout, 2=Right)
1            # of recursive stopping criterium
2 8          NBMINs (>= 1)
1            # of panels in recursion
2            NDIVs
1            # of recursive panel fact.
0 1 2        RFACTs (0=left, 1=Crout, 2=Right)
1            # of broadcast
0 2          BCASTs (0=1rg,1=1rM,2=2rg,3=2rM,4=Lng,5=LnM)
1            # of lookahead depth
1 0          DEPTHs (>=0)
1            SWAP (0=bin-exch,1=long,2=mix)
192          swapping threshold
1            L1 in (0=transposed,1=no-transposed) form
1            U in (0=transposed,1=no-transposed) form
1            Equilibration (0=no,1=yes)
8            memory alignment in double (> 0)
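The layout of HPL.dat (a value or list of values first, then a free-form description) can be read mechanically. The Python sketch below parses the entries for problem sizes, block sizes and process grids and derives two quantities from them: the number of MPI processes P x Q for each grid and the number of block columns of width NB. It is a minimal illustration with hypothetical function names, not a tool shipped with HPL, and it assumes that only as many values are read from a line as the preceding count line announces.

```python
SAMPLE = """\
HPLinpack benchmark input file
Innovative Computing Laboratory, University of Tennessee
HPL.out      output file name (if any)
6            device out (6=stdout,7=stderr,file)
1            # of problems sizes (N)
100000       Ns
1            # of NBs
768          NBs
0            PMAP process mapping (0=Row-,1=Column-major)
1            # of process grids (P x Q)
2            Ps
2            Qs
"""


def leading_ints(line, count):
    """Return the first `count` integers on a line, ignoring the description."""
    return [int(token) for token in line.split()[:count]]


def parse_hpl_header(text):
    """Parse the ten lines of HPL.dat that follow the two title lines."""
    lines = text.splitlines()[2:]  # skip the two free-form title lines
    n_count = leading_ints(lines[2], 1)[0]
    nb_count = leading_ints(lines[4], 1)[0]
    grid_count = leading_ints(lines[7], 1)[0]
    return {
        "output_file": lines[0].split()[0],
        "device": leading_ints(lines[1], 1)[0],
        "Ns": leading_ints(lines[3], n_count),
        "NBs": leading_ints(lines[5], nb_count),
        "pmap": leading_ints(lines[6], 1)[0],
        "grids": list(zip(leading_ints(lines[8], grid_count),
                          leading_ints(lines[9], grid_count))),
    }


if __name__ == "__main__":
    header = parse_hpl_header(SAMPLE)
    for p, q in header["grids"]:
        print(f"grid {p} x {q}: {p * q} MPI processes")
    for n in header["Ns"]:
        for nb in header["NBs"]:
            # ceiling division: the last block column may be narrower than NB
            print(f"N={n}, NB={nb}: {-(-n // nb)} block columns")
```

For the file above the 2 x 2 grid gives four MPI processes, which would correspond to one process per GPU on the two-node, four-GPU configuration described earlier; that correspondence is an inference and is not stated in the package documentation quoted here.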
Let us comment on the first ten lines of HPL.dat (beginning from HPL.out). The remaining lines were left unchanged.