CHAPTER 4. LAPACK-LEVEL ROUTINES ON SINGLE-GPU ARCHITECTURES
Reduction to tridiagonal form:

        n    LAPACK (PECO)    SBR (PECO)    SBR (PECO+GPU)
     2048             0.23          0.6              0.58
     6144             8.4           8.58             6.26
    10240            40.5          30.4             20.32
    24576           582.4         308.4            166.8

Reduction to tridiagonal form and back-transform:

        n    LAPACK (PECO)    SBR (PECO)    SBR (PECO+GPU)
     2048             0.50          1.65             1.39
     6144            13.5          25.6             14.6
    10240            61.6         101.8             47.5
    24576           845.1        1207.2            314.0
Table 4.6: Comparison of the execution time (in seconds) of the LAPACK and SBR routines on PECO, and of the SBR routines accelerated by the GPU of PECO (PECO+GPU).
Comparing the two approaches
Although the routines that tackle the symmetric eigenvalue problem are structured as a sequence of steps, these steps are not independent and, in general, the parameters of each step cannot be tuned separately. For example, the bandwidth must be kept constant across all the routines involved in the reduction, whereas the block size can be adjusted for each routine. Additionally, on multi-core processors one may choose the degree of parallelism for each routine by fixing the number of threads employed for its execution. Consider, for example, the reduction to tridiagonal form of a problem of size n = 10240 on the multi-core processor of PECO using the SBR routines. For bandwidths w = 32, 64 and 96, the best timings for the reduction to banded form using the corresponding SBR routine are 112.6, 29.6 and 16.1 seconds, attained with 1, 4 and 8 cores, respectively. The cost of the next stage, the reduction from banded to tridiagonal form, is minimized when a single core is used, yielding 9.51, 11.7 and 15.1 seconds for bandwidths 32, 64 and 96, respectively. Overall, the best combination, totaling 31.2 seconds, corresponds to bandwidth 96, using 8 cores for the first step and a single core for the second.
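The tuning procedure above amounts to a small exhaustive search over the coupled parameters: the bandwidth w ties the two stages together, while the core count can be chosen per stage. A minimal sketch using the timings quoted above for n = 10240 on PECO (the dictionary layout itself is illustrative, not part of the SBR interface):

```python
# Best per-stage timings (seconds) for n = 10240 on PECO, as quoted above.
# Stage 1 (full to banded): best core count already chosen per bandwidth.
# Stage 2 (banded to tridiagonal): minimized with a single core.
stage1 = {32: 112.6, 64: 29.6, 96: 16.1}
stage2 = {32: 9.51, 64: 11.7, 96: 15.1}

# The bandwidth w must be the same in both stages, so it is the only
# parameter that couples them; everything else was optimized per stage.
best_w = min(stage1, key=lambda w: stage1[w] + stage2[w])
total = stage1[best_w] + stage2[best_w]
print(best_w, round(total, 1))   # → 96 31.2
```

The search confirms that the cheapest first stage (w = 96) wins overall despite having the most expensive second stage.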
Table 4.6 collects the results of an experimental comparison of the two approaches: LAPACK and SBR on the multi-core processor of PECO, and SBR with all steps except the reduction from banded to tridiagonal form off-loaded to the GPU of this platform (labeled as “PECO+GPU”). For small and medium problem sizes LAPACK is the fastest approach. For the largest dimensions, the SBR approach greatly benefits from the acceleration enabled by the GPU, and outperforms LAPACK in both the reduction and the back-transform stages.
In the reduction stage, the GPU delivers speed-ups of 1.5× and 1.8× for the two largest problem sizes with respect to the best option (SBR or LAPACK) on the multi-core processor. When the back-transform is also required, the speed-ups for these problem sizes become 1.3× and 2.7×.
4.8. Conclusions
In this chapter, we have demonstrated how the GPU can be a reliable platform for the acceleration of higher-level dense linear algebra routines. As driving examples, we have chosen representative and widely used LAPACK-level operations to illustrate a number of techniques that, following a high-level approach, improve the performance of the implementations.
In the first part of the chapter, we have addressed the Cholesky factorization and the LU factorization with partial pivoting. The use of a high-level approach allows us to systematically derive and implement a number of algorithmic variants and, among them, to choose the most convenient one for a given architecture or BLAS implementation.
The implementation of blocked and unblocked routines for both operations has yielded a collection of conclusions from the study developed in the chapter. First, the use of blocked implementations is a must on current graphics processors: the properties of modern GPUs make them a platform of special appeal for blocked computations. In contrast, one of the main strengths of general-purpose multi-core processors is their efficiency when operating on small datasets. This divergence naturally leads to the design and development of hybrid algorithms, in which CPU and GPU collaborate in the solution of a problem. In our case, the hybrid approach has been successfully applied to both operations, and the experimental results validate the advantages of the solution.
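The hybrid idea can be illustrated on the blocked right-looking Cholesky factorization: the small diagonal-block factorization stays on the CPU, while the large panel solve and trailing update are the parts that would be off-loaded to the GPU. The sketch below runs entirely on the CPU with NumPy and only indicates, in comments, which kernels the hybrid code moves to the device; it is not the thesis implementation:

```python
import numpy as np

def chol_blocked(A, nb=64):
    """Blocked right-looking Cholesky factorization (lower triangular), in place.

    Hybrid mapping (indicated per step): the nb x nb diagonal factorization
    is small and latency-bound, so it runs on the CPU; the panel solve
    (trsm) and the trailing update (syrk) are large Level-3 operations
    that the hybrid algorithm off-loads to the GPU.
    """
    n = A.shape[0]
    for k in range(0, n, nb):
        b = min(nb, n - k)
        # Diagonal block: A11 := chol(A11)  (CPU in the hybrid code).
        A[k:k+b, k:k+b] = np.linalg.cholesky(A[k:k+b, k:k+b])
        if k + b < n:
            L11 = A[k:k+b, k:k+b]
            # Panel: A21 := A21 * L11^{-T}  (trsm; GPU in the hybrid code).
            A[k+b:, k:k+b] = np.linalg.solve(L11, A[k+b:, k:k+b].T).T
            A21 = A[k+b:, k:k+b]
            # Trailing update: A22 := A22 - A21 * A21^T  (syrk; GPU).
            A[k+b:, k+b:] -= A21 @ A21.T
    return np.tril(A)
```

The block size nb plays the same role as in the experiments of this chapter: it trades the amount of (slow) CPU work on the diagonal against the size of the (fast) Level-3 updates.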
Double precision is most often required in scientific codes. We have identified the poor performance of modern GPUs when operating on double-precision data. To address this drawback in the context of the solution of systems of linear equations, we propose a mixed-precision iterative-refinement approach, in which the major part of the computation is performed in single precision, while CPU and GPU collaborate to regain double-precision accuracy. Experimental results show that this approach can exploit the single-precision performance of modern GPUs while delivering the accuracy of double precision.
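The mixed-precision strategy can be sketched in a few lines: factor and solve in single precision (the GPU-friendly part), then refine the solution with residuals accumulated in double precision on the host. The sketch below uses NumPy on the CPU only and, for brevity, calls a full solve where a real implementation would factorize the single-precision matrix once and reuse its factors for every correction:

```python
import numpy as np

def mixed_precision_solve(A, b, tol=1e-12, max_iter=30):
    """Solve Ax = b combining single-precision solves with
    double-precision iterative refinement.

    In the CPU/GPU version the single-precision factorization and
    triangular solves run on the GPU, while the residual b - A*x is
    computed in double precision.
    """
    A32 = A.astype(np.float32)
    # Initial solution in single precision (stands in for the GPU solve).
    x = np.linalg.solve(A32, b.astype(np.float32)).astype(np.float64)
    for _ in range(max_iter):
        r = b - A @ x                    # residual in double precision
        if np.linalg.norm(r) <= tol * np.linalg.norm(b):
            break
        # Correction solved cheaply in single precision.
        d = np.linalg.solve(A32, r.astype(np.float32))
        x += d.astype(np.float64)
    return x
```

For reasonably conditioned systems a handful of refinement steps recovers double-precision accuracy, which is what makes the fast single-precision GPU arithmetic usable here.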
For the symmetric eigenvalue problem, we have evaluated the performance of existing codes for the reduction of a dense matrix to tridiagonal form and the back-transform. Our experimental results confirm that the two-stage approach proposed in the SBR toolbox (reduction from full to banded form in the first stage, followed by a reduction from banded to tridiagonal form in the second) delivers higher parallel scalability than the LAPACK-based alternative on general-purpose multi-core architectures. However, when the orthogonal factors that define the back-transform have to be constructed and applied in the last stage, the SBR approach incurs a computation time considerably larger than that of LAPACK.
The use of a hardware accelerator like the GPU changes this picture dramatically. By off-loading the Level-3 BLAS operations in the SBR codes to the GPU, remarkable speed-ups are attained, to the point that the SBR toolbox becomes a competitive alternative to the standard LAPACK-based algorithm. The reward did not come effortless, though. Specifically, the advantages came from two improvements: first, the application of the routines developed in Chapter 3 for the rank-2k update and the symmetric matrix-matrix product; second, a careful modification of the SBR routines to exploit the hardware elements of the hybrid CPU-GPU architecture and to minimize the number of data transfers between the host and the device memory spaces.
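The performance-critical kernel off-loaded here is the symmetric rank-2k update of the trailing submatrix in the first SBR stage, A := A - U Wᵀ - W Uᵀ. A minimal NumPy rendering (the function name and shapes are illustrative; the GPU code keeps A, U and W resident in device memory and invokes the tuned syr2k kernel instead):

```python
import numpy as np

def syr2k_update(A, U, W):
    """Symmetric rank-2k trailing update: A := A - U @ W.T - W @ U.T.

    A is the symmetric trailing submatrix; U and W are the n x k panels
    produced by the current block of Householder transforms. Keeping the
    operands on the GPU between consecutive panels is what minimizes the
    host-device transfers mentioned above.
    """
    A -= U @ W.T + W @ U.T
    return A
```

Note that the update preserves the symmetry of A, which is why a specialized syr2k kernel (touching only one triangle) can halve the work relative to two general matrix products.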
Part III
Matrix computations on multi-GPU systems