- Matrix computations on systems equipped with GPUs
- Introduction
- The evolution of hardware for High Performance Computing
- The programmability issue on novel graphics architectures
- About this document. Motivation and structure
- Motivation and goals
- Structure of the document
- Description of the systems used in the experimental study
- Performance metrics
- Hardware description
- Software description
- The FLAME algorithmic notation
- The architecture of modern graphics processors
- The graphics pipeline
- Programmable pipeline stages
- The Nvidia G80 as an example of the CUDA architecture
- The architecture of modern graphics processors
- General architecture overview. Nvidia Tesla
- Memory subsystem
- The GPU as a part of a hybrid system
- Arithmetic precision. Accuracy and performance
- Present and future of GPU architectures
- Conclusions and implications on GPU computing
- BLAS on single-GPU architectures
- BLAS: Basic Linear Algebra Subprograms
- BLAS levels
- Naming conventions
- Storage schemes
- BLAS on Graphics Processors: NVIDIA CUBLAS
- Evaluation of the performance of NVIDIA CUBLAS
- Improvements in the performance of Level-3 NVIDIA CUBLAS
- gemm-based programming for the Level-3 BLAS
- Systematic development and evaluation of algorithmic variants
- Experimental results
- Impact of the block size
- Performance results for rectangular matrices
- Performance results for double precision data
- Padding
- Conclusions
- LAPACK-level routines on single-GPU architectures
- LAPACK: Linear Algebra PACKage
- LAPACK and BLAS
- Naming conventions
- Storage schemes and arguments
- LAPACK routines and organization
- Cholesky factorization
- Scalar algorithm for the Cholesky factorization
- Blocked algorithm for the Cholesky factorization
- Computing the Cholesky factorization on the GPU
- Basic implementations. Unblocked and blocked versions
- Padding
- Hybrid implementation
- LU factorization
- Scalar algorithm for the LU factorization
- Blocked algorithm for the LU factorization
- LU factorization with partial pivoting
- Computing the LU factorization with partial pivoting on the GPU
- Basic implementations. Unblocked and blocked versions
- Padding and hybrid algorithm
- Reduction to tridiagonal form on the graphics processor
- The symmetric eigenvalue problem
- Reduction to tridiagonal form. The LAPACK approach
- Reduction to tridiagonal form. The SBR approach
- Experimental Results
- Conclusions
- Matrix computations on multi-GPU systems
- Linear algebra computation on multi-GPU systems
- Programming model and runtime. Performance considerations
- Programming model
- Transfer management and spatial assignation
- Experimental results
- Impact of the block size
- Number of data transfers
- Performance and scalability
- Impact of data distribution
- Conclusions
- Matrix computations on clusters of GPUs
- Parallel computing memory architectures
- Shared memory architectures
- Distributed memory and hybrid architectures
- Accelerated hybrid architectures
- Parallel programming models. Message-passing and MPI
- ScaLAPACK
- PLAPACK
- Elemental
- Description of the PLAPACK infrastructure
- Layered approach of PLAPACK
- Usage of the PLAPACK infrastructure. Practical cases
- Porting PLAPACK to clusters of GPUs
- Experimental results
- Conclusions
- Conclusions
- Conclusions and main contributions
- Contributions for systems with one GPU
- Contributions for clusters of GPUs
- Related publications
- Publications directly related with the thesis topics
- Publications indirectly related with the thesis topics
- Other publications
- Open research lines
- FLAME algorithms for the BLAS-3 routines
CHAPTER 7
Conclusions
7.1. Conclusions and main contributions
The main goal of the dissertation is the design, development and evaluation of programming strategies to improve the performance of existing dense linear algebra libraries on systems based on graphics processors. This primary goal has been pursued under several important premises in the framework of dense linear algebra computations:
Performance: All linear algebra routines, independently of the target architecture, have attained significant speedups compared with their tuned counterparts for modern CPUs. Where possible, a comparison with tuned GPU implementations of the routines (such as NVIDIA CUBLAS in the case of the single-GPU BLAS developments) has demonstrated that the proposed implementations also attain better performance for all routines. These results validate the application of the FLAME technology, and the high-level approach to the development of new codes, not only from the programmability point of view but also from the performance perspective.
Programmability: As a second contribution of the thesis, we have applied the FLAME methodology, and a high-level development approach, to different GPU-based architectures. Remarkably, no low-level CUDA code has been written in the framework of the routines developed for single-GPU architectures, systems based on multiple GPUs, or clusters of GPUs, yet high performance was obtained. This illustrates the benefits of applying FLAME also to novel architectures, and its advantages for programmability, independently of the target architecture on which those routines are executed.
Versatility: As a third contribution, the separation of concerns introduced by FLAME (and, similarly, by PLAPACK) enables the application of strategies similar to those developed in this dissertation to virtually any accelerator-based architecture resembling those used in our evaluations. Our purpose was to detach the programmer's view from the underlying architecture. Proceeding in this manner, programming models and user's codes remain basically unmodified for accelerated and non-accelerated platforms. In addition, the concepts addressed at the architectural level are common to other types of accelerated platforms, not only GPU-based systems.
In addition to the proposed high-level methodologies and optimization techniques, as a result of the developed work we have obtained a full family of implementations for BLAS-level and LAPACK-level routines on three different GPU-based platforms: single-GPU systems, multi-GPU systems, and clusters of GPUs. Furthermore, many of the techniques described in this dissertation are general enough to be applied to other linear algebra libraries if full, optimized BLAS-level and LAPACK-level implementations are desired on present and future accelerator-based architectures.
In general, the developed work has proved that a high-level approach, such as that pursued in the framework of the FLAME project, is perfectly valid to port existing linear algebra libraries to novel architectures and to attain high performance at the same time. In addition, this type of methodology presents two main advantages. First, ease of code development: FLAME provides APIs and tools to easily perform the transition from algorithms to code [26, 126]. Second, flexibility and portability of the developed codes: similar simultaneous and independent studies have addressed low-level coding strategies for linear algebra implementations on the same target architectures as ours. Although performance results for those works are remarkable, further coding effort will be necessary if those codes need to be adapted to future accelerator-based architectures. With our methodologies, the low-level programming effort is focused on a reduced number of basic routines, on top of which the rest of the codes are developed.
Whether GPUs will remain a reference among accelerator-based platforms is not clear at this time. Manufacturers are working together on unified programming models, such as OpenCL. From the software perspective, these efforts should result in common software developments for all types of accelerator-based platforms. However, the programming effort required to tune these codes for each specific platform will remain non-negligible. In addition, the extension of the developed routines to more sophisticated platforms, such as systems with multiple accelerators or clusters of GPUs, will remain a non-trivial task. These problems will be even more severe if heterogeneous systems, with accelerators of a different nature sharing common resources, appear in response to the heterogeneity of some applications.
We have validated the proposed methodologies on three different architectures: systems with one GPU, systems with multiple GPUs, and clusters of GPUs. The general conclusions extracted from the programmability and performance points of view are common to the three architectures. However, several specific contributions have been developed for each platform. The following sections detail those specific contributions.
7.1.1. Contributions for systems with one GPU
A full set of implementations for BLAS-level routines and for some representative LAPACK-level routines has been developed and evaluated on systems with one GPU.
A complete implementation and evaluation of the main Level-3 BLAS routines has been performed. Experimental results reveal how, by taking advantage of the high performance of the GEMM implementation in NVIDIA CUBLAS, all derived BLAS routines outperform the corresponding implementations in the library from NVIDIA. The programming effort is reduced, as no low-level CUDA code is necessary. In addition, as shown, a similar methodology applies to every routine in the Level-3 BLAS; results have been validated for single and double precision, with similar qualitative improvements in performance. Following the FLAME philosophy, a systematic development and evaluation of several algorithmic variants for each routine, together with a thorough evaluation to find the optimal block size for each one, offers a complete family of tuned implementations for each BLAS routine. From those results, the optimal combination of block size and algorithmic variant has been determined.
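The gemm-based idea above can be illustrated with a short sketch. The following NumPy rendering of a blocked SYRK update (our own illustrative code, not the thesis implementation, which calls the GEMM kernel in NVIDIA CUBLAS) shows how a Level-3 routine is decomposed so that the bulk of its flops is cast in terms of GEMM on off-diagonal panels, with the block size `nb` as the tuning parameter:

```python
import numpy as np

def syrk_ln_blocked(C, A, nb=64):
    """Blocked SYRK variant (lower triangular, no transpose):
    C := C - A @ A.T, updating only the lower triangle.
    Each iteration performs one small symmetric update on the
    diagonal block and one large GEMM on the panel below it,
    which is where most of the computation lands."""
    n = A.shape[0]
    for k in range(0, n, nb):
        b = min(nb, n - k)
        Ak = A[k:k+b, :]
        # Small symmetric update of the diagonal block.
        C[k:k+b, k:k+b] -= Ak @ Ak.T
        # Off-diagonal panel: a plain GEMM carries the bulk of the flops.
        if k + b < n:
            C[k+b:, k:k+b] -= A[k+b:, :] @ Ak.T
    return C
```

Sweeping `nb` over a range of values and timing each variant mirrors the block-size study reported in the experimental chapters.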
Performance results improve those of the proprietary library from NVIDIA (NVIDIA CUBLAS), attaining remarkable speedups. Despite the efforts devoted by the company to the optimization of the library, the advantages of our high-level approach are clear: speedups between 2x and 4x are attained for all the evaluated routines in the Level-3 BLAS, with peak speedups of about 14x in specific cases (for example, rectangular matrices).
Several contributions are derived from the implementation of LAPACK-level routines on systems with one GPU. As with the BLAS implementations, a systematic derivation and evaluation of several algorithmic variants has been performed for the Cholesky factorization and the LU factorization with partial pivoting. The detailed study of the optimal block size for each variant yields remarkable speedups compared with tuned multi-threaded implementations on modern multi-core processors.
For the first time, a GPU-accelerated system is viewed as a hybrid architecture; in response to this, we propose a set of hybrid implementations in which each sub-operation of the factorizations is executed on the most suitable computing resource. Performance results for the hybrid approach improve on the basic implementations in all cases.
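The hybrid scheme can be sketched as follows. In this illustrative NumPy rendering of a blocked right-looking Cholesky factorization (not the thesis code), the small diagonal-block factorization is the kind of sub-operation mapped to the CPU, while the panel triangular solve (TRSM) and the symmetric trailing update (SYRK) are the large, regular sub-operations mapped to the GPU; here both "devices" are simulated with plain NumPy calls:

```python
import numpy as np

def hybrid_cholesky(A, nb=64):
    """Blocked right-looking Cholesky (lower). The comments mark
    which resource each sub-operation would be assigned to in a
    hybrid CPU/GPU execution; this sketch runs everything on the CPU."""
    L = np.tril(np.asarray(A, dtype=float))
    n = L.shape[0]
    for k in range(0, n, nb):
        b = min(nb, n - k)
        # CPU: factorize the small diagonal block (fine-grained work).
        L[k:k+b, k:k+b] = np.linalg.cholesky(L[k:k+b, k:k+b])
        if k + b < n:
            L11 = L[k:k+b, k:k+b]
            # "GPU": triangular system for the panel (TRSM): L21 = A21 * L11^{-T}.
            L[k+b:, k:k+b] = np.linalg.solve(L11, L[k+b:, k:k+b].T).T
            # "GPU": symmetric rank-b trailing update (SYRK).
            P = L[k+b:, k:k+b]
            L[k+b:, k+b:] -= P @ P.T
    return np.tril(L)
```

The design point is that only the small diagonal block travels to the CPU each iteration, so the extra transfers stay negligible next to the GEMM-like work left on the GPU.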
Regarding precision, a previous evaluation of the BLAS routines reveals a dramatic performance drop in modern graphics processors depending on the floating-point precision employed. To deal with this limitation, we apply a mixed-precision iterative refinement approach that combines the performance potential of modern GPUs with the accuracy of double-precision results in the solution of linear systems using the Cholesky factorization or the LU factorization.
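The refinement loop behind this approach is compact. The following sketch (an illustrative NumPy version, with a CPU Cholesky factor standing in for the fast single-precision GPU kernel) factorizes and solves in float32 but accumulates residuals and corrections in float64, recovering double-precision accuracy for well-conditioned systems:

```python
import numpy as np

def mixed_precision_solve(A, b, iters=5):
    """Solve A x = b for symmetric positive definite A via
    mixed-precision iterative refinement: the factorization and the
    triangular solves run in float32 (the fast precision on the GPUs
    studied), while residuals are formed in float64."""
    L = np.linalg.cholesky(A.astype(np.float32))   # low-precision factorization

    def solve32(r):
        # Forward and backward triangular solves, both in float32.
        y = np.linalg.solve(L, r.astype(np.float32))
        return np.linalg.solve(L.T, y).astype(np.float64)

    x = solve32(b)                                 # initial low-precision solution
    for _ in range(iters):
        r = b - A @ x                              # residual in full precision
        x += solve32(r)                            # cheap low-precision correction
    return x
```

Each refinement step costs only a matrix-vector product and two triangular solves, so nearly all the expensive O(n^3) work stays in the fast precision.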
The reduction to tridiagonal form has been implemented on systems with one GPU in order to illustrate the capabilities of the architecture for other types of LAPACK-level implementations. In this case, we have performed a comparison between the implementations in LAPACK and our approach based on the SBR toolkit, following a hybrid approach. Performance results have shown that the latter option, using the GPU to accelerate the critical parts of the algorithm, outperforms the LAPACK implementation.
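For reference, the classical one-stage reduction that LAPACK follows (the baseline against which the SBR two-stage route, symmetric to banded to tridiagonal, is compared) applies a sequence of Householder similarity transformations. An unblocked illustrative sketch, not the thesis code:

```python
import numpy as np

def tridiagonalize(A):
    """Unblocked one-stage Householder reduction of a symmetric matrix
    to tridiagonal form. Each reflector annihilates one column below
    the subdiagonal; the two-sided update preserves the eigenvalues."""
    T = np.asarray(A, dtype=float).copy()
    n = T.shape[0]
    for k in range(n - 2):
        x = T[k+1:, k].copy()
        # Sign choice avoids cancellation when forming the reflector.
        alpha = -np.sign(x[0] if x[0] != 0 else 1.0) * np.linalg.norm(x)
        v = x.copy()
        v[0] -= alpha
        nv = np.linalg.norm(v)
        if nv == 0:
            continue                     # column already reduced
        v /= nv
        H = np.eye(n - k - 1) - 2.0 * np.outer(v, v)   # Householder reflector
        T[k+1:, k+1:] = H @ T[k+1:, k+1:] @ H          # similarity update
        T[k+1:, k] = 0.0
        T[k, k+1:] = 0.0
        T[k+1, k] = alpha
        T[k, k+1] = alpha
    return T
```

The two-sided update touching the whole trailing matrix at every step is precisely the memory-bound pattern that motivates the SBR reformulation in terms of Level-3 operations.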
7.1.2. Contributions for multi-GPU systems
On systems with multiple GPUs sharing common resources, the programming effort is multiplied by the existence of several independent memory spaces, one per GPU attached to the system, plus that of main memory. Given that no direct GPU-GPU communication is possible, and that the PCI-Express bus becomes the main bottleneck as the number of GPUs increases, we propose a runtime that transparently and efficiently manages data allocation and communication between GPUs. In addition, the runtime orchestrates the execution of the routines by exploiting task-level parallelism in a manner that is transparent from the programmer's point of view.
In this sense, a second major contribution of our work is the consideration of each GPU in the system as a single processing unit, delegating further parallelization levels to ad-hoc GPU implementations, such as NVIDIA CUBLAS.
Performance results for a set of representative BLAS routines and for the Cholesky factorization reveal the benefits of our approach. We propose a set of improvements to boost performance: treating each GPU memory as a software-managed cache of recently used blocks, and applying well-known techniques from computer architecture, such as a write-back policy that allows temporary inconsistencies of data blocks between main memory and GPU memory, and a write-invalidate policy to keep data blocks coherent among the different memory spaces. The ultimate goal of these policies is to reduce the number of data movements between the accelerators and main memory. These techniques have demonstrated their benefits for all the evaluated operations.
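The combined effect of these policies can be modeled with a toy bookkeeping class (purely illustrative; names and structure are our own, not the runtime's API). Each GPU memory acts as a cache of data blocks: a read copies a block over the bus only on a miss, a write keeps the block dirty on the writing GPU (write-back) and invalidates copies elsewhere (write-invalidate), so transfers happen only when a block is genuinely needed on another device:

```python
class BlockCache:
    """Toy coherence model for data blocks cached in per-GPU memories.
    Counts bus transfers to show how write-back plus write-invalidate
    suppress redundant traffic over PCI-Express."""

    def __init__(self):
        self.owners = {}      # block id -> set of GPUs holding a valid copy
        self.dirty = {}       # block id -> GPU holding the only up-to-date copy
        self.transfers = 0    # number of block movements over the bus

    def read(self, block, gpu):
        holders = self.owners.setdefault(block, set())
        if gpu not in holders:            # miss: block must travel over the bus
            if self.dirty.get(block) is not None:
                self.transfers += 1       # flush the dirty copy back to host first
                self.dirty.pop(block)
            self.transfers += 1           # fetch the block into this GPU's cache
            holders.add(gpu)

    def write(self, block, gpu):
        self.read(block, gpu)             # ensure a valid local copy
        self.owners[block] = {gpu}        # write-invalidate: drop other copies
        self.dirty[block] = gpu           # write-back: defer the flush to host
```

Repeated reads of a cached block cost nothing, and a written block is flushed only when some other device actually requests it.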
The proposed approach is fundamentally different from that of the MAGMA project. First, the integration of our methodology into libflame provides a straightforward, full port of all the