- Matrix computations on systems equipped with GPUs
- Introduction
- The evolution of hardware for High Performance Computing
- The programmability issue on novel graphics architectures
- About this document. Motivation and structure
- Motivation and goals
- Structure of the document
- Description of the systems used in the experimental study
- Performance metrics
- Hardware description
- Software description
- The FLAME algorithmic notation
- The architecture of modern graphics processors
- The graphics pipeline
- Programmable pipeline stages
- The Nvidia G80 as an example of the CUDA architecture
- The architecture of modern graphics processors
- General architecture overview. Nvidia Tesla
- Memory subsystem
- The GPU as a part of a hybrid system
- Arithmetic precision. Accuracy and performance
- Present and future of GPU architectures
- Conclusions and implications on GPU computing
- BLAS on single-GPU architectures
- BLAS: Basic Linear Algebra Subprograms
- BLAS levels
- Naming conventions
- Storage schemes
- BLAS on Graphics Processors: NVIDIA CUBLAS
- Evaluation of the performance of NVIDIA CUBLAS
- Improvements in the performance of Level-3 NVIDIA CUBLAS
- gemm-based programming for the Level-3 BLAS
- Systematic development and evaluation of algorithmic variants
- Experimental results
- Impact of the block size
- Performance results for rectangular matrices
- Performance results for double precision data
- Padding
- Conclusions
- LAPACK-level routines on single-GPU architectures
- LAPACK: Linear Algebra PACKage
- LAPACK and BLAS
- Naming conventions
- Storage schemes and arguments
- LAPACK routines and organization
- Cholesky factorization
- Scalar algorithm for the Cholesky factorization
- Blocked algorithm for the Cholesky factorization
- Computing the Cholesky factorization on the GPU
- Basic implementations. Unblocked and blocked versions
- Padding
- Hybrid implementation
- LU factorization
- Scalar algorithm for the LU factorization
- Blocked algorithm for the LU factorization
- LU factorization with partial pivoting
- Computing the LU factorization with partial pivoting on the GPU
- Basic implementations. Unblocked and blocked versions
- Padding and hybrid algorithm
- Reduction to tridiagonal form on the graphics processor
- The symmetric eigenvalue problem
- Reduction to tridiagonal form. The LAPACK approach
- Reduction to tridiagonal form. The SBR approach
- Experimental results
- Conclusions
- Matrix computations on multi-GPU systems
- Linear algebra computation on multi-GPU systems
- Programming model and runtime. Performance considerations
- Programming model
- Transfer management and spatial assignment
- Experimental results
- Impact of the block size
- Number of data transfers
- Performance and scalability
- Impact of data distribution
- Conclusions
- Matrix computations on clusters of GPUs
- Parallel computing memory architectures
- Shared memory architectures
- Distributed memory and hybrid architectures
- Accelerated hybrid architectures
- Parallel programming models. Message-passing and MPI
- ScaLAPACK
- PLAPACK
- Elemental
- Description of the PLAPACK infrastructure
- Layered approach of PLAPACK
- Usage of the PLAPACK infrastructure. Practical cases
- Porting PLAPACK to clusters of GPUs
- Experimental results
- Conclusions
- Conclusions
- Conclusions and main contributions
- Contributions for systems with one GPU
- Contributions for clusters of GPUs
- Related publications
- Publications directly related to the thesis topics
- Publications indirectly related to the thesis topics
- Other publications
- Open research lines
- FLAME algorithms for the BLAS-3 routines
CHAPTER 6
Matrix computations on clusters of GPUs
In the previous chapter, we demonstrated how multi-GPU systems can be efficiently exploited to attain high performance without major changes from the programmability point of view. However, the scalability of this type of platform is a problem with no easy solution in the near future. The main bottleneck remains the PCI-Express bus: systems with up to four GPUs attached to the same PCI-Express bus are relatively common nowadays, but attaching a larger number of GPUs incurs a serious penalty in data transfers with current technology.
To address this problem, clusters with a reduced number of hardware accelerators attached to each node seem an effective answer to the performance demands of large-scale HPC applications. As the performance of interconnection networks (e.g., InfiniBand) improves, the gap between them and the PCI-Express bus narrows, and they become comparable in bandwidth and latency. Thus, the overhead introduced by the use of distributed memory can be masked by the penalty already induced by the use of the PCI-Express bus.
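As a rough illustration of why these two links become comparable, the following sketch estimates the time to move an n-by-n double-precision matrix over each link. The peak-bandwidth figures are illustrative assumptions typical of the hardware generation discussed here (PCI-Express 2.0 x16 and InfiniBand QDR), not measurements from this thesis:

```python
def transfer_time(n, bandwidth_gbs):
    """Time (in seconds) to move an n x n double-precision matrix
    over a link with the given bandwidth (in GB/s)."""
    bytes_moved = n * n * 8  # 8 bytes per double-precision element
    return bytes_moved / (bandwidth_gbs * 1e9)

# Assumed peak figures (illustrative, not measured):
PCIE_GBS = 8.0    # PCI-Express 2.0 x16
IB_QDR_GBS = 4.0  # InfiniBand QDR, one link

n = 10_000
t_pcie = transfer_time(n, PCIE_GBS)
t_ib = transfer_time(n, IB_QDR_GBS)
print(f"PCIe: {t_pcie:.3f} s, InfiniBand: {t_ib:.3f} s, "
      f"ratio: {t_ib / t_pcie:.1f}x")
```

Under these assumptions the network transfer is within a small constant factor of the host-to-GPU transfer, which is why the distributed-memory overhead no longer dominates once every piece of data must cross the PCI-Express bus anyway.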
From the software point of view, as of today there are no dense linear algebra libraries adapted to nodes extended with hardware accelerators (e.g., GPUs). In this chapter, we propose an extension of the well-known PLAPACK library that adapts it to clusters of GPUs. The selection of this library is based on its modular and layered design and, in line with the programmability goals stated throughout the dissertation, on its high-level approach from the developer's point of view. We propose different techniques to improve performance and reduce data transfers, and show experimental results for several common dense linear algebra operations from BLAS and LAPACK on a large GPU cluster.
The chapter is structured as follows. Sections 6.1 and 6.2 introduce the basic concepts behind distributed-memory architectures and message-passing programming, respectively, which will be useful in the rest of the chapter. In Section 6.3 we offer a review of the most widespread libraries for dense linear algebra computation on distributed-memory architectures; this overview covers both present and forthcoming libraries. In Section 6.4 we describe the layered structure of the PLAPACK library. Section 6.5 details the process and the design decisions taken to port PLAPACK to clusters with GPUs. Experimental results on a large GPU cluster are given in Section 6.6. Finally, Section 6.7 summarizes the main contributions of the chapter.