CHAPTER 4
LAPACK-level routines on single-GPU architectures
The optimization of BLAS-3 routines on graphics processors naturally leads to a direct optimization of the higher-level libraries built on top of them, such as LAPACK (Linear Algebra PACKage). However, given the complexity of the routines in this type of library, other strategies can be applied to further improve their performance.
In the case of GPU-based implementations, the optimizations applied to the BLAS routines in Chapter 3 can have a direct impact on LAPACK-level implementations, but we advocate alternative strategies to gain further insight and improve the performance of those implementations by using the GPU as an accelerating co-processor.
In this chapter, we propose a set of improved GPU-based implementations of some representative and widely used LAPACK-level routines for matrix decompositions. New implementations of the Cholesky and LU (with partial pivoting) decompositions, and of the reduction to tridiagonal form, are proposed and evaluated in depth.
In addition, a systematic evaluation of algorithmic variants, similar to that presented in the previous chapter for the BLAS, is performed for the LAPACK-level routines, together with a set of techniques to boost performance. One of the most innovative techniques introduced in this chapter is the treatment of the GPU as an accelerating co-processor, not merely as an isolated functional unit as in the previous chapter. Thus, hybrid, collaborative approaches are proposed in which each operation is performed on the most suitable architecture, depending on the particular characteristics of the task.
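To make the hybrid approach concrete, the following sketch outlines a blocked right-looking Cholesky factorization in which the small, strongly sequential diagonal-block factorization is performed on the CPU with LAPACK, while the BLAS-3 trailing updates run on the GPU through CUBLAS. This is illustrative code, not the actual implementation evaluated in this chapter: the routine name hybrid_spotrf is hypothetical, and the modern cublas_v2 interface is used for concreteness.

```c
#include <cuda_runtime.h>
#include <cublas_v2.h>

/* LAPACK single-precision Cholesky factorization (Fortran interface). */
extern void spotrf_(const char *uplo, const int *n, float *A,
                    const int *lda, int *info);

/* Factor the n x n SPD matrix dA (device pointer, column-major, leading
 * dimension ldda) as A = L * L^T (lower triangular), with block size nb. */
int hybrid_spotrf(cublasHandle_t h, int n, float *dA, int ldda, int nb)
{
    const float one = 1.0f, mone = -1.0f;
    float *hblk;                       /* pinned host buffer, diagonal block */
    cudaMallocHost((void **)&hblk, (size_t)nb * nb * sizeof(float));

    for (int k = 0; k < n; k += nb) {
        int b = (n - k < nb) ? n - k : nb, info = 0;

        /* 1. Bring A(k,k) to the CPU and factor it there. */
        cublasGetMatrix(b, b, sizeof(float),
                        dA + k + (size_t)k * ldda, ldda, hblk, b);
        spotrf_("L", &b, hblk, &b, &info);
        if (info != 0) { cudaFreeHost(hblk); return k + info; }
        cublasSetMatrix(b, b, sizeof(float), hblk, b,
                        dA + k + (size_t)k * ldda, ldda);

        if (k + b < n) {
            int m = n - k - b;
            /* 2. Panel update on the GPU: A21 := A21 * L11^{-T}. */
            cublasStrsm(h, CUBLAS_SIDE_RIGHT, CUBLAS_FILL_MODE_LOWER,
                        CUBLAS_OP_T, CUBLAS_DIAG_NON_UNIT, m, b,
                        &one, dA + k + (size_t)k * ldda, ldda,
                        dA + (k + b) + (size_t)k * ldda, ldda);
            /* 3. Trailing update on the GPU: A22 := A22 - A21 * A21^T. */
            cublasSsyrk(h, CUBLAS_FILL_MODE_LOWER, CUBLAS_OP_N, m, b,
                        &mone, dA + (k + b) + (size_t)k * ldda, ldda,
                        &one, dA + (k + b) + (size_t)(k + b) * ldda, ldda);
        }
    }
    cudaFreeHost(hblk);
    return 0;
}
```

Note that only the nb x nb diagonal block crosses the PCI-Express bus at each iteration, while all the BLAS-3 work, which dominates the cost, remains on the GPU.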
Single- and double-precision results are presented for the new implementations and, as a novelty, a mixed-precision iterative-refinement approach for the solution of systems of linear equations is presented and validated. The goal of this technique is to exploit the higher performance delivered by modern graphics processors when operating in single-precision arithmetic, while retaining full accuracy in the solution of the system.
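The idea behind mixed-precision iterative refinement can be summarized with the following simplified, CPU-side sketch; the variant studied in this chapter offloads the single-precision factorization and triangular solves to the GPU. The expensive O(n^3) LU factorization is performed once in fast single precision, and double-precision accuracy is then recovered through cheap O(n^2) residual corrections. The routine name mixed_refine and the convergence test are illustrative assumptions.

```c
#include <stdlib.h>
#include <math.h>

/* LAPACK single-precision LU factorization and triangular solves. */
extern void sgetrf_(const int *m, const int *n, float *A, const int *lda,
                    int *ipiv, int *info);
extern void sgetrs_(const char *trans, const int *n, const int *nrhs,
                    const float *A, const int *lda, const int *ipiv,
                    float *b, const int *ldb, int *info);

/* Solve A x = b (A: n x n, column-major, double precision) to double
 * accuracy using a single-precision factorization. */
int mixed_refine(int n, const double *A, const double *b, double *x,
                 int maxit, double tol)
{
    int info, one = 1, *ipiv = malloc(n * sizeof(int));
    float *As = malloc((size_t)n * n * sizeof(float));
    float *rs = malloc(n * sizeof(float));
    double *r = malloc(n * sizeof(double));

    for (size_t i = 0; i < (size_t)n * n; i++)
        As[i] = (float)A[i];                 /* demote A once */
    sgetrf_(&n, &n, As, &n, ipiv, &info);    /* O(n^3), single precision */

    for (int i = 0; i < n; i++) x[i] = 0.0;
    for (int it = 0; info == 0 && it < maxit; it++) {
        double nrm = 0.0;
        for (int i = 0; i < n; i++) {        /* r = b - A*x, in double */
            double s = b[i];
            for (int j = 0; j < n; j++) s -= A[i + (size_t)j * n] * x[j];
            r[i] = s;
            nrm = fmax(nrm, fabs(s));
        }
        if (nrm < tol) break;                /* converged to double accuracy */
        for (int i = 0; i < n; i++) rs[i] = (float)r[i];
        sgetrs_("N", &n, &one, As, &n, ipiv, rs, &n, &info);  /* O(n^2) */
        for (int i = 0; i < n; i++) x[i] += (double)rs[i];    /* correct x */
    }
    free(ipiv); free(As); free(rs); free(r);
    return info;
}
```

Provided A is not too ill-conditioned (roughly, its condition number must remain well below the inverse of the single-precision unit roundoff), a handful of iterations suffices, so the overall cost is dominated by the fast single-precision factorization.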
As a result, a full family of implementations for widely used LAPACK-level routines is presented, which attains significant speedups compared with optimized, multi-threaded implementations on modern general-purpose multi-core processors.
The chapter is organized as follows. Section 4.1 surveys the nomenclature and the most important routines in the LAPACK library. Sections 4.2 and 4.3 introduce the theory underlying the Cholesky and LU factorizations, and the approaches and optimizations taken to implement them on the graphics processor.