- Matrix computations on systems equipped with GPUs
- Introduction
- The evolution of hardware for High Performance Computing
- The programmability issue on novel graphics architectures
- About this document. Motivation and structure
- Motivation and goals
- Structure of the document
- Description of the systems used in the experimental study
- Performance metrics
- Hardware description
- Software description
- The FLAME algorithmic notation
- The architecture of modern graphics processors
- The graphics pipeline
- Programmable pipeline stages
- The Nvidia G80 as an example of the CUDA architecture
- The architecture of modern graphics processors
- General architecture overview. Nvidia Tesla
- Memory subsystem
- The GPU as a part of a hybrid system
- Arithmetic precision. Accuracy and performance
- Present and future of GPU architectures
- Conclusions and implications on GPU computing
- BLAS on single-GPU architectures
- BLAS: Basic Linear Algebra Subprograms
- BLAS levels
- Naming conventions
- Storage schemes
- BLAS on Graphics Processors: NVIDIA CUBLAS
- Evaluation of the performance of NVIDIA CUBLAS
- Improvements in the performance of Level-3 NVIDIA CUBLAS
- gemm-based programming for the Level-3 BLAS
- Systematic development and evaluation of algorithmic variants
- Experimental results
- Impact of the block size
- Performance results for rectangular matrices
- Performance results for double precision data
- Padding
- Conclusions
- LAPACK-level routines on single-GPU architectures
- LAPACK: Linear Algebra PACKage
- LAPACK and BLAS
- Naming conventions
- Storage schemes and arguments
- LAPACK routines and organization
- Cholesky factorization
- Scalar algorithm for the Cholesky factorization
- Blocked algorithm for the Cholesky factorization
- Computing the Cholesky factorization on the GPU
- Basic implementations. Unblocked and blocked versions
- Padding
- Hybrid implementation
- LU factorization
- Scalar algorithm for the LU factorization
- Blocked algorithm for the LU factorization
- LU factorization with partial pivoting
- Computing the LU factorization with partial pivoting on the GPU
- Basic implementations. Unblocked and blocked versions
- Padding and hybrid algorithm
- Reduction to tridiagonal form on the graphics processor
- The symmetric eigenvalue problem
- Reduction to tridiagonal form. The LAPACK approach
- Reduction to tridiagonal form. The SBR approach
- Experimental Results
- Conclusions
- Matrix computations on multi-GPU systems
- Linear algebra computation on multi-GPU systems
- Programming model and runtime. Performance considerations
- Programming model
- Transfer management and spatial assignation
- Experimental results
- Impact of the block size
- Number of data transfers
- Performance and scalability
- Impact of data distribution
- Conclusions
- Matrix computations on clusters of GPUs
- Parallel computing memory architectures
- Shared memory architectures
- Distributed memory and hybrid architectures
- Accelerated hybrid architectures
- Parallel programming models. Message-passing and MPI
- ScaLAPACK
- PLAPACK
- Elemental
- Description of the PLAPACK infrastructure
- Layered approach of PLAPACK
- Usage of the PLAPACK infrastructure. Practical cases
- Porting PLAPACK to clusters of GPUs
- Experimental results
- Conclusions
- Conclusions
- Conclusions and main contributions
- Contributions for systems with one GPU
- Contributions for clusters of GPUs
- Related publications
- Publications directly related with the thesis topics
- Publications indirectly related with the thesis topics
- Other publications
- Open research lines
- FLAME algorithms for the BLAS-3 routines
CHAPTER 1. MATRIX COMPUTATIONS ON SYSTEMS EQUIPPED WITH GPUS
Each chapter of this document presents the work developed for the corresponding architecture, together with the experimental results attained on it. In this sense, each part of the document is self-contained and can be read independently.
Finally, Chapter 7 presents the main conclusions of this research. In addition, it reports the main contributions of the thesis, the publications that have been generated, and the technological transfer activities derived from it. To close, a few open research lines related to the work are discussed.
1.3. Description of the systems used in the experimental study
1.3.1. Performance metrics
The fundamental metric for the performance (or efficiency) evaluation of an application is its execution time. However, for codes dominated by floating-point arithmetic operations, as is the case of linear algebra operations, other metrics are often employed to evaluate the pace at which these operations are performed. More precisely, a flop is defined as a single floating-point arithmetic operation. Thus, the execution speed of a linear algebra code is usually given in terms of MFLOPS (10^6 flops/s), GFLOPS (10^9 flops/s), or even TFLOPS (10^12 flops/s). Although the FLOPS rate is a metric derived from the execution time, the arithmetic processing speed (flops/s) presents a clear advantage for the graphical representation of performance data. Specifically, as the problem size increases, the execution time of codes for common dense linear algebra operations also grows, often at a cubic pace. The FLOPS rate, in contrast, is bounded by the configuration and speed of the hardware (cycle time, number of functional units, cache transfer rate, bus speed, etc.). Thus, charts representing the FLOPS rate exhibit an upper bound that makes them much easier to display and analyze.
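As a minimal illustration of how such a rate is derived from the execution time (the function name and the example figures are ours, not taken from the text), the GFLOPS rate of a dense matrix-matrix product, which performs roughly 2n^3 flops for square matrices of order n, could be computed as:

```python
def gflops_rate(n, seconds):
    """GFLOPS rate of an n x n dense matrix-matrix product (gemm).

    gemm performs roughly 2*n^3 floating-point operations;
    1 GFLOPS = 10^9 flops/s.
    """
    flops = 2.0 * n**3
    return flops / seconds / 1e9

# e.g., a 4096 x 4096 product completed in 2 seconds:
# gflops_rate(4096, 2.0) -> about 68.7 GFLOPS
```

Note that, as stated above, the rate is bounded by the hardware peak regardless of the problem size, which is what makes FLOPS charts easy to compare.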
Although there exist widely used metrics for parallel codes, such as speed-up or efficiency (also derived from the execution time), which apply to the GPU implementations as well, we advocate here for homogeneity in the representations and will mostly measure parallel performance in terms of FLOPS. Nevertheless, other metrics will be introduced whenever they are necessary to correctly illustrate parallel performance.
1.3.2.Hardware description
Three different systems have been used in the evaluation of the implementations presented in the following chapters. These systems are representative of the different multi-core architectures available nowadays and, simultaneously, they illustrate how multiple hardware accelerators (in this case, GPUs) can be attached to a single system or to a cluster of compute nodes to boost performance.
PECO is a cluster of four nodes interconnected via an InfiniBand QDR network. Each node contains two Intel Xeon 5520 (Nehalem) quad-core processors running at 2.27 GHz, with 24 Gbytes of DDR2 RAM. Attached to the PCI-Express 2.0 bus of each node there is an NVIDIA Tesla C1060 GPU with 4 Gbytes of GDDR3 RAM. One of the nodes of this machine will be used for the evaluation of the BLAS and LAPACK-level routines in Chapters 3 and 4.
TESLA2 is a shared-memory multiprocessor based on Intel Xeon technology. It is composed of two Intel Xeon 5440 (Harpertown) quad-core processors running at 2.83 GHz, with 16 Gbytes of DDR2 RAM. Attached to the PCI-Express 2.0 bus, there is an NVIDIA Tesla S1070 system consisting of four NVIDIA Tesla C1060 GPUs identical to those present in each