CHAPTER 3
BLAS on single-GPU architectures
The Basic Linear Algebra Subprograms (BLAS) are the fundamental building blocks for the development of complex dense linear algebra applications. In this chapter, NVIDIA's implementation of the Level-3 BLAS specification (CUBLAS) is evaluated. The major contribution, though, is the design and evaluation of new, faster implementations of the main Level-3 BLAS routines. The aim of these new implementations is twofold: first, to improve the performance of the existing BLAS implementations for graphics processors; second, to illustrate a methodology to systematically evaluate the parameters that are crucial to attain high performance. To achieve these goals, a set of algorithmic variants that benefit from a reduced number of existing high-performance BLAS kernels is presented, together with a detailed evaluation of the performance of those new implementations.
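As an illustration of this gemm-based approach, the following sketch casts the symmetric rank-k update (syrk) as a sequence of calls to existing CUBLAS kernels: only the small diagonal blocks are computed with the native symmetric update, while the bulk of the floating-point operations is shifted to the general matrix-matrix product, where CUBLAS performs best. This is a minimal sketch, not the code developed in this chapter; it uses the modern CUBLAS v2 API, and the routine name gemm_based_ssyrk and the block size parameter nb are illustrative.

    /* Sketch of a gemm-based SSYRK: C := alpha*A*A' + beta*C, with the
     * lower triangular part of the n x n matrix C updated, A of size n x k.
     * Column-major storage, as in (CU)BLAS. The block size nb is tunable. */
    #include <cublas_v2.h>

    void gemm_based_ssyrk(cublasHandle_t handle, int n, int k,
                          float alpha, const float *A, int lda,
                          float beta, float *C, int ldc, int nb)
    {
        for (int i = 0; i < n; i += nb) {
            int ib = (n - i < nb) ? (n - i) : nb;

            /* Diagonal block C(i:i+ib, i:i+ib): the only part that truly
             * requires a symmetric update; a small fraction of the flops. */
            cublasSsyrk(handle, CUBLAS_FILL_MODE_LOWER, CUBLAS_OP_N,
                        ib, k, &alpha, A + i, lda,
                        &beta, C + i * ldc + i, ldc);

            /* Off-diagonal panel C(i+ib:n, i:i+ib): cast as a general
             * matrix-matrix product, the highest-performing CUBLAS kernel. */
            int m = n - i - ib;
            if (m > 0)
                cublasSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_T,
                            m, ib, k, &alpha, A + i + ib, lda, A + i, lda,
                            &beta, C + i * ldc + i + ib, ldc);
        }
    }

The same partitioning idea extends naturally to the remaining Level-3 routines, which is what makes a homogeneous performance across the whole level attainable without rewriting any low-level kernel.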
As a result, our new implementations attain remarkable speedups compared to those in NVIDIA CUBLAS. Furthermore, they deliver homogeneous performance across all Level-3 BLAS routines. In addition, we demonstrate how, by systematically applying a set of high-level methodologies, it is possible to obtain high-performance implementations of all Level-3 BLAS routines for graphics processors without any low-level coding effort. These homogeneous performance rates differ from those attained with the NVIDIA CUBLAS implementation, which only reaches high performance for a selected group of BLAS-3 routines (namely, the general matrix-matrix multiplication, and only for a restricted set of particular cases).
Although the conclusions extracted from the evaluation of these alternative implementations can be directly applied to low-level codes, the routines developed here build on an existing BLAS implementation for graphics processors, which improves portability and programmability. Given the large impact of the performance of the Level-3 BLAS on higher-level linear algebra libraries, and the potential of graphics processors on routines with high arithmetic intensity, our optimizations exclusively address this BLAS level.
The chapter is structured as follows. Section 3.1 describes the basic concepts and nomenclature behind the BLAS specification. Section 3.2 presents a full evaluation of the Level-3 BLAS implementations in NVIDIA CUBLAS, comparing the results with those attained by a highly tuned library on a current general-purpose multi-core processor. Section 3.3 presents a variety of techniques to tune the performance of those implementations; the corresponding performance results are reported in Section 3.4. Section 3.5 summarizes the main contributions of the chapter.