
CHAPTER 3

BLAS on single-GPU architectures

The Basic Linear Algebra Subprograms (BLAS) are the fundamental building blocks for the development of complex dense linear algebra applications. In this chapter, the implementation of the Level-3 BLAS specification from NVIDIA (CUBLAS) is evaluated. The major contribution, though, is the design and evaluation of a new, faster implementation of the main Level-3 BLAS routines. The aim of these new implementations is twofold: first, to improve the performance of the existing BLAS implementations for graphics processors; second, to illustrate a methodology for systematically evaluating a number of parameters that are crucial to attaining high performance. To achieve these goals, a set of algorithmic variants that exploit a reduced number of existing high-performance BLAS kernels is presented, together with a detailed evaluation of the performance of those new implementations.

As a result, our new implementations attain remarkable speedups compared to those in NVIDIA CUBLAS. Furthermore, they show a homogeneous performance for all Level-3 BLAS routines. In addition, we demonstrate how, by systematically applying a set of high-level methodologies, it is possible to obtain high-performance implementations of all Level-3 BLAS routines for graphics processors without the necessity of any low-level coding effort. These homogeneous performance rates differ from those attained with the NVIDIA CUBLAS implementation, which only reaches high performance for a selected group of BLAS-3 routines (namely, the general matrix-matrix multiplication for a restricted set of particular cases).

Although the conclusions extracted from the evaluation of these alternative implementations can be directly applied to low-level codes, the developed routines are built on an existing BLAS implementation for graphics processors, which improves portability and programmability. Given the large impact of the performance of the Level-3 BLAS implementations on higher-level linear algebra libraries, and the potential performance of graphics processors on routines with high arithmetic intensity, our optimizations will exclusively address this BLAS level.

The chapter is organized as follows. Section 3.1 describes the basic concepts and nomenclature behind the BLAS specification. Section 3.2 presents a full evaluation of the Level-3 BLAS routine implementations in NVIDIA CUBLAS, comparing the results with those attained with a highly tuned library on a current general-purpose multi-core processor. Section 3.3 presents a variety of techniques to tune the performance of those implementations; the attained performance results are reported in Section 3.4. Section 3.5 summarizes the main contributions of the chapter.

