CHAPTER 4. LAPACK-LEVEL ROUTINES ON SINGLE-GPU ARCHITECTURES

processor, respectively. Sections 4.4 and 4.5 overview the theory and algorithms necessary to compute the LU factorization with partial pivoting and its implementations on the GPU, respectively. Section 4.6 introduces the iterative refinement strategy for the solution of linear systems, reporting experimental results and validating it in the framework of GPU computing. Section 4.7 proposes a set of techniques to improve the performance of the LAPACK routines for the reduction to tridiagonal form, building on the SBR package. Finally, Section 4.8 summarizes the main contributions of the chapter.

All experiments presented throughout the chapter were carried out using a single GPU (TESLA C1060) on PECO. The specific hardware and software details of the experimental setup were presented in Section 1.3.2.

4.1. LAPACK: Linear Algebra PACKage

The optimization of the routines in BLAS makes real sense when they are employed as the basic building blocks of libraries that perform more complex linear algebra operations. One of the most widely used libraries of this kind is LAPACK [108].

LAPACK offers routines to solve fundamental linear algebra problems, and embodies the state-of-the-art in numerical methods. In the same direction as BLAS, LAPACK offers support for both dense and band matrices but, while BLAS focuses on basic linear algebra operations, LAPACK solves more complex problems such as linear systems of equations, least-squares problems, or eigenvalue and singular value problems.

LAPACK was created as the result of a project started at the end of the 1980s. The main goal was to obtain a library with the same functionality as LINPACK [56] and EISPACK [107], but with improved performance. Those libraries, designed for vector processors, do not offer reasonable performance on current high-performance processors, with segmented pipelines and complex memory hierarchies. This inefficiency is mainly due to their use of BLAS-1 routines, which do not exploit locality of reference, resulting in a sub-optimal usage of the memory subsystem. As a consequence, the offered routines devote most of their time to data movements from/to memory, with the subsequent penalty on performance. For the same reasons that made the development of higher levels of BLAS necessary, the significant gap between memory and processor speeds in current architectures required the redesign of existing libraries to optimize the number of memory accesses and even the access patterns [37].

The performance boost attained by the routines in LAPACK is due to two main reasons:

The integration into the library of new algorithms that did not exist when older libraries were designed.

The redesign of existing and new algorithms to make an efficient use of the BLAS library.

The legacy implementation of LAPACK is in the public domain [108], and includes drivers to test and time the routines. As with BLAS, some hardware manufacturers have implemented specific versions of LAPACK tuned for their architectures; however, the improvements introduced in these implementations are usually not as significant as those introduced in BLAS. As the performance of many routines relies on that of the underlying BLAS implementation, high-level modifications, such as the tuning of the block size, are the most common tweaks implemented in the proprietary versions.

For multi-core architectures, LAPACK extracts the parallelism by invoking a parallel version of BLAS. That is, the routines in LAPACK do not include any kind of explicit parallelism in their codes, but use a multi-threaded implementation of BLAS.


4.1.1. LAPACK and BLAS

The BLAS library offers a collection of efficient routines for basic linear algebra operations, and for this reason those routines are used by both LINPACK and LAPACK. The former internally uses BLAS-1 routines but, as these operate on scalars and vectors, they do not allow a proper parallelization or an efficient use of the memory hierarchy. LAPACK uses routines from all BLAS levels, mostly from the Level-3 BLAS, because:

They yield high efficiency on machines with hierarchical memory, as they are not limited by the slow data transfer rates between CPU and memory.

They implement operations with a larger computational workload, making them good candidates for an efficient parallelization.

The possibility of using BLAS-3 routines led to a redesign of the algorithms implemented in LAPACK to work with blocks or sub-matrices [59, 55]. By using blocked algorithms, many operations can be cast in terms of the most efficient routines in BLAS, and the possibilities of parallelization are improved in two ways: the internal parallelization of individual block operations and the possibility of processing several blocks in parallel [51].

Although LAPACK contains BLAS-3 based routines, in some cases there also exist alternative implementations based on BLAS-1 and BLAS-2, which can be used when unblocked codes are needed. For example, the Cholesky factorization of a dense symmetric positive definite matrix is implemented as a blocked routine (named POTRF). At each iteration, this routine performs the Cholesky factorization of the current diagonal block using an unblocked version of the Cholesky factorization routine (named POTF2).
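To make the structure of such a blocked routine concrete, the following sketch shows a right-looking blocked Cholesky factorization (lower triangular case) written against the C interfaces to BLAS and LAPACK. It is a minimal illustration of the POTRF/POTF2 interplay described above, not LAPACK's actual code: the block size nb is treated as an arbitrary tunable parameter, column-major storage is assumed, and the function name chol_blocked_lower is invented for this example.

/* Minimal sketch (not LAPACK's implementation) of a blocked, right-looking
 * Cholesky factorization of the lower triangle of an n x n symmetric
 * positive definite matrix A, stored in column-major order.  The diagonal
 * block is factored with the unblocked kernel (POTF2) and the trailing
 * sub-matrix is updated with Level-3 BLAS (TRSM, SYRK). */
#include <cblas.h>
#include <lapacke.h>

int chol_blocked_lower(int n, double *A, int lda, int nb)
{
    for (int k = 0; k < n; k += nb) {
        int b = (n - k < nb) ? (n - k) : nb;       /* current block size */
        double *Akk = A + k + (size_t)k * lda;     /* diagonal block     */

        /* Factor the diagonal block with the unblocked routine (POTF2). */
        int info = LAPACKE_dpotf2(LAPACK_COL_MAJOR, 'L', b, Akk, lda);
        if (info != 0)
            return k + info;                       /* not positive definite */

        if (k + b < n) {
            int m = n - k - b;
            double *A21 = A + (k + b) + (size_t)k * lda;        /* panel below block   */
            double *A22 = A + (k + b) + (size_t)(k + b) * lda;  /* trailing sub-matrix */

            /* A21 := A21 * inv(L11)^T  -- triangular solve (BLAS-3 TRSM). */
            cblas_dtrsm(CblasColMajor, CblasRight, CblasLower, CblasTrans,
                        CblasNonUnit, m, b, 1.0, Akk, lda, A21, lda);

            /* A22 := A22 - A21 * A21^T  -- symmetric rank-b update (BLAS-3 SYRK). */
            cblas_dsyrk(CblasColMajor, CblasLower, CblasNoTrans,
                        m, b, -1.0, A21, lda, 1.0, A22, lda);
        }
    }
    return 0;
}

Virtually all of the flops in this formulation are performed by the two BLAS-3 calls, which is precisely what makes the blocked variant attractive.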

The internal usage of BLAS offers other advantages:

Portability and efficiency. There exist different BLAS implementations developed for several architectures, often with dramatically different characteristics. These specific implementations allow a migration of the applications without modifications to the code, while attaining high performance. This efficiency can be expected to hold for future architectures, as hardware manufacturers are usually the first interested in delivering highly tuned BLAS implementations as soon as new products are released.

Code legibility. The BLAS routines are well-known and their names clearly specify their functionality.

Parallelism. Multi-threaded versions of BLAS allow a direct parallelization of the routines in LAPACK that internally invoke them.

In summary, the efficiency of LAPACK depends mainly on two factors: first, the efficiency of the underlying BLAS implementation; second, the fraction of the flops that is performed in terms of BLAS-3 routines.
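As a rough illustration of the second factor (an estimate for the blocked Cholesky factorization sketched above, not a figure measured in this work): only the nb x nb diagonal blocks are factored with unblocked code, which accounts for roughly (n/nb)*(nb^3/3) = n*nb^2/3 of the n^3/3 total flops. The fraction of flops executed by BLAS-3 kernels is therefore approximately 1 - (nb/n)^2, which quickly approaches 1 when n is much larger than nb, explaining why blocked routines inherit most of the BLAS-3 performance.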

4.1.2. Naming conventions

The name of each LAPACK routine describes its functionality. Similarly to BLAS, the names of LAPACK routines follow the format XYYZZZ, where:

X: Indicates the data type involved in the routine: single-precision real data (S), double-precision real data (D), single-precision complex data (C), and double-precision complex data (Z).


YY: Indicates the matrix type (or the type of the most significant matrix).

ZZZ: Indicates the specific operation to be performed, e.g., QRF for the QR factorization (see the short example after this list).
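As a concrete illustration of the convention, the following sketch decodes two routine names through the LAPACKE C interface; the matrix data, dimensions, and variable names are placeholders invented for this example, not values taken from this work.

#include <lapacke.h>

int main(void)
{
    /* DGEQRF: D = double-precision real, GE = general matrix,
     *         QRF = QR factorization. */
    double A[6] = {1, 2, 3, 4, 5, 6};        /* 3x2 matrix, column-major */
    double tau[2];
    lapack_int info_qr =
        LAPACKE_dgeqrf(LAPACK_COL_MAJOR, 3, 2, A, 3, tau);

    /* SPOTRF: S = single-precision real, PO = symmetric positive definite,
     *         TRF = triangular (Cholesky) factorization. */
    float B[4] = {4.0f, 1.0f, 1.0f, 3.0f};   /* 2x2 SPD matrix, column-major */
    lapack_int info_chol =
        LAPACKE_spotrf(LAPACK_COL_MAJOR, 'L', 2, B, 2);

    return (info_qr == 0 && info_chol == 0) ? 0 : 1;
}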

4.1.3. Storage schemes and arguments

LAPACK is based on the internal usage of BLAS routines. Thus, the storage schemes are the same as those defined for BLAS (see Section 3.1.3), and the arguments defining matrices are very similar. For more information about these details, see [9].

4.1.4. LAPACK routines and organization

LAPACK offers three types of routines:

Driver routines: Devoted to the solution of complex problems, e.g., linear systems, least-squares problems, or eigenvalue problems.

Computational routines: Perform basic operations and are invoked from the drivers to obtain partial results. Some examples include matrix factorizations or the solution of a linear system with an already factored matrix (see the sketch after this list).

Utility routines: Auxiliary routines for the drivers and the computational routines. These can be divided into:

Routines implementing low-level functionalities; e.g., scaling a matrix.

Routines implementing sub-tasks of the blocked algorithms; e.g., unblocked versions of the algorithms.

Support routines, such as those used to query the optimal block size for the target machine.
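To illustrate the relation between driver and computational routines, the following sketch (a hypothetical usage example written against the LAPACKE C interface, with placeholder data and variable names) solves the same linear system either through the driver GESV or through the computational routines GETRF and GETRS, which the driver internally combines.

#include <lapacke.h>

int main(void)
{
    /* 2x2 system A x = b, column-major storage; the values are placeholders. */
    double A1[4] = {4.0, 2.0, 1.0, 3.0}, b1[2] = {1.0, 2.0};
    double A2[4] = {4.0, 2.0, 1.0, 3.0}, b2[2] = {1.0, 2.0};
    lapack_int ipiv1[2], ipiv2[2];

    /* Driver routine: solves the complete problem in one call. */
    lapack_int info = LAPACKE_dgesv(LAPACK_COL_MAJOR, 2, 1,
                                    A1, 2, ipiv1, b1, 2);

    /* Computational routines: LU factorization (GETRF) followed by the
     * solution of the system with the already factored matrix (GETRS). */
    lapack_int info_f = LAPACKE_dgetrf(LAPACK_COL_MAJOR, 2, 2, A2, 2, ipiv2);
    lapack_int info_s = LAPACKE_dgetrs(LAPACK_COL_MAJOR, 'N', 2, 1,
                                       A2, 2, ipiv2, b2, 2);

    /* b1 and b2 now hold the same solution vector. */
    return (info == 0 && info_f == 0 && info_s == 0) ? 0 : 1;
}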

4.1.5. Porting LAPACK-level routines to graphics processors

Due to the heavy use of LAPACK-level routines in many scientific and engineering applications, remarkable efforts have been made to exploit the capabilities of graphics processors for this type of operation.

Preliminary works were carried out using the former Cg/OpenGL frameworks for GPGPU, with remarkable results [84, 87, 66]. This interest was renewed with the introduction of CUDA. In [142], optimized codes for common decompositions (Cholesky, LU and QR) are developed and presented as near-optimal for some given GPU models. Similar approaches based on low-level coding efforts have been proposed in the last decade [132, 131, 140]. All of them pursue high-performance implementations on the GPU, with minor contributions from the programmability perspective.

We advocate here for a high-level approach similar to that presented for the BLAS in Chapter 3. The benefits (ease of development, independence from the underlying architecture, code reuse, . . . ) and key insights (evaluation of multiple algorithmic variants, thorough searches for optimal block sizes, . . . ) are similar to those obtained for the BLAS. The main goal is to demonstrate that the FLAME high-level approach is perfectly valid to attain high performance for LAPACK-level operations using the GPU as an accelerating device.

