
CHAPTER 3. BLAS ON SINGLE-GPU ARCHITECTURES

All experiments presented throughout this chapter were carried out using a single GPU (TESLA C1060) on PECO. The specific hardware and software details of the experimental setup were presented in Section 1.3.2.

3.1. BLAS: Basic Linear Algebra Subprograms

Fundamental dense linear algebra problems, such as the solution of systems of linear equations or eigenvalue problems, arise in a wide variety of applications in science and engineering. Chemical simulations, automatic control or integrated circuit design are just some examples of application areas in which linear algebra operations usually constitute the computationally most expensive part.

A set of basic operations frequently appears during the solution of these problems, such as the scalar product of two vectors, the solution of a triangular linear system or the product of two matrices. This set of basic linear algebra routines is grouped under the name BLAS (Basic Linear Algebra Subprograms). In fact, the development of BLAS was a joint effort of experts from diverse areas, so that the final specification covered the basic requirements that appear in many fields of science.

The BLAS specification was originally a trade-off between functionality and simplicity. The routines and their parameters were designed to be reasonable in number; simultaneously, the functionality was intended to be as rich as possible, covering those routines that often appear in complex problems. An illustrative example of the flexibility of the specification is the representation of a vector: the elements of the vector do not have to be stored contiguously in memory; instead, the corresponding routines provide a parameter (the increment) to define the physical separation between two logically consecutive elements of the vector.
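As a minimal sketch of this mechanism (using the C interface to BLAS, CBLAS; the routine cblas_ddot is standard, while the helper function below is hypothetical), the dot product of one row of a column-major matrix can be computed by passing the distance between consecutive row elements as the increment:

    #include <cblas.h>

    /* Illustrative sketch: dot product of row i of a column-major m x n matrix A,
       stored with leading dimension lda, with a contiguous vector y of length n.
       Consecutive elements of a row lie lda positions apart in memory, so the
       increment of the first vector is lda; y is contiguous, so its increment is 1. */
    double row_dot(const double *A, int lda, int i, const double *y, int n)
    {
        return cblas_ddot(n, &A[i], lda, y, 1);
    }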

Since the first definition of the specification, BLAS has been of great relevance in the solution of linear algebra problems. Its reliability, flexibility and the efficiency of the existing implementations allowed the emergence of other libraries that make internal use of the BLAS routines. In addition to reliability and efficiency, there are other advantages that make the use of BLAS so appealing:

Code legibility: the names of the routines denote their internal functionality. This standardization makes code simpler and more readable.

Portability: by providing a well-defined specification, the migration of codes to other machines is straightforward. Given a tuned implementation for the target machine, the ported codes remain optimized and highly efficient.

Documentation: there is a rich documentation available for each BLAS routine.

A generic implementation of BLAS has been available since its original definition [106]. This reference implementation offers the full functionality of the specification, but without any optimization specific to a particular hardware platform. However, the real value of BLAS lies in the tuned implementations developed for different hardware architectures. Since the publication of the specification, the development of tuned implementations adapted to each hardware architecture has been a task for either processor manufacturers or the scientific community. Today, vendor-specific implementations are usually employed as demonstrations of the full potential of a specific processor. There exist proprietary implementations from AMD (ACML) [4], Intel (MKL) [85], IBM (ESSL) [62] or SUN (SUN Performance Library) [111] for general-purpose architectures. Vendor-specific implementations are also emerging for novel specific-purpose architectures, such as NVIDIA CUBLAS for the NVIDIA graphics processors, or CSXL for the ClearSpeed boards. In general, the code for each routine in those implementations is designed to optimally exploit the resources of a given architecture. Independent third-party implementations, such as GotoBLAS [71] or ATLAS [145], also pursue optimized implementations for general-purpose processors; as of today, no independent implementation of BLAS for graphics processors exists that improves the performance of the full NVIDIA CUBLAS implementation, although partial solutions have been proposed for specific routines [142].

BLAS is usually implemented in C and Fortran but, in many cases, the use of assembly language allows fine-grained optimizations that extract the full potential of the architecture. On specific-purpose architectures, the use of ad-hoc languages is a must. As an example, the internal NVIDIA CUBLAS implementation makes heavy use of the CUDA language and of specific tuning techniques that are exclusive to this kind of hardware.

3.1.1. BLAS levels

The development of BLAS is closely bound to the development and evolution of hardware architectures. At the beginning of the BLAS development, in the early 1970s, the most common HPC computers were based on vector processors. With those architectures in mind, BLAS was initially designed as a group of basic operations on vectors (Level-1 BLAS, or BLAS-1). The ultimate goal of the BLAS specification was to motivate hardware vendors to develop fully optimized versions following the standard specification.

In 1987, the BLAS specification was extended with a set of routines for matrix-vector operations, usually known as Level-2 BLAS (or BLAS-2). Both the number of operations and the amount of data involved in these routines are of quadratic order. As for the first BLAS routines, a generic BLAS-2 implementation was published after the specification. One of the most remarkable features of these early implementations was the column-wise access to the data in the matrices. Efficient implementations of BLAS-2 can reduce the number of memory accesses by exploiting data reuse in the registers of the processor.
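A sketch of this access pattern for the matrix-vector product y := y + Ax on a column-major m × n matrix could be as follows (an illustration of the idea only, not the actual code of any reference or tuned implementation):

    /* Column-wise sketch of y := y + A*x for a column-major m x n matrix A
       stored with leading dimension lda. */
    for (int j = 0; j < n; j++) {
        double xj = x[j];                  /* x[j] is loaded once and reused from a register */
        for (int i = 0; i < m; i++)
            y[i] += A[i + j * lda] * xj;   /* column j is traversed contiguously */
    }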

The increase in the gap between processor and main memory speeds [129] led to the appearance of architectures with multiple levels of cache memory, and thus a hierarchical organization of the system memory. With the popularization of those architectures, it became accepted that libraries built on top of BLAS-1 and BLAS-2 routines would never attain high performance when ported to the new architectures: the main bottleneck of BLAS-1 and BLAS-2 codes is memory. While in these routines the ratio between the number of operations and the number of memory accesses is O(1), the ratio between the processor speed and the memory speed was much higher in those emerging architectures. The main implication of this speed gap is that the memory needs more time to feed the processor with data than the processor needs to process them. Thus, the performance of the BLAS-1 and BLAS-2 routines is mainly limited by the speed at which the memory subsystem can provide data to the execution units.

The third level of BLAS (Level-3 BLAS or BLAS-3) was defined in 1989 in response to the problems described above. The specification proposed a set of operations featuring a cubic number of floating-point operations on a quadratic amount of data. This imbalance between the number of calculations and the number of memory accesses allows a better exploitation of the principle of locality on architectures with hierarchical memories, provided carefully designed algorithms are applied. In practice, this hides the memory latency and offers performance close to the peak delivered by the processor. From the algorithmic viewpoint, these higher performances are attained by developing algorithms by blocks. These algorithms partition the matrix into sub-matrices (or blocks), grouping memory accesses and increasing the locality of reference. By exploiting this property, the probability of finding data in the higher (and faster) levels of the memory hierarchy increases, and thus the memory access penalty is reduced. The size and organization of the cache memories determine the optimal block size. Thus, exhaustive experimental searches are usually carried out to find those optimal values. Inside each matrix, column-wise access was also recommended by the specification [57].
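A minimal sketch of such a blocked algorithm for the update C := C + AB, with square n × n column-major matrices and block size nb, could be as follows (illustrative only; tuned BLAS implementations use far more elaborate kernels):

    /* Blocked sketch of C := C + A*B for column-major n x n matrices.
       nb is the block size, chosen so that three nb x nb blocks fit in cache. */
    for (int jj = 0; jj < n; jj += nb)
      for (int kk = 0; kk < n; kk += nb)
        for (int ii = 0; ii < n; ii += nb)
          /* Inner kernel: multiply block A(ii:ii+nb, kk:kk+nb) by block
             B(kk:kk+nb, jj:jj+nb). While the three blocks stay in cache,
             roughly 2*nb^3 flops are performed on only 3*nb^2 elements. */
          for (int j = jj; j < jj + nb && j < n; j++)
            for (int k = kk; k < kk + nb && k < n; k++) {
              double bkj = B[k + j * n];
              for (int i = ii; i < ii + nb && i < n; i++)
                C[i + j * n] += A[i + k * n] * bkj;
            }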

To sum up, the BLAS are divided into three levels:

Level 1: The number of operations and the amount of data increase linearly with the size of the problem.

Level 2: The number of operations and the amount of data increase quadratically with the size of the problem.

Level 3: The number of operations increases cubically with the size of the problem, while the amount of data increases only quadratically.

From the performance perspective, the main rationale behind this classification is the ratio between the number of operations and the amount of data involved in them. This ratio is critical on architectures with a hierarchical memory system, which are ubiquitous nowadays.
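For instance, for a problem dimension n = 1,000, the BLAS-1 vector update y := αx + y performs 2n = 2 · 10³ flops on roughly 2n vector elements, the BLAS-2 matrix-vector product performs 2n² = 2 · 10⁶ flops on about n² = 10⁶ matrix elements, while the BLAS-3 matrix-matrix product performs 2n³ = 2 · 10⁹ flops on only 3n² = 3 · 10⁶ elements; only in the last case does the number of operations per data element grow with the problem size.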

With the emergence of modern shared-memory multi-core and many-core architectures, parallel BLAS implementations have received further attention, and considerable effort has been devoted to adapting the implementations to these architectures. Usually, computations are parallelized by implementing a multi-threaded code and distributing the operations among the available processing units. In these implementations, the advantage of using Level-3 BLAS routines is even more dramatic, as the memory bandwidth becomes a stronger bottleneck when shared by several processors.

From the performance viewpoint of the parallel implementations, we can conclude that:

The performances of the BLAS-1 and BLAS-2 routines are dramatically limited by the pace at which memory can feed the processor with data.

The Level-3 BLAS is more efficient, as more calculations can be performed per memory access, attaining performance near the peak of the processor and near-optimal speedups for many operations. In addition, BLAS-3 routines can deliver higher parallel efficiencies, with a better adaptation to multi-core and many-core architectures.

3.1.2. Naming conventions

The concise, descriptive and homogeneous nomenclature of the BLAS routines proposed in the specification is one of its main advantages. The names of the BLAS routines consist of four to six characters. The main goal of the BLAS naming convention is to provide enough information about the functionality of the routine. In general, the names of the BLAS-2 and BLAS-3 routines present the form XYYZZZ, where:

X: Denotes the data type involved in the operation. Possible values are:

S: single-precision floating-point number.

D: double-precision floating-point number.

C: single-precision complex number.

Z: double-precision complex number.

YY: Denotes the type of matrix involved in the operation; the most common values are:

GE: general matrix.

SY: symmetric matrix.

TR: triangular matrix.

GB: general band matrix.

SB: symmetric band matrix.

ZZZ: Denotes the type of operation performed by the routine. This field is two or three characters long; common examples are MM for a matrix-matrix product, MV for a matrix-vector product, or RK for a rank-k update.

BLAS-1 routines do not follow the above scheme. As the operands of these operations are exclusively vectors, there is no need to encode a matrix type. In general, the name of these routines only conveys the data type (single or double precision, real or complex data) and the operation to be performed.
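For example, DGEMM = D + GE + MM denotes the double-precision product of two general matrices, STRSM = S + TR + SM the single-precision solution of a triangular system with multiple right-hand sides, and DSYR2K = D + SY + R2K the double-precision rank-2k update of a symmetric matrix. Among the BLAS-1 routines, DAXPY denotes the double-precision vector update y := αx + y, and SDOT the single-precision dot product of two vectors.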

3.1.3. Storage schemes

The BLAS implementations usually deal with three different storage types:

Canonical column-wise storage: The matrix is stored by columns, as is usual, for example, in Fortran, with consecutive columns laid out one after another in memory (see the indexing sketch after this list). This scheme is common in operations involving dense general matrices.

Band storage: Used for band matrices, it only stores elements inside the band.

Symmetric band storage: Used for symmetric or Hermitian band matrices, it only stores elements inside the band of the upper or lower triangular part of the matrix.
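As a sketch of the first scheme (illustrative only; these macros are not part of the BLAS interface), element (i, j) of an m × n matrix stored column-wise with leading dimension lda ≥ m lies at position i + j · lda of the array, so a block of a larger matrix can be passed to a BLAS routine simply as a pointer to its top-left element together with the leading dimension of the enclosing matrix:

    #include <stddef.h>

    /* Column-major indexing: element (i,j) of a matrix stored with leading
       dimension lda. */
    #define ELEM(A, i, j, lda)  ((A)[(size_t)(i) + (size_t)(j) * (size_t)(lda)])

    /* Pointer to the block whose top-left element is (r,c); the block is
       addressed with the same leading dimension as the enclosing matrix. */
    #define BLOCK(A, r, c, lda) (&(A)[(size_t)(r) + (size_t)(c) * (size_t)(lda)])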

3.1.4. Overview of the Level-3 BLAS operations

The Level-3 BLAS targets matrix-matrix operations. The functionality of the Level-3 BLAS was deliberately designed to be limited. For example, no routines for matrix factorizations are included, as they are supported by higher-level libraries, such as LAPACK, which implements blocked algorithms for those purposes, making use of the Level-3 BLAS whenever possible. Instead, the Level-3 BLAS is intended to be a set of basic matrix algebra operations from which the developer can implement more complex routines.

The routines offered by the Level-3 BLAS in real arithmetic support the following functionality:

Matrix-matrix products (routines xGEMM for general matrices, and xSYMM for symmetric matrices):

C := αAB + βC
C := αA^T B + βC
C := αAB^T + βC
C := αA^T B^T + βC

Rank-k and rank-2k updates of a symmetric matrix C (routines xSYRK and xSYR2K, respectively):

C := αAA^T + βC
C := αA^T A + βC
C := αAB^T + αBA^T + βC
C := αA^T B + αB^T A + βC

Multiplication of a general matrix C by a triangular matrix T (routine xTRMM):

C := αTC
C := αT^T C
C := αCT
C := αCT^T

Solution of triangular systems of linear equations with multiple right-hand sides (routines xTRSM):

B := αT^{-1}B
B := αT^{-T}B
B := αBT^{-1}
B := αBT^{-T}

In the definitions above, α and β are scalars, A, B and C are dense matrices (general or symmetric, as required by the corresponding routine), and T is an upper or lower triangular matrix. Analogous routines provide support for single and double precision and for real and complex arithmetic.
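As an illustration of how one of these operations is invoked on the graphics processor considered in this chapter, the following sketch computes C := αAB + βC via SGEMM through the legacy CUBLAS interface; the function name and the host array names are chosen for the example only, and error checking is omitted:

    #include <cublas.h>

    /* Sketch: compute C := alpha*A*B + beta*C on the GPU, where hA (m x k),
       hB (k x n) and hC (m x n) are column-major host arrays. */
    void gemm_on_gpu(int m, int n, int k,
                     float alpha, const float *hA, const float *hB,
                     float beta, float *hC)
    {
        float *dA, *dB, *dC;

        cublasInit();                                     /* initialize CUBLAS    */
        cublasAlloc(m * k, sizeof(float), (void **)&dA);  /* allocate GPU storage */
        cublasAlloc(k * n, sizeof(float), (void **)&dB);
        cublasAlloc(m * n, sizeof(float), (void **)&dC);

        /* Transfer the column-major operands to the GPU memory */
        cublasSetMatrix(m, k, sizeof(float), hA, m, dA, m);
        cublasSetMatrix(k, n, sizeof(float), hB, k, dB, k);
        cublasSetMatrix(m, n, sizeof(float), hC, m, dC, m);

        /* C := alpha*A*B + beta*C; 'N', 'N' selects the non-transposed case */
        cublasSgemm('N', 'N', m, n, k, alpha, dA, m, dB, k, beta, dC, m);

        /* Copy the result back to the host and release GPU resources */
        cublasGetMatrix(m, n, sizeof(float), dC, m, hC, m);
        cublasFree(dA); cublasFree(dB); cublasFree(dC);
        cublasShutdown();
    }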

Table 3.1 summarizes the Level-3 BLAS routines operating in real arithmetic. The names of the routines follow the conventions of the rest of the BLAS specification, with the first character denoting the data type of the matrix (S for single precision, D for double precision), the second and third characters denoting the type of matrix involved (GE, SY or TR for general, symmetric or triangular matrices, respectively), and the remaining characters denoting the type of operation (MM for matrix-matrix product, RK and R2K for rank-k and rank-2k updates of a symmetric matrix, respectively, and SM for the solution of a triangular system of linear equations with a matrix of right-hand sides).

To support the whole set of operations for each routine, all BLAS routines share a convention for their arguments. Table 3.2 summarizes the available arguments for the Level-3 BLAS routines just described. For each routine, the parameters necessarily follow this order:

1. Arguments specifying options.

The available option arguments are SIDE, TRANSA, TRANSB, TRANS, UPLO and DIAG, and their functionality and possible values are:
