- Matrix computations on systems equipped with GPUs
- Introduction
- The evolution of hardware for High Performance Computing
- The programmability issue on novel graphics architectures
- About this document. Motivation and structure
- Motivation and goals
- Structure of the document
- Description of the systems used in the experimental study
- Performance metrics
- Hardware description
- Software description
- The FLAME algorithmic notation
- The architecture of modern graphics processors
- The graphics pipeline
- Programmable pipeline stages
- The Nvidia G80 as an example of the CUDA architecture
- The architecture of modern graphics processors
- General architecture overview. Nvidia Tesla
- Memory subsystem
- The GPU as a part of a hybrid system
- Arithmetic precision. Accuracy and performance
- Present and future of GPU architectures
- Conclusions and implications on GPU computing
- BLAS on single-GPU architectures
- BLAS: Basic Linear Algebra Subprograms
- BLAS levels
- Naming conventions
- Storage schemes
- BLAS on Graphics Processors: NVIDIA CUBLAS
- Evaluation of the performance of NVIDIA CUBLAS
- Improvements in the performance of Level-3 NVIDIA CUBLAS
- gemm-based programming for the Level-3 BLAS
- Systematic development and evaluation of algorithmic variants
- Experimental results
- Impact of the block size
- Performance results for rectangular matrices
- Performance results for double precision data
- Padding
- Conclusions
- LAPACK-level routines on single-GPU architectures
- LAPACK: Linear Algebra PACKage
- LAPACK and BLAS
- Naming conventions
- Storage schemes and arguments
- LAPACK routines and organization
- Cholesky factorization
- Scalar algorithm for the Cholesky factorization
- Blocked algorithm for the Cholesky factorization
- Computing the Cholesky factorization on the GPU
- Basic implementations. Unblocked and blocked versions
- Padding
- Hybrid implementation
- LU factorization
- Scalar algorithm for the LU factorization
- Blocked algorithm for the LU factorization
- LU factorization with partial pivoting
- Computing the LU factorization with partial pivoting on the GPU
- Basic implementations. Unblocked and blocked versions
- Padding and hybrid algorithm
- Reduction to tridiagonal form on the graphics processor
- The symmetric eigenvalue problem
- Reduction to tridiagonal form. The LAPACK approach
- Reduction to tridiagonal form. The SBR approach
- Experimental Results
- Conclusions
- Matrix computations on multi-GPU systems
- Linear algebra computation on multi-GPU systems
- Programming model and runtime. Performance considerations
- Programming model
- Transfer management and spatial assignation
- Experimental results
- Impact of the block size
- Number of data transfers
- Performance and scalability
- Impact of data distribution
- Conclusions
- Matrix computations on clusters of GPUs
- Parallel computing memory architectures
- Shared memory architectures
- Distributed memory and hybrid architectures
- Accelerated hybrid architectures
- Parallel programming models. Message-passing and MPI
- ScaLAPACK
- PLAPACK
- Elemental
- Description of the PLAPACK infrastructure
- Layered approach of PLAPACK
- Usage of the PLAPACK infrastructure. Practical cases
- Porting PLAPACK to clusters of GPUs
- Experimental results
- Conclusions
- Conclusions
- Conclusions and main contributions
- Contributions for systems with one GPU
- Contributions for clusters of GPUs
- Related publications
- Publications directly related with the thesis topics
- Publications indirectly related with the thesis topics
- Other publications
- Open research lines
- FLAME algorithms for the BLAS-3 routines
3.3. IMPROVEMENTS IN THE PERFORMANCE OF LEVEL-3 NVIDIA CUBLAS
Name   | Letter | Dimensions
-------+--------+--------------------------------
Matrix | M      | Both dimensions are large.
Panel  | P      | One of the dimensions is small.
Block  | B      | Both dimensions are small.

Table 3.4: Different names given to the partitioned sub-matrices according to their shapes.
3.3.2. Systematic development and evaluation of algorithmic variants
The GEMM-based approach exploits the usual near-optimal performance of the GEMM routine in many BLAS implementations. In addition to the performance of the specific kernels used in each routine, a second key factor in the final performance achieved for each implementation is the shape of the blocks involved in those inner kernels. For general-purpose architectures, those shapes have a direct impact on important factors such as cache memory exploitation, number of TLB misses or register usage [72], to name only a few.
For a given linear algebra operation, there usually exists a set of loop-based algorithms. The number of flops involved is exactly the same for all algorithmic variants operating on the same data; the difference between them is basically the order in which the operations are performed and, in blocked algorithms, the performance of the specific underlying BLAS operations.
In our case, the performance of the routines in NVIDIA CUBLAS in general, and that of the GEMM implementation in particular, is considerably influenced by the shape of their operands, as shown in the performance evaluation of Section 3.2.1 (Figure 3.4). Thus, it is desirable to have multiple algorithmic variants at the disposal of the library user, so that the best algorithm can be chosen for each situation. Even though the operations ultimately performed by each algorithmic variant are the same, there are many factors, difficult to predict or control, that the different algorithmic variants can expose and address.
Let us illustrate this with an example: the derivation of different algorithms for the GEMM operation. Given an m × n matrix X, consider two different partitionings of X into blocks of rows and blocks of columns:

\[
X \;=\; \left(\, \hat{X}_0 \;\big|\; \hat{X}_1 \;\big|\; \cdots \;\big|\; \hat{X}_{n-1} \,\right)
\;=\;
\begin{pmatrix} \bar{X}_0 \\ \bar{X}_1 \\ \vdots \\ \bar{X}_{m-1} \end{pmatrix},
\qquad (3.4)
\]

where X̂j has nb columns and X̄i has mb rows. For simplicity, we will assume hereafter that the number of rows/columns of X is an integer multiple of both mb and nb.
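In column-major storage (the convention used by BLAS and CUBLAS), either family of panels in (3.4) can be addressed without copying, through simple pointer arithmetic. The following is a minimal sketch in C; the helper names `col_panel` and `row_panel` are ours, introduced only for illustration, and `ldx` denotes the leading dimension of X:

```c
#include <stddef.h>

/* Column-major m x n matrix X with leading dimension ldx (ldx >= m).
 * Illustrative helpers for the partitioning in (3.4); the names are
 * ours, not part of any BLAS/CUBLAS API. */

/* Pointer to column panel X^_j (nb consecutive columns of X). */
static float *col_panel(float *X, int ldx, int nb, int j) {
    return X + (size_t)j * nb * ldx;
}

/* Pointer to row panel X̄_i (mb consecutive rows of X); the panel's
 * entries remain strided with the same leading dimension ldx. */
static float *row_panel(float *X, int mb, int i) {
    return X + (size_t)i * mb;
}
```

Both panels alias the original storage, which is why blocked algorithms can sweep through a matrix without any data movement on the GPU.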
Depending on the dimensions of the sub-matrices (blocks) obtained in the partitioning process, we denote them as matrices, panels or blocks, as detailed in Table 3.4. The implementation of the matrix-matrix multiplication can be decomposed into multiplications with those sub-matrices. Thus, by applying different row and column partitioning schemes to the matrices A and B, it is possible to derive three different algorithmic variants, depending on the specific shapes of the sub-matrices involved. Each variant is named after the shapes of its operands:
Matrix-panel variant (GEMM MP): Figure 3.7 represents an algorithm in which, for each iteration, the computation involves a matrix (A) and a panel of columns (B1), to obtain
a new panel of columns of C. The algorithmic variant that proceeds this way is shown in Figure 3.8.
Panel-matrix variant (GEMM PM): In each iteration, a panel of rows of matrix A is multiplied by the whole matrix B, to obtain a new panel of rows of the result matrix C.
Panel-panel variant (GEMM PP): In each iteration, a panel of columns of matrix A and a panel of rows of matrix B are multiplied to update the whole matrix C.
The algorithms in FLAME notation for the last two variants are shown in Figure 3.10.
Notation: (X; Y) stacks blocks vertically; (X | Y) places blocks side by side.

Algorithm: GEMM PM(A, B, C)
  Partition A → (AT; AB), C → (CT; CB)
    where AT has 0 rows, CT has 0 rows
  while m(AT) < m(A) do
    Determine block size b
    Repartition
      (AT; AB) → (A0; A1; A2), (CT; CB) → (C0; C1; C2)
      where A1 has b rows, C1 has b rows

    C1 := C1 + A1 B

    Continue with
      (AT; AB) ← (A0; A1; A2), (CT; CB) ← (C0; C1; C2)
  endwhile

Algorithm: GEMM PP(A, B, C)
  Partition A → (AL | AR), B → (BT; BB)
    where AL has 0 columns, BT has 0 rows
  while n(AL) < n(A) do
    Determine block size b
    Repartition
      (AL | AR) → (A0 | A1 | A2), (BT; BB) → (B0; B1; B2)
      where A1 has b columns, B1 has b rows

    C := C + A1 B1

    Continue with
      (AL | AR) ← (A0 | A1 | A2), (BT; BB) ← (B0; B1; B2)
  endwhile

Figure 3.10: Algorithms for computing the matrix-matrix product. Panel-matrix variant (top), and panel-panel variant (bottom).
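Stripped of the FLAME notation, the three variants reduce to a single blocked loop around a GEMM kernel, differing only in which dimension is swept. The following is a minimal CPU sketch in C, assuming column-major storage and dimensions that are integer multiples of the block size; `gemm_kernel` is a plain triple loop standing in for the tuned kernel (e.g. the one provided by CUBLAS):

```c
#include <stddef.h>

/* Reference kernel: C(mc x nc) += A(mc x kc) * B(kc x nc), column-major,
 * with leading dimensions lda/ldb/ldc. Stands in for the tuned GEMM. */
static void gemm_kernel(int mc, int nc, int kc,
                        const float *A, int lda,
                        const float *B, int ldb,
                        float *C, int ldc) {
    for (int j = 0; j < nc; ++j)
        for (int i = 0; i < mc; ++i)
            for (int p = 0; p < kc; ++p)
                C[i + j * ldc] += A[i + p * lda] * B[p + j * ldb];
}

/* Matrix-panel (GEMM MP): sweep column panels of B and C, width nb. */
static void gemm_mp(int m, int n, int k, int nb,
                    const float *A, const float *B, float *C) {
    for (int j = 0; j < n; j += nb)
        gemm_kernel(m, nb, k, A, m,
                    B + (size_t)j * k, k, C + (size_t)j * m, m);
}

/* Panel-matrix (GEMM PM): sweep row panels of A and C, height mb. */
static void gemm_pm(int m, int n, int k, int mb,
                    const float *A, const float *B, float *C) {
    for (int i = 0; i < m; i += mb)
        gemm_kernel(mb, n, k, A + i, m, B, k, C + i, m);
}

/* Panel-panel (GEMM PP): sweep column panels of A / row panels of B,
 * depth kb; each iteration is a rank-kb update of the whole C. */
static void gemm_pp(int m, int n, int k, int kb,
                    const float *A, const float *B, float *C) {
    for (int p = 0; p < k; p += kb)
        gemm_kernel(m, n, kb, A + (size_t)p * m, m, B + p, k, C, m);
}
```

All three perform the same flops on the same data; what changes is the shape of the operands passed to the inner kernel, which is precisely the factor that governs performance on the GPU.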
Several algorithmic variants with matrix-panel, panel-matrix and panel-panel partitionings are also possible for the rest of the Level-3 BLAS. To show this, we propose three different algorithmic variants for the SYRK operation C := C − AA^T (with only the lower triangular part of C being updated), presented in FLAME notation in Figure 3.11. For each iteration of the loop, the algorithms in that figure also include the BLAS-3 kernel that has to be invoked in order to perform the corresponding operation.
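As an illustration, a panel-panel SYRK variant for C := C − AA^T can be sketched as a sequence of rank-kb updates of the lower triangle. The C code below is an illustrative CPU sketch (column-major, A of size m × k with k assumed to be an integer multiple of kb), not the tuned GPU implementation:

```c
#include <stddef.h>

/* Panel-panel SYRK sketch: C := C - A*A^T, with only the lower triangle
 * of C referenced and updated. A is m x k, C is m x m, both column-major
 * with leading dimension m. Each outer iteration applies a rank-kb
 * update with one column panel of A. Illustrative code only. */
static void syrk_pp(int m, int k, int kb, const float *A, float *C) {
    for (int p = 0; p < k; p += kb)          /* one column panel per iteration */
        for (int j = 0; j < m; ++j)
            for (int i = j; i < m; ++i)      /* lower triangle only */
                for (int q = p; q < p + kb; ++q)
                    C[i + (size_t)j * m] -= A[i + (size_t)q * m]
                                          * A[j + (size_t)q * m];
}
```

Restricting the update to the lower triangle performs roughly half the flops of a full GEMM-based update, which is why SYRK deserves its own set of variants rather than being expressed through GEMM alone.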