
[Figure: two plots titled "Data transfer cost on PECO. Non-pinned memory" (left) and "Data transfer cost on PECO. Pinned memory" (right). Each plot shows the percentage of time devoted to data transfers and the bandwidth (Gbytes/s) to the GPU and to the host, against the matrix size (m = n = k) from 3000 to 18000.]

Figure 3.6: Percentage of time devoted to data transfers for the matrix-matrix multiplication. The plots include the effective bandwidth offered by the PCI-Express bus using non-pinned memory and pinned memory (left and right, respectively).

3.3. Improvements in the performance of Level-3 NVIDIA CUBLAS

After analyzing the previous results, we conclude that the routine implementations in NVIDIA CUBLAS exhibit heterogeneous performance. This irregular behavior depends not only on the particular operation (compare the performance of different BLAS-3 routines in Figure 3.5), but also on each specific case of each routine (see the behavior of each case of the GEMM routine in NVIDIA CUBLAS in Figure 3.3).

This irregular behavior justifies one of the contributions of the work presented in this thesis. Low-level programming requires a deep knowledge of the underlying architectural details and ad-hoc programming languages. Very often, these details are not disclosed to the scientific community; in other cases, even the hardware vendors are not able to fully exploit them in every implementation.

Developing a low-level approach from the initial stages of the library development is not always the correct choice. Several key decisions can have critical consequences on performance, and very frequently they are difficult to evaluate when programming at a low level, leading to a loss of efficiency in the programming effort.

In this section, we demonstrate how, by applying some high-level methodologies for the derivation and evaluation of dense linear algebra algorithms, it is possible to extract key insights that are crucial to deliver high-performance codes. Moreover, we also show that these codes are highly competitive from the performance perspective, attaining better performance results than those obtained with a low-level programming methodology.

Our implementations improve the performance of the NVIDIA CUBLAS routines by applying two main techniques. First, for each BLAS-3 routine we cast a major part of the operations in terms of the high-performance GEMM routine. Second, we systematically derive a full set of blocked algorithmic variants and evaluate them to identify the optimal one for each combination of parameter values and routine. Other parameters, such as the block size, are also systematically studied for each algorithmic variant. The result is a full evaluation and a high-performance implementation of the main routines in the Level-3 BLAS. The attained results clearly improve those of NVIDIA CUBLAS for the tested routines.
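To make the second technique concrete, the following is a minimal sketch of such a block-size sweep, timing one algorithmic variant with CUDA events and keeping the fastest candidate. The harness (best_block_size, blas3_variant_t) and the candidate list are illustrative assumptions, not part of the evaluation infrastructure described in this thesis:

#include <cuda_runtime.h>

/* Hypothetical signature for a blocked BLAS-3 variant: problem sizes n, k
   and the block size b under evaluation. */
typedef void (*blas3_variant_t)(int n, int k, int b);

/* Time each candidate block size once and return the fastest one. */
static int best_block_size(blas3_variant_t variant, int n, int k)
{
    const int candidates[] = {16, 32, 64, 128, 256};
    int   best    = candidates[0];
    float best_ms = 1e30f;

    for (int i = 0; i < 5; i++) {
        cudaEvent_t t0, t1;
        cudaEventCreate(&t0);
        cudaEventCreate(&t1);

        cudaEventRecord(t0, 0);
        variant(n, k, candidates[i]);   /* run the variant to completion */
        cudaEventRecord(t1, 0);
        cudaEventSynchronize(t1);       /* wait for the GPU to finish    */

        float ms;
        cudaEventElapsedTime(&ms, t0, t1);
        if (ms < best_ms) { best_ms = ms; best = candidates[i]; }

        cudaEventDestroy(t0);
        cudaEventDestroy(t1);
    }
    return best;
}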


3.3.1. GEMM-based programming for the Level-3 BLAS

Improving the performance of the BLAS usually involves a deep knowledge of the internals of the target architecture, in this case the graphics architecture provided by NVIDIA, and a hard coding effort. This deep knowledge is often translated into a small set of complex and efficient codes adapted to the underlying architecture [142]. An illustrative example of this development complexity, as exposed in Section 3.2.1, is the NVIDIA CUBLAS implementation. While the GEMM implementation in NVIDIA CUBLAS delivers high performance, the rest of the Level-3 NVIDIA CUBLAS routines are sub-optimal, offering worse efficiency, far from that attained for the general matrix-matrix multiplication. This performance analysis gives an idea of the programming effort necessary to develop tuned codes for the GPU.

However, there is a different way of developing high-performance kernels for the BLAS. Following this alternative approach, the new implementations of the BLAS are based on a small set of tuned codes, or inner kernels, benefiting from their high performance to boost the performance of the new codes. This approach has been successfully applied in other high-performance BLAS implementations, such as GotoBLAS [72] for general-purpose processors, attaining remarkable speedups without affecting the development effort. As a novelty, we apply a similar technique to the development of a full set of optimized dense linear algebra routines for modern graphics processors.

Besides increasing performance, the main advantage of using tuned inner kernels is programmability. With this approach, the low-level architectural details have already been exploited in the inner-kernel implementation, and consequently the programmer does not have to deal with them in the new codes. In addition, any further optimization of the inner kernel will automatically translate into an improvement in the performance of higher-level codes, without any additional development effort.

Given the performance results observed in Section 3.2.1, in our case it will be convenient to use the GEMM implementation in NVIDIA CUBLAS as the building block for developing the rest of the tuned BLAS. In other words, the main goal is to derive algorithms for the BLAS-3 routines that reformulate a large part of their operations in terms of the high-performance GEMM implementation in NVIDIA CUBLAS.

This technique was first applied to general-purpose architectures by Kågström, Ling, and Van Loan [88], but the GEMM-based concept can be ported to any architecture, as long as a tuned implementation of the general matrix-matrix multiplication routine exists, to build a complete and tuned BLAS-3 implementation.

Building a BLAS-3 implementation based on these guidelines offers some remarkable advantages:

The GEMM routine is usually illustrative of the peak performance of the architecture, so manufacturers often devote great effort to tuning it, delivering near-optimal implementations for their own architectures. Although the efficiency of the GEMM implementation in NVIDIA CUBLAS is far from the theoretical peak of the architecture, this routine is considered a near-optimal implementation that exploits all the capabilities of the GPU for general-purpose applications, and is thus considered to reflect the practical peak performance of the architecture for this type of implementation [142]. Our GEMM-based implementations will inherit this high performance for each BLAS-3 routine.

With this approach, the optimization effort can be focused on just one routine (in this case, the matrix-matrix multiplication). The potential benefits of the optimization work are immediately effective for the rest of the routines. Similarly, with the advent of new architectures, or even with future modifications of the current ones, only one routine has to be adapted to the new hardware in order to obtain a global boost in performance.

Although this is a high-level approach, the insights gained from the evaluation of new routines (for example, optimal block sizes or algorithmic variants) can be directly applied to lower levels of abstraction (assembly code or CUDA code) for future native implementations.

The developed algorithms are portable to different architectures, provided that a tuned implementation of the matrix-matrix multiplication exists for those architectures.

Modularity and performance are two key goals in the development of linear algebra algorithms. Usually, the derivation of blocked algorithms is the first step towards optimizing the performance of the implementation, improving data reuse and the exploitation of the memory hierarchy in general-purpose architectures [70]. In addition to the performance boost, the usage of blocked algorithms also leads to modular algorithms, in which the operations performed on each block can be reformulated in terms of different BLAS-3 routines.

For example, the general matrix-matrix multiplication can be reformulated in terms of operations with sub-matrices of the original data. Figure 3.7 shows a possible algorithm for the matrix-matrix multiplication. At the beginning of the iteration, C0 and B0 have already been computed and used, respectively. In the current iteration, the next panel of matrix C is updated by performing the operation C1 := C1 + A·B1. Then, the preparation for the next iteration shifts C1 and B1 to the next blocks of data, increasing the number of columns in C0 and B0, since they contain more processed data at this point. This visual description, together with the three main stages involved in the algorithm (block definition, operations on blocks, and data repartitioning), leads to the algorithm given in Figure 3.8 in FLAME notation.
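In code, this matrix-panel loop can be sketched directly on top of the NVIDIA CUBLAS GEMM kernel. The following is a minimal sketch, assuming column-major device arrays and the legacy CUBLAS API; the routine name gemm_mp is illustrative and the block size b is a tunable parameter:

#include <cublas.h>

/* Matrix-panel GEMM sketch: C := C + A*B, advancing over panels of b
   columns of B and C, as in Figures 3.7 and 3.8. dA, dB, dC are device
   pointers to column-major arrays. */
void gemm_mp(int m, int n, int k,
             const float *dA, int lda,
             const float *dB, int ldb,
             float *dC, int ldc, int b)
{
    for (int j = 0; j < n; j += b) {
        int nb = (n - j < b) ? (n - j) : b;   /* width of current panel */
        /* C1 := C1 + A * B1, with B1 and C1 the panels starting at column j */
        cublasSgemm('N', 'N', m, nb, k,
                    1.0f, dA, lda,
                    dB + (size_t)j * ldb, ldb,
                    1.0f, dC + (size_t)j * ldc, ldc);
    }
}

Each pass through the loop corresponds to one iteration of the FLAME algorithm in Figure 3.8.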

As an illustrative implementation of the GEMM-based approach, we follow a block-oriented procedure similar to the one shown in Figure 3.7 for the blocked SYRK (symmetric rank-k update) implementation in BLAS-3. The conclusions and procedure applied to SYRK can be easily extended to the rest of the BLAS-3.

The SYRK routine performs one of the two following rank-k updates:

C := αAAᵀ + βC,

C := αAᵀA + βC,

where C is n × n, symmetric and stored as an upper or lower triangular matrix. The matrix A is n × k or k × n, respectively.

For each iteration of the blocked algorithm (see Figure 3.9), the update of the current block of columns of C involves two different tasks operating on two different blocks of C:

1. A diagonal block of the symmetric matrix C: C11 := C11 + A1·A1ᵀ.

2. An off-diagonal block of C: C21 := C21 + A2·A1ᵀ.

The operations involving the off-diagonal block of C can be carried out using the general matrix-matrix multiplication routine in BLAS (GEMM). Given the structure of the diagonal blocks, only those blocks must be updated using the rank-k update operation in BLAS (SYRK). Thus, in a major part of the iterations, most of the computation can be cast in terms of the highly tuned general matrix-matrix multiplication routine, and only certain blocks (in this case, the diagonal ones) need to be updated using the SYRK operation.
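A minimal sketch of this GEMM-based SYRK follows, for the lower-triangular case C := C + A·Aᵀ, again assuming column-major device arrays and the legacy CUBLAS API; the routine name syrk_gemm_based and the cleanup logic are illustrative:

#include <cublas.h>

/* GEMM-based SYRK sketch: C := C + A*A^T, with C n-by-n lower triangular
   and A n-by-k, both column-major on the device. Diagonal blocks are
   updated with the SYRK kernel; off-diagonal blocks with GEMM. */
void syrk_gemm_based(int n, int k,
                     const float *dA, int lda,
                     float *dC, int ldc, int b)
{
    for (int i = 0; i < n; i += b) {
        int ib   = (n - i < b) ? (n - i) : b;  /* rows in diagonal block */
        int rest = n - i - ib;                 /* rows below it          */

        /* 1. Diagonal block: C11 := C11 + A1 * A1^T (b x b SYRK) */
        cublasSsyrk('L', 'N', ib, k, 1.0f,
                    dA + i, lda,
                    1.0f, dC + i + (size_t)i * ldc, ldc);

        /* 2. Off-diagonal block: C21 := C21 + A2 * A1^T (bulk of the flops) */
        if (rest > 0)
            cublasSgemm('N', 'T', rest, ib, k,
                        1.0f, dA + i + ib, lda,
                        dA + i, lda,
                        1.0f, dC + (i + ib) + (size_t)i * ldc, ldc);
    }
}

As the text notes, the GEMM call performs the bulk of the flops, while the SYRK call touches only the b × b diagonal blocks.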


[Figure: visualization of the blocked GEMM algorithm. Step 1: partitioning before the iteration, with C split into (C0 | C1 | C2) and B into (B0 | B1 | B2). Step 2: computation in the iteration, C1 := C1 + A·B1. Step 3: progress of the partitioning in preparation for the next iteration.]

Figure 3.7: A visualization of the algorithm for the matrix-panel variant of GEMM.


Algorithm: GEMM_MP(A, B, C)

  Partition B → ( BL | BR ), C → ( CL | CR )
    where BL has 0 columns, CL has 0 columns
  while n(BL) < n(B) do
    Determine block size b
    Repartition
      ( BL | BR ) → ( B0 | B1 | B2 ), ( CL | CR ) → ( C0 | C1 | C2 )
      where B1 has b columns, C1 has b columns

    C1 := C1 + A·B1

    Continue with
      ( BL | BR ) ← ( B0 B1 | B2 ), ( CL | CR ) ← ( C0 C1 | C2 )
  endwhile

Figure 3.8: Matrix-panel variant of the algorithm for computing the matrix-matrix multiplication.

The total number of flops for the SYRK routine is kn(n + 1) [12]. Defining a block size b, and assuming the dimension n is a multiple of b, the number of flops required to update each diagonal block of C (of dimension b × b) in the algorithm in Figure 3.9 is:

fl_diag = kb(b + 1) ≈ kb² flops. (3.1)

As there are n/b diagonal blocks in C, the total number of flops cast in terms of the SYRK routine in the whole procedure is:

fl_inner_syrk = (n/b)·fl_diag = nk(b + 1) ≈ nkb flops. (3.2)

Therefore, the remaining operations are exclusively executed as general matrix-matrix operations, being:

fl_gemm = kn(n + 1) − nk(b + 1) = kn² − nkb flops. (3.3)

Observe that, while the number of flops cast in terms of SYRK increases linearly with n (provided b ≪ n), the number of flops cast in terms of GEMM increases quadratically with this dimension. This ratio, or GEMM-fraction of the algorithm (analogous to the Level-3 fraction in [70]), is critical for the final performance attained by the algorithm. Given the results obtained from the evaluation of GEMM in the NVIDIA CUBLAS implementation, the exploitation of this amount of GEMM-based flops is expected to have a considerable impact on the final performance of the new GEMM-based SYRK implementation.
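Dividing (3.3) by the total flop count kn(n + 1) makes this GEMM-fraction explicit; the following short derivation follows directly from the counts above, and the numeric instance uses illustrative values of n and b:

\[
\frac{fl_{\mathrm{gemm}}}{kn(n+1)}
  = \frac{kn^2 - nkb}{kn(n+1)}
  = \frac{n - b}{n + 1}
  \approx 1 - \frac{b}{n},
\]

so that, for instance, with n = 10000 and b = 32, more than 99.6% of the flops are cast in terms of GEMM.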


[Figure: visualization of the blocked SYRK algorithm. Step 1: partitioning before the iteration, with C split into blocks C00, C10, C11, C20, C21, C22 and A split into row blocks A0, A1, A2. Step 2: computation in the iteration, C11 := C11 + A1·A1ᵀ and C21 := C21 + A2·A1ᵀ. Step 3: progress of the partitioning in preparation for the next iteration.]

Figure 3.9: A visualization of the algorithm for the matrix-panel variant of SYRK.
