CHAPTER 3. BLAS ON SINGLE-GPU ARCHITECTURES
large matrices, since the dimensions grow by at most 63 rows/columns. The implementation creates a zero-initialized padded matrix in GPU memory for each operand, and then transfers the data from main memory to the correct position inside that padded buffer.
As a result of applying this technique, the performance attained by the padded kernel is uniform across all matrix sizes, hiding the irregular performance of the original NVIDIA CUBLAS implementation. There is some overhead associated with the padding process itself and with the non-contiguous storage of the data in GPU memory during the transfer of the matrices; however, its effect on the whole process is small, and the improvement when operating with dimensions that are not a multiple of 64 greatly pays off, as can be observed in Figure 3.19.
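The padding step can be sketched as follows (a minimal host-side illustration in plain C with hypothetical names, not the actual thesis code; on the GPU the equivalent steps would be a zeroing of the buffer followed by a strided copy into it):

```c
#include <stdlib.h>
#include <string.h>

/* Round a dimension up to the next multiple of 64, the tile size
 * favored by the CUBLAS kernels. */
static size_t round_up64(size_t n) {
    return (n + 63) & ~(size_t)63;
}

/* Embed a column-major m x n matrix A (leading dimension lda) into a
 * freshly allocated, zero-initialized buffer of padded dimensions
 * mp x np.  The zero fill is the padding; the caller receives the
 * padded dimensions and passes them to the BLAS kernel. */
float *pad_matrix(const float *A, size_t m, size_t n, size_t lda,
                  size_t *mp, size_t *np) {
    *mp = round_up64(m);
    *np = round_up64(n);
    float *Ap = calloc(*mp * *np, sizeof *Ap);  /* zeroed buffer */
    if (!Ap) return NULL;
    for (size_t j = 0; j < n; j++)              /* copy column by column */
        memcpy(Ap + j * *mp, A + j * lda, m * sizeof *Ap);
    return Ap;
}
```

The kernel then operates on the padded dimensions; since the extra rows and columns are zero, the relevant submatrix of the result is unchanged and is copied back ignoring the padding.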
3.5 Conclusions
Despite the amount of resources and effort that NVIDIA has invested in the development of NVIDIA CUBLAS, this library offers irregular performance depending on the specific routine being used. The performance of each routine implementation is highly heterogeneous, depending not only on the specific operation being executed, but also on the shape and size of the operands.
The main contributions of this chapter include a detailed evaluation of each BLAS routine in NVIDIA CUBLAS, and a collection of new highly tuned implementations. The optimization techniques used follow a high-level approach, with three main benefits:
• Better programmability, as no low-level optimizations are necessary and the generation of new, high-performance codes is straightforward.
• Easy code portability: the developed algorithms can be reused and benefit from further optimizations of the underlying building blocks, on the same or a different architecture.
• Higher performance, by exploiting high-performance inner kernels and an efficient organization of the calculations.
The systematic derivation of several algorithmic variants and of blocked, gemm-based implementations allows the developer to gain insights that can be applied directly to lower-level codes (e.g., CUDA codes), identifying the aspects that are critical to the performance of the optimized routines: the optimal block sizes, the choice of algorithmic variant, and transformations of the codes (padding) that achieve regular performance independently of the operand sizes.
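The gemm-based approach can be sketched with a small CPU example (plain C with hypothetical names, not the actual thesis codes): a symmetric rank-k update C := C + A·Aᵀ is traversed by block columns of width nb, so that the small diagonal blocks are handled separately while the bulk of the flops is cast in terms of gemm calls on the subdiagonal blocks.

```c
#include <stddef.h>

/* Reference gemm: C := C + A * B^T, with A m x k, B n x k, C m x n,
 * all column-major with the given leading dimensions.  In the GPU
 * codes this role is played by the tuned gemm kernel. */
static void gemm_nt(size_t m, size_t n, size_t k,
                    const double *A, size_t lda,
                    const double *B, size_t ldb,
                    double *C, size_t ldc) {
    for (size_t j = 0; j < n; j++)
        for (size_t i = 0; i < m; i++) {
            double s = 0.0;
            for (size_t p = 0; p < k; p++)
                s += A[i + p * lda] * B[j + p * ldb];
            C[i + j * ldc] += s;
        }
}

/* gemm-based syrk (lower triangular): C := C + A * A^T, with C n x n
 * and A n x k.  Each iteration updates one block column of width nb:
 * the small nb x nb diagonal block with the reference kernel (here the
 * strictly upper part of that block is filled too, which a strict syrk
 * would skip), and the large subdiagonal block with a single gemm, so
 * that most of the flops run inside gemm. */
void syrk_ln(size_t n, size_t k, size_t nb,
             const double *A, size_t lda, double *C, size_t ldc) {
    for (size_t j = 0; j < n; j += nb) {
        size_t jb = (n - j < nb) ? n - j : nb;   /* current block width */
        /* diagonal block C(j:j+jb, j:j+jb) */
        gemm_nt(jb, jb, k, A + j, lda, A + j, lda, C + j + j * ldc, ldc);
        /* subdiagonal block C(j+jb:n, j:j+jb) */
        if (j + jb < n)
            gemm_nt(n - j - jb, jb, k, A + j + jb, lda, A + j, lda,
                    C + (j + jb) + j * ldc, ldc);
    }
}
```

The same partitioning, applied with different loop orders and block shapes, yields the family of algorithmic variants evaluated in this chapter; the block size nb is the tuning parameter that trades kernel efficiency against the amount of work left outside gemm.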
As a result, a full implementation of the basic routines of the Level-3 BLAS has been obtained, which delivers performance similar to that of the highly tuned gemm implementation, with speedups between 2.5× and 4× over the NVIDIA CUBLAS counterparts for square matrices, and 14× for rectangular matrices. The new codes have demonstrated their efficiency in both single and double precision, yielding similar speedups.
Since the Level-3 BLAS are the basic building blocks of more complex algorithms, a detailed evaluation of their performance is necessary to understand the performance rates attained by higher-level libraries; moreover, optimizations applied to the BLAS can be ported to higher-level codes to improve their performance.