CHAPTER 3. BLAS ON SINGLE-GPU ARCHITECTURES

large matrices, as the dimensions are increased by at most 63 columns/rows. The implementation creates and zero-initializes a padded matrix in GPU memory for each operand, and then transfers the data from main memory to the correct position within the padded GPU buffer.
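The following is a minimal sketch of this padding scheme for a single-precision GEMM, assuming the CUBLAS v2 API and column-major storage; the helper names pad64 and gemm_padded are illustrative and not part of CUBLAS or of the codes developed in this chapter.

    /* Sketch of the padding technique: pad every dimension up to a multiple of 64,
       zero-fill the extra rows/columns, compute on the padded operands, and copy
       back only the meaningful block of the result. */
    #include <cuda_runtime.h>
    #include <cublas_v2.h>

    /* Round a dimension up to the next multiple of 64. */
    static int pad64(int n) { return ((n + 63) / 64) * 64; }

    /* C (m x n) := A (m x k) * B (k x n); operands stored column-major on the host. */
    void gemm_padded(cublasHandle_t handle, int m, int n, int k,
                     const float *A, const float *B, float *C)
    {
        int mp = pad64(m), np = pad64(n), kp = pad64(k);
        float *dA, *dB, *dC;

        /* Allocate padded buffers and clear them so the extra entries are zero. */
        cudaMalloc((void **)&dA, (size_t)mp * kp * sizeof(float));
        cudaMalloc((void **)&dB, (size_t)kp * np * sizeof(float));
        cudaMalloc((void **)&dC, (size_t)mp * np * sizeof(float));
        cudaMemset(dA, 0, (size_t)mp * kp * sizeof(float));
        cudaMemset(dB, 0, (size_t)kp * np * sizeof(float));
        cudaMemset(dC, 0, (size_t)mp * np * sizeof(float));

        /* Copy each operand into the top-left corner of its padded buffer;
           the leading dimension of the device copy is the padded row count. */
        cublasSetMatrix(m, k, sizeof(float), A, m, dA, mp);
        cublasSetMatrix(k, n, sizeof(float), B, k, dB, kp);

        /* GEMM on the padded (multiple-of-64) dimensions; the zero padding does
           not affect the m x n block of interest. */
        const float alpha = 1.0f, beta = 0.0f;
        cublasSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N, mp, np, kp,
                    &alpha, dA, mp, dB, kp, &beta, dC, mp);

        /* Retrieve only the meaningful m x n block of the result. */
        cublasGetMatrix(m, n, sizeof(float), dC, mp, C, m);

        cudaFree(dA); cudaFree(dB); cudaFree(dC);
    }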

As a result of applying this technique, the performance attained by the padded kernel is uniform across all matrix sizes, hiding the irregular performance of the original NVIDIA CUBLAS implementation. There is some overhead associated with the padding process and with the non-contiguous storage of the data in GPU memory during the transfer of the matrices; however, its effect on the whole process is small, and the improvement when operating with dimensions that are not multiples of 64 greatly pays off, as can be observed in Figure 3.19.

3.5. Conclusions

Despite the resources and effort that NVIDIA has invested in the development of NVIDIA CUBLAS, the library offers irregular performance depending on the specific routine being used. The performance of each routine implementation is highly heterogeneous, depending not only on the specific operation being executed, but also on the shape and size of the operands.

The main contributions of this chapter include a detailed evaluation of each BLAS routine in NVIDIA CUBLAS, and a collection of new, highly tuned implementations. The optimization techniques employed follow a high-level approach, with three main benefits:

Better programmability, as no low-level optimizations are necessary, and the generation of new, high-performance codes is straightforward.

Easy code portability, as the developed algorithms can be reused and benefit from further optimizations of the underlying building blocks on the same or a different architecture.

Higher performance, by benefiting from high-performance inner kernels and an efficient organization of the calculations.

The systematic derivation of several algorithmic variants and of blocked, GEMM-based implementations allows the developer to gain insights that can be directly applied to lower-level codes (e.g., CUDA codes), covering aspects that are critical to the performance of the optimized routines, such as the optimal block sizes, the choice of algorithmic variant, or the transformation of the codes to achieve regular performance independently of the operand sizes (padding). A sketch of such a blocked, GEMM-based formulation is shown below.
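As an illustration, the following hypothetical sketch shows how a blocked SYRK update can be cast mostly in terms of GEMM on the GPU, assuming the CUBLAS v2 API, single precision, column-major storage on the device, and a lower triangular update; the block size NB and the routine name are illustrative only and do not correspond to the actual codes of this chapter.

    /* Hypothetical blocked, GEMM-based SYRK variant: C := alpha*A*A^T + beta*C,
       with A of size n x k and only the lower triangle of C referenced. */
    void syrk_gemm_based(cublasHandle_t handle, int n, int k, float alpha,
                         const float *dA, int lda, float beta, float *dC, int ldc)
    {
        const int NB = 64;  /* block size: a tuning parameter (e.g., a multiple of 64) */
        for (int i = 0; i < n; i += NB) {
            int ib = (n - i < NB) ? (n - i) : NB;

            /* Diagonal block: small SYRK on the i-th row panel of A. */
            cublasSsyrk(handle, CUBLAS_FILL_MODE_LOWER, CUBLAS_OP_N,
                        ib, k, &alpha, dA + i, lda,
                        &beta, dC + i + (size_t)i * ldc, ldc);

            /* Blocks below the diagonal: a single GEMM per block column,
               which concentrates most of the floating-point operations. */
            int rows_below = n - i - ib;
            if (rows_below > 0)
                cublasSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_T,
                            rows_below, ib, k, &alpha,
                            dA + i + ib, lda, dA + i, lda,
                            &beta, dC + i + ib + (size_t)i * ldc, ldc);
        }
    }

Because most of the flops fall into the GEMM call for the sub-diagonal blocks, such a formulation inherits the performance of the tuned GEMM kernel, which is the rationale behind the GEMM-based approach followed in this chapter.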

As a result, a full implementation of the basic routines in the Level-3 BLAS has been obtained, which delivers performance similar to that achieved by the highly tuned GEMM implementation, with speedups between 2.5× and 4× over the NVIDIA CUBLAS counterparts for square matrices, and 14× for rectangular matrices. The new codes have demonstrated their efficiency for both single and double precision, yielding similar speedups.

As the Level-3 BLAS routines are the basic building blocks for more complex algorithms, a detailed evaluation of their performance is necessary to understand the performance of higher-level libraries; moreover, optimizations applied to the BLAS can be ported to higher-level codes to improve their performance.

