CHAPTER 7

Conclusions

7.1. Conclusions and main contributions

The main goal of this dissertation is the design, development, and evaluation of programming strategies to improve the performance of existing dense linear algebra libraries on systems based on graphics processors. This primary goal has been pursued while observing some important premises in the framework of dense linear algebra computations:

Performance: All linear algebra routines, independently of the target architecture, have attained significant speedups compared with their tuned counterparts for modern CPUs. Where possible, a comparison with tuned GPU implementations of the routines (such as NVIDIA CUBLAS in the case of single-GPU BLAS developments) has demonstrated that the proposed implementations also attain better performance for all routines. These results validate the application of the FLAME technology, and of the high-level approach in the development of new codes, not only from the programmability point of view but also from the performance perspective.

Programmability: As a second contribution of the thesis, we have applied the FLAME methodology, and a high-level development approach, to different GPU-based architectures. Remarkably, no low-level CUDA code had to be written for the routines developed for single-GPU architectures, systems based on multiple GPUs, or clusters of GPUs, yet high performance was obtained. This illustrates the benefits of applying FLAME to novel architectures as well, and its advantages for programmability, independently of the target architecture on which the routines are executed.

Versatility: As a third contribution, the separation of concerns introduced by FLAME (and, similarly, by PLAPACK) enables the application of strategies like those developed in this dissertation to virtually any accelerator-based architecture resembling those used in our evaluations. Our purpose was to detach the programmer's view from the underlying architecture. Proceeding in this manner, programming models and user codes remain essentially unmodified for accelerated and non-accelerated platforms alike. In addition, the concepts addressed at the architectural level are common to other types of accelerated platforms, not only GPU-based systems.

In addition to the proposed high-level methodologies and optimization techniques, the developed work has produced a full family of implementations of BLAS-level and LAPACK-level routines on three different GPU-based platforms: single-GPU systems, multi-GPU systems, and clusters of GPUs. Furthermore, many of the techniques described in this dissertation are general enough to be applied to other linear algebra implementations whenever a full, optimized implementation of BLAS- and LAPACK-level routines is desired on present and future accelerator-based architectures.

In general, the developed work has proved that a high-level approach, such as that pursued in the framework of the FLAME project, is perfectly valid for porting existing linear algebra libraries to novel architectures while attaining high performance at the same time. In addition, this type of methodology presents two main advantages. First, ease of code development: FLAME provides APIs and tools that ease the transition from algorithms to code [26, 126]. Second, flexibility and portability of the developed codes: similar simultaneous and independent studies have addressed low-level coding strategies for linear algebra implementations on the same target architectures as ours. Although the performance results of those works are remarkable, further coding effort will be necessary if those codes have to be adapted to future accelerator-based architectures. With our methodologies, the low-level programming effort is focused on a reduced number of basic routines, on top of which the rest of the codes are developed.
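As an illustration of this transition from algorithm to code, the sketch below shows a blocked right-looking Cholesky factorization written in the style of the FLAME/C API. It is only a sketch: routine and constant names follow the published FLAME conventions, but the exact libflame signatures may differ slightly.

#include "FLAME.h"

FLA_Error Chol_l_blk( FLA_Obj A, int nb_alg )
{
  FLA_Obj ATL, ATR,   A00, A01, A02,
          ABL, ABR,   A10, A11, A12,
                      A20, A21, A22;
  int b;

  FLA_Part_2x2( A, &ATL, &ATR,
                   &ABL, &ABR, 0, 0, FLA_TL );

  while ( FLA_Obj_length( ATL ) < FLA_Obj_length( A ) ) {
    b = ( FLA_Obj_length( ABR ) < nb_alg ?
          FLA_Obj_length( ABR ) : nb_alg );

    FLA_Repart_2x2_to_3x3( ATL, ATR,   &A00, &A01, &A02,
                                       &A10, &A11, &A12,
                           ABL, ABR,   &A20, &A21, &A22,
                           b, b, FLA_BR );

    FLA_Chol( FLA_LOWER_TRIANGULAR, A11 );            /* A11 := chol( A11 )       */
    FLA_Trsm( FLA_RIGHT, FLA_LOWER_TRIANGULAR,        /* A21 := A21 * inv( A11' ) */
              FLA_TRANSPOSE, FLA_NONUNIT_DIAG,
              FLA_ONE, A11, A21 );
    FLA_Syrk( FLA_LOWER_TRIANGULAR, FLA_NO_TRANSPOSE, /* A22 := A22 - A21 * A21'  */
              FLA_MINUS_ONE, A21, FLA_ONE, A22 );

    FLA_Cont_with_3x3_to_2x2( &ATL, &ATR,   A00, A01, A02,
                                            A10, A11, A12,
                              &ABL, &ABR,   A20, A21, A22,
                              FLA_TL );
  }
  return FLA_SUCCESS;
}

Note that the loop manipulates views of the matrix rather than explicit indices; retargeting such code to a GPU amounts to substituting the building blocks (FLA_Chol, FLA_Trsm, FLA_Syrk) with accelerated back-ends.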

Whether GPUs will remain the reference among accelerator-based platforms is not clear at this time. Manufacturers are working together on unified programming models, such as OpenCL. From the software perspective, these efforts should result in common software developments for all types of accelerator-based platforms. However, the programming effort required to tune these codes for each specific platform will remain non-negligible. In addition, the extension of the developed routines to more sophisticated platforms, such as systems with multiple accelerators or clusters of GPUs, will remain a non-trivial task. These problems will be even more severe if heterogeneous systems, with accelerators of different natures sharing common resources, appear in response to the heterogeneity of some applications.

We have validated the proposed methodologies on three different architectures: systems with one GPU, systems with multiple GPUs, and clusters of GPUs. The general conclusions extracted from the programmability and performance points of view are common to the three architectures. However, several specific contributions have been developed for each platform. The following sections detail those specific contributions.

7.1.1. Contributions for systems with one GPU

A full set of implementations for BLAS-level routines and for some representative LAPACK-level routines has been developed and evaluated on systems with one GPU.

A complete implementation and evaluation of the main Level-3 BLAS routines has been performed. Experimental results reveal how, by taking advantage of the high performance of the GEMM implementation in NVIDIA CUBLAS, all derived BLAS routines outperform the corresponding implementations in the library from NVIDIA. The programming effort is reduced, as no low-level CUDA code is necessary. In addition, as shown, a similar methodology applies to every routine in Level-3 BLAS; the results have been validated for single and double precision, yielding similar qualitative improvements in performance. Following the FLAME philosophy, the systematic development and evaluation of several algorithmic variants for each routine, together with a thorough evaluation to find the optimal block size for each one, offers a complete family of tuned implementations for each BLAS routine. From those results, the optimal combination of block size and algorithmic variant has been determined.
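To make the derivation concrete, the following sketch shows how a lower-triangular SYRK update can be cast almost entirely in terms of CUBLAS GEMM calls; this is the mechanism by which the performance of GEMM carries over to the derived routines. The code is illustrative only (scaling factors are fixed to one, and the block size b is assumed to be tuned experimentally).

/* Sketch: blocked C := C + A*A^T (lower triangle), column-major,
 * with A (n x k) and C (n x n) already resident in GPU memory.    */
#include <cublas_v2.h>

void syrk_ln_via_gemm( cublasHandle_t handle, int n, int k,
                       const float *dA, int lda,
                       float *dC, int ldc, int b )
{
    const float one = 1.0f;
    for ( int p = 0; p < n; p += b ) {
        int bp = ( n - p < b ) ? n - p : b;

        /* Diagonal block: a small SYRK keeps the update symmetric. */
        cublasSsyrk( handle, CUBLAS_FILL_MODE_LOWER, CUBLAS_OP_N,
                     bp, k, &one, dA + p, lda,
                     &one, dC + p + (size_t)p * ldc, ldc );

        /* Subdiagonal panel: one large GEMM performs most of the flops. */
        if ( p + bp < n )
            cublasSgemm( handle, CUBLAS_OP_N, CUBLAS_OP_T,
                         n - p - bp, bp, k, &one,
                         dA + p + bp, lda, dA + p, lda,
                         &one, dC + p + bp + (size_t)p * ldc, ldc );
    }
}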

Performance results improve on those of the proprietary library from NVIDIA (NVIDIA CUBLAS), attaining remarkable speedups. Despite the efforts devoted by the company to the optimization of the library, the advantages of our high-level approach are clear: speedups between 2x and 4x are attained for all the evaluated Level-3 BLAS routines, with peak speedups of about 14x in specific cases (for example, rectangular matrices).

Several contributions derive from the implementation of LAPACK-level routines on systems with one GPU. As for the BLAS implementations, a systematic derivation and evaluation of several algorithmic variants has been performed for the Cholesky factorization and the LU factorization with partial pivoting. The detailed study of the optimal block size for each variant yields remarkable speedups compared with tuned multi-threaded implementations on modern multi-core processors.

For the first time, a GPU-accelerated system is viewed as a hybrid architecture; in response, we propose a set of hybrid implementations in which each sub-operation of the factorizations is executed on the most suitable computing resource. Performance results for the hybrid approach improve on those of the basic implementations in all cases.
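A minimal sketch of this hybrid scheme for the Cholesky factorization follows, assuming the matrix already resides in GPU memory and that LAPACK's spotrf_ is available on the host; error handling and block-size tuning are omitted. The small, latency-bound diagonal factorization runs on the CPU, while the large TRSM/SYRK updates stay on the GPU.

#include <cublas_v2.h>

extern void spotrf_( const char *uplo, const int *n,
                     float *A, const int *lda, int *info );

/* dA: n x n, column-major, lower triangle; work: b*b host buffer. */
int chol_hybrid( cublasHandle_t handle, int n, float *dA, int lda,
                 int b, float *work )
{
    const float one = 1.0f, mone = -1.0f;
    int info = 0;
    for ( int p = 0; p < n; p += b ) {
        int bp = ( n - p < b ) ? n - p : b;
        float *dA11 = dA + p + (size_t)p * lda;

        /* Diagonal block on the CPU: the GPU is inefficient here. */
        cublasGetMatrix( bp, bp, sizeof(float), dA11, lda, work, bp );
        spotrf_( "L", &bp, work, &bp, &info );
        if ( info != 0 ) return info + p;
        cublasSetMatrix( bp, bp, sizeof(float), work, bp, dA11, lda );

        if ( p + bp < n ) {
            int m = n - p - bp;
            float *dA21 = dA + p + bp + (size_t)p * lda;
            /* Panel solve and trailing update on the GPU. */
            cublasStrsm( handle, CUBLAS_SIDE_RIGHT, CUBLAS_FILL_MODE_LOWER,
                         CUBLAS_OP_T, CUBLAS_DIAG_NON_UNIT,
                         m, bp, &one, dA11, lda, dA21, lda );
            cublasSsyrk( handle, CUBLAS_FILL_MODE_LOWER, CUBLAS_OP_N,
                         m, bp, &mone, dA21, lda, &one,
                         dA + p + bp + (size_t)(p + bp) * lda, lda );
        }
    }
    return 0;
}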

Regarding precision, a preliminary evaluation of the BLAS routines reveals a dramatic performance drop on modern graphics processors depending on the floating-point precision employed. To deal with this limitation, we apply a mixed-precision iterative refinement approach that combines the performance potential of modern GPUs with the accuracy of double-precision results in the solution of linear systems via the Cholesky factorization or the LU factorization.
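The following sketch illustrates the idea for a symmetric positive definite system solved via the Cholesky factorization: the factorization is computed once in fast single precision, while residuals and corrections are accumulated in double precision. It is a simplified CPU-only version (link against LAPACK); in the actual setting the single-precision work runs on the GPU.

#include <math.h>
#include <stdlib.h>

extern void spotrf_( const char *uplo, const int *n,
                     float *A, const int *lda, int *info );
extern void spotrs_( const char *uplo, const int *n, const int *nrhs,
                     const float *A, const int *lda,
                     float *B, const int *ldb, int *info );

/* Solves A x = b, with A an n x n SPD matrix in column-major order. */
int mixed_precision_solve( int n, const double *A, const double *b,
                           double *x, int max_iters, double tol )
{
    float  *As = malloc( (size_t)n * n * sizeof(float) );
    float  *zs = malloc( (size_t)n * sizeof(float) );
    double *r  = malloc( (size_t)n * sizeof(double) );
    int info, one = 1, iter, i, j;

    for ( i = 0; i < n * n; i++ ) As[ i ] = (float) A[ i ];
    spotrf_( "L", &n, As, &n, &info );   /* factor once, in single */
    if ( info != 0 ) return info;

    for ( i = 0; i < n; i++ ) x[ i ] = 0.0;

    for ( iter = 0; iter < max_iters; iter++ ) {
        /* Residual r = b - A x, accumulated in double precision. */
        double rnorm = 0.0;
        for ( i = 0; i < n; i++ ) {
            double s = b[ i ];
            for ( j = 0; j < n; j++ ) s -= A[ i + (size_t)j * n ] * x[ j ];
            r[ i ] = s;  rnorm += s * s;
        }
        if ( sqrt( rnorm ) < tol ) break;

        /* Correction z: reuse the cheap single-precision factors. */
        for ( i = 0; i < n; i++ ) zs[ i ] = (float) r[ i ];
        spotrs_( "L", &n, &one, As, &n, zs, &n, &info );
        for ( i = 0; i < n; i++ ) x[ i ] += (double) zs[ i ];
    }
    free( As ); free( zs ); free( r );
    return 0;
}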

The reduction to tridiagonal form has been implemented on systems with one GPU to illustrate the capabilities of the architecture for other types of LAPACK-level implementations. In this case, we have compared the implementations in LAPACK with our approach based on the SBR toolkit, following a hybrid approach. Performance results show that the latter option, which uses the GPU to accelerate the critical parts of the algorithm, outperforms the LAPACK implementation.

7.1.2. Contributions for multi-GPU systems

On systems with multiple GPUs sharing common resources, the programming effort is multiplied by the existence of several independent memory spaces, one per GPU attached to the system, plus that of main memory. Given that no direct GPU-to-GPU communication is possible, and that the PCI-Express bus becomes the main bottleneck as the number of GPUs increases, we propose a runtime that transparently and efficiently manages data allocation and communication between GPUs. In addition, the runtime orchestrates the execution of the routines by exploiting task-level parallelism, transparently from the programmer's point of view.
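As a rough illustration of how such a runtime can work (the names below are hypothetical, not the runtime's actual interface), each operation is recorded as a task that declares the blocks it reads and writes; the runtime derives inter-task dependencies from these annotations and schedules data transfers accordingly.

/* Sketch: a task records its kernel and the blocks it touches.
 * Dependencies follow from read/write overlaps between tasks.   */
typedef enum { TASK_GEMM, TASK_TRSM, TASK_SYRK, TASK_CHOL } task_type_t;

typedef struct {
    task_type_t type;
    int in_blocks[ 3 ];   /* ids of blocks read by the task             */
    int n_in;
    int out_block;        /* id of the block written by the task        */
    int gpu;              /* device chosen by the scheduler, -1 if none */
} task_t;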

In this sense, a second major contribution of our work is the treatment of each GPU in the system as a single processing unit, delegating further levels of parallelism to ad-hoc GPU implementations, such as NVIDIA CUBLAS.

Performance results for a set of representative BLAS routines and for the Cholesky factorization reveal the benefits of our approach. We propose a set of improvements to boost performance, namely treating each GPU memory as a software-managed cache of recently used blocks, and applying well-known techniques from computer architecture, such as a write-back policy that tolerates temporary inconsistencies between blocks in main memory and GPU memory, and a write-invalidate policy to keep data blocks coherent across the different memory spaces. The ultimate goal of these policies is to reduce the number of data movements between the accelerators and main memory. These techniques have demonstrated their benefits for all the evaluated operations.
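One possible realization of these policies (again with hypothetical names, purely for illustration) keeps per-block coherence state recording which device holds a valid or dirty replica:

#define MAX_GPUS 4

typedef struct {
    void *gpu_copy[ MAX_GPUS ];  /* device pointer, NULL if not cached    */
    int   valid   [ MAX_GPUS ];  /* replica on that GPU up to date?       */
    int   dirty   [ MAX_GPUS ];  /* written on that GPU, not yet flushed? */
    int   host_valid;            /* main-memory copy up to date?          */
} block_state_t;

/* On a write by GPU g: discard all other replicas (write-invalidate),
 * mark the writer's copy dirty, and defer the flush to main memory
 * until eviction or until another device needs the block (write-back). */
static void note_write( block_state_t *blk, int g )
{
    for ( int i = 0; i < MAX_GPUS; i++ ) blk->valid[ i ] = ( i == g );
    blk->dirty[ g ] = 1;
    blk->host_valid = 0;
}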

The proposed approach is fundamentally different from that of the MAGMA project. First, the integration of our methodology into libflame provides a straightforward, full port of all the
