
CHAPTER 6

Matrix computations on clusters of GPUs

In the previous chapter, we demonstrated how multi-GPU systems can be efficiently used to attain high performance without major changes from the programmability point of view. However, the scalability of this type of platform is a problem without an easy solution in the near future. The main bottleneck remains the PCI-Express bus. Systems with up to four GPUs attached to the same PCI-Express bus are relatively common nowadays, but including a higher number of GPUs incurs a serious bottleneck in data transfers with current technology.

To address this problem, clusters with a small number of hardware accelerators attached to each node seem an effective solution to the performance demands of large-scale HPC applications. As the performance of interconnection networks (e.g., InfiniBand) improves, the gap between them and PCI-Express shrinks, and they become comparable in bandwidth and latency. Thus, the overhead introduced by the use of distributed memory can be masked by the second penalty induced by the use of the PCI-Express bus.

From the software point of view, as of today there are no dense linear algebra libraries adapted to nodes extended with hardware accelerators (e.g., GPUs). In this chapter, we propose an extension of the well-known PLAPACK library to adapt it to clusters of GPUs. The selection of this library is based on its modular and layered design and, following the programmability goals stated throughout the dissertation, on its high-level approach from the developer's point of view. We propose different techniques to improve performance and reduce data transfers, and show experimental results for some common dense linear algebra operations from BLAS and LAPACK on a large GPU cluster.
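The central idea behind libraries of this kind is that a logically global matrix is partitioned into blocks that are mapped onto a two-dimensional grid of processes, so each process holds (and computes on) only its local pieces. As a minimal illustrative sketch, the following shows the ScaLAPACK-style 2D block-cyclic mapping; PLAPACK itself uses a physically based matrix distribution rather than this exact scheme, and the function names here are hypothetical, but the notion of assigning blocks to process-grid coordinates is analogous:

```python
# Illustrative sketch only (not PLAPACK's actual API): map blocks of a
# global matrix onto a Pr x Pc process grid in 2D block-cyclic fashion,
# the distribution used by ScaLAPACK-like distributed-memory libraries.

def owner_of_block(i_blk, j_blk, Pr, Pc):
    """Process-grid coordinates that own global block (i_blk, j_blk).

    Blocks are dealt out round-robin along both grid dimensions, so
    block row i_blk wraps modulo Pr and block column j_blk modulo Pc.
    """
    return (i_blk % Pr, j_blk % Pc)

def owner_of_entry(i, j, b, Pr, Pc):
    """Process-grid coordinates that own global matrix entry (i, j),
    assuming square b x b distribution blocks."""
    return owner_of_block(i // b, j // b, Pr, Pc)

# Example: an 8x8 matrix with 2x2 blocks over a 2x2 process grid.
# Entry (0, 0) sits in block (0, 0), owned by process (0, 0);
# entry (2, 2) sits in block (1, 1), which wraps to process (1, 1).
print(owner_of_entry(0, 0, 2, 2, 2))  # -> (0, 0)
print(owner_of_entry(2, 2, 2, 2, 2))  # -> (1, 1)
```

In a GPU-accelerated cluster, each process would additionally stage its local blocks into the memory of the accelerator attached to its node before invoking the local BLAS kernels, which is precisely where the PCI-Express transfer cost discussed above enters the picture.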

The chapter is structured as follows. Sections 6.1 and 6.2 introduce the basic concepts behind distributed-memory architectures and message-passing programming, respectively, which will be useful in the rest of the chapter. In Section 6.3 we offer a review of the most widespread libraries for dense linear algebra computation on distributed-memory architectures; this overview includes present and forthcoming libraries. In Section 6.4 we describe the layered structure of the PLAPACK library. Section 6.5 describes the process and the design decisions taken to port PLAPACK to clusters with GPUs. Experimental results on a large GPU cluster are given in Section 6.6. Finally, Section 6.7 summarizes the main contributions of the chapter.

