CHAPTER 5. MATRIX COMPUTATIONS ON MULTI-GPU SYSTEMS

[Figure 5.16 plots. Left panel: "SYRK on 4 GPUs on TESLA2", series Runtime Versions 1 to 4. Right panel: "SYRK implementations on TESLA2", series Algorithm-by-blocks on 4 GPUs, Best blocked algorithm on 1 GPU, NVIDIA CUBLAS on 1 GPU. Both panels plot GFLOPS (0 to 1000) against matrix size (0 to 20000).]

Figure 5.16: Performance (left) and comparison with mono-GPU implementations (right) of the SYRK implementation using 4 GPUs on TESLA2.

which could be gained from it are very similar to those already obtained for the Cholesky factorization and other BLAS routines. Instead, we perform a comparison of the algorithm-by-blocks on four GPUs of TESLA2, the same algorithm on only one GPU, our tuned blocked implementation on one GPU, and the NVIDIA CUBLAS implementation.

Figure 5.17 compares the best performance attained using the run-time system on the multi-GPU setup (with four GPUs) with the three mono-GPU implementations. Note that we attain more than 1 TFLOPS using four GPUs in the same system for the largest tested matrices. In particular, the performance rate for n = 20,480 is 1.1 TFLOPS.

By comparison, the single-GPU implementations attain 376 GFLOPS with our best blocked algorithm (see Chapter 3 for more details), 295 GFLOPS with the developed runtime running on one GPU, and 119 GFLOPS with the NVIDIA CUBLAS implementation.
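The rates quoted above can be turned into speedups with a few divisions. The short Python sketch below does only that arithmetic; it takes the rounded 1.1 TFLOPS value as 1100 GFLOPS (an assumption, since the exact measured rate is not given in the text), and the dictionary keys are labels chosen here for illustration.

```python
# Throughput figures quoted in the text, in GFLOPS.
# 1.1 TFLOPS is a rounded value, taken here as 1100 GFLOPS (assumption).
MULTI_GPU_GFLOPS = 1100.0
NUM_GPUS = 4

SINGLE_GPU_GFLOPS = {
    "best blocked algorithm": 376.0,
    "runtime on 1 GPU": 295.0,
    "NVIDIA CUBLAS": 119.0,
}

# Speedup of the four-GPU run over each single-GPU implementation.
speedups = {
    name: MULTI_GPU_GFLOPS / rate for name, rate in SINGLE_GPU_GFLOPS.items()
}

# Parallel efficiency relative to the same runtime on a single GPU.
efficiency = speedups["runtime on 1 GPU"] / NUM_GPUS

for name, value in speedups.items():
    print(f"{name}: {value:.2f}x")
print(f"parallel efficiency vs. runtime on 1 GPU: {efficiency:.0%}")
```

Under that assumption the four-GPU run is roughly 2.9 times faster than the best blocked single-GPU code, 3.7 times faster than the runtime on one GPU (about 93% parallel efficiency), and 9.2 times faster than CUBLAS.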

5.6. Conclusions

The emergence of a new hardware architecture usually demands extensive effort on the software side in order to exploit its full potential. Multi-GPU systems are no exception, and several works have advocated low-level ad-hoc implementations to fully exploit the huge performance available in this type of architecture.

Following the rationale of the rest of this thesis, our approach and main contribution is essentially different. We advocate a high-level approach that shields the library developer from the particularities of the underlying architecture while still treating performance as the main goal of our implementations.

To accomplish this, our first contribution is a reformulation of multi-GPU systems, viewing them as a multi-core architecture and considering each GPU in the system as a single core. With this analogy, many well-known concepts and techniques successfully applied in the past to shared- and distributed-memory programming can also be applied to modern multi-GPU architectures.
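The analogy can be sketched in a few lines of Python: one worker thread stands in for each GPU and pulls independent block tasks from a shared queue, exactly as cores would in a shared-memory runtime. This is only an illustration under stated assumptions, not the runtime developed in the thesis; `run_tasks`, `NUM_GPUS`, and the task tuples are hypothetical names, and calling a plain Python function stands in for launching a kernel on the corresponding device.

```python
import queue
import threading

NUM_GPUS = 4  # hypothetical: one worker thread plays the role of one GPU


def run_tasks(tasks, num_workers=NUM_GPUS):
    """Run independent (name, function) tasks on a pool of per-GPU workers."""
    work = queue.Queue()
    for task in tasks:
        work.put(task)

    results = {}
    lock = threading.Lock()

    def worker(gpu_id):
        # Each worker behaves like a core: fetch a task, execute it, repeat.
        while True:
            try:
                name, fn = work.get_nowait()
            except queue.Empty:
                return
            value = fn()  # stand-in for a kernel launch on GPU gpu_id
            with lock:
                results[name] = (gpu_id, value)

    threads = [
        threading.Thread(target=worker, args=(gpu_id,))
        for gpu_id in range(num_workers)
    ]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return results


# Eight independent block tasks; which worker runs which task is not fixed.
demo_tasks = [(f"block{i}", lambda i=i: i * i) for i in range(8)]
demo_results = run_tasks(demo_tasks)
```

The sketch deliberately ignores dependencies between tasks and the separate memory space of each GPU, which are the two points the following paragraph identifies as the real difficulty.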

However, there are specific characteristics of this kind of architecture that pose challenging difficulties for the implementation of efficient run-time systems; specifically, we refer to data transfers and separate memory spaces. In response to this problem, a second contribution of the chapter is

[Figure 5.17 plot: "GEMM implementations on TESLA2", series Algorithm-by-blocks on 4 GPUs, Algorithm-by-blocks on 1 GPU, Best blocked algorithm on 1 GPU, NVIDIA CUBLAS on 1 GPU. GFLOPS (0 to 1000) against matrix size (0 to 18000).]

Figure 5.17: Performance comparison with mono-GPU implementations of the GEMM implementation using 4 GPUs on TESLA2.

a run-time system that is responsible not only for exploiting task parallelism, scheduling tasks to execution units and tracking data dependencies, but also for handling data transfers between GPUs transparently and efficiently.
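To make "tracking data dependencies" concrete, the following Python sketch derives the dependencies between block tasks from the blocks each task reads and writes, using a blocked Cholesky factorization of a 3 x 3 block matrix as the task list. It is a generic illustration and an assumption about how such tracking can be done, not the thesis's runtime; `build_dependencies` and the block names are hypothetical.

```python
from collections import defaultdict


def build_dependencies(tasks):
    """Map each task index to the set of earlier tasks it must wait for.

    tasks: list of (name, blocks_read, blocks_written) in program order.
    """
    last_writer = {}               # block -> index of the last task writing it
    readers = defaultdict(list)    # block -> readers since that last write
    deps = {i: set() for i in range(len(tasks))}

    for i, (_name, reads, writes) in enumerate(tasks):
        for block in reads:
            if block in last_writer:
                deps[i].add(last_writer[block])    # read after write
        for block in writes:
            if block in last_writer:
                deps[i].add(last_writer[block])    # write after write
            deps[i].update(readers[block])         # write after read
        # Record this task only after its own dependencies are computed.
        for block in reads:
            readers[block].append(i)
        for block in writes:
            last_writer[block] = i
            readers[block] = []
    return deps


# Blocked Cholesky on a 3 x 3 block matrix (lower triangle), in program order.
cholesky_tasks = [
    ("CHOL A00", ["A00"], ["A00"]),                 # 0
    ("TRSM A10", ["A00", "A10"], ["A10"]),          # 1
    ("TRSM A20", ["A00", "A20"], ["A20"]),          # 2
    ("SYRK A11", ["A10", "A11"], ["A11"]),          # 3
    ("GEMM A21", ["A20", "A10", "A21"], ["A21"]),   # 4
    ("SYRK A22", ["A20", "A22"], ["A22"]),          # 5
    ("CHOL A11", ["A11"], ["A11"]),                 # 6
    ("TRSM A21", ["A11", "A21"], ["A21"]),          # 7
    ("SYRK A22", ["A21", "A22"], ["A22"]),          # 8
    ("CHOL A22", ["A22"], ["A22"]),                 # 9
]
cholesky_deps = build_dependencies(cholesky_tasks)
```

Tasks whose dependency sets are already satisfied can be dispatched to any idle GPU. Here, for example, tasks 1 and 2 both depend only on task 0, so they can run concurrently on two devices.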

We have introduced techniques to reduce the amount of data transferred and thus increase data locality. We have also validated the efficiency of the runtime by evaluating many well-known dense linear algebra operations. The scalability and peak performance of the implementations are remarkable. Although the programmability of the solution is difficult to measure, the FLAME programming model allows a straightforward transition from existing sequential codes to parallel codes that exploit task parallelism.
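The effect of keeping blocks resident on a GPU can be estimated with a toy count of host-to-GPU copies. The Python sketch below is an illustration under stated assumptions, not a description of the specific techniques used in this chapter: it considers a blocked matrix product C = A B on a 4 x 4 grid of blocks, assigns every task to the GPU that owns its C block under a hypothetical 2 x 2 cyclic mapping, and counts only the read-only operands A and B. `count_transfers` and `owner` are hypothetical helpers.

```python
def count_transfers(tasks, cached):
    """Count block copies to the GPUs for a list of (gpu, blocks_read) tasks.

    cached: if True, a GPU keeps every block it has already received.
    Only read-only operands are modelled, so no invalidation is needed.
    """
    resident = {}  # gpu -> set of blocks currently held in its memory
    transfers = 0
    for gpu, reads in tasks:
        held = resident.setdefault(gpu, set())
        for block in reads:
            if not cached or block not in held:
                transfers += 1
                held.add(block)
    return transfers


def owner(i, j):
    """Hypothetical 2 x 2 cyclic mapping of C blocks onto four GPUs."""
    return (i % 2) * 2 + (j % 2)


BLOCKS = 4  # the matrices are split into BLOCKS x BLOCKS blocks

# One task per update C[i][j] += A[i][k] * B[k][j], run by the owner of C[i][j].
gemm_tasks = [
    (owner(i, j), [("A", i, k), ("B", k, j)])
    for i in range(BLOCKS)
    for j in range(BLOCKS)
    for k in range(BLOCKS)
]

without_cache = count_transfers(gemm_tasks, cached=False)
with_cache = count_transfers(gemm_tasks, cached=True)
print(without_cache, with_cache)
```

In this toy setting, reusing resident blocks halves the number of copies (128 without reuse, 64 with it). A real runtime must also handle blocks that are written, which requires a coherence policy that this sketch leaves out.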

Another remarkable contribution of the work is that most of the concepts and techniques presented are not exclusive to a given runtime system or even a specific architecture. Similar techniques have been applied by the author of the thesis to port the SMPSs runtime to platforms with multiple GPUs in a way that is transparent to the programmer [16]. This runtime (GPUSs) has also been successfully tested with other types of hardware accelerators (ClearSpeed boards [50]) with similar performance results, which is a clear demonstration of the portability of the proposed solution.

Part IV

Matrix computations on clusters of GPUs
