Добавил:

Upload Опубликованный материал нарушает ваши авторские права? Сообщите нам.

Вуз:

Санкт-Петербургский государственный электротехнический университет "ЛЭТИ"

Предмет:

[НЕСОРТИРОВАННОЕ]

Файл:

MatrixCUDAFranDissertation.pdf

Скачиваний:

Добавлен:

22.03.2016

Размер:

2.18 Mб

Скачать

☆

<<< < Предыдущая 21 22 23 24 25 26 27 28 29 30 31 3233 / 4733 34 35 36 37 38 39 40 41 42 43 44 45 > Следующая >>>

5.4. EXPERIMENTAL RESULTS

	1	2	3	4
Techniques	-	Data-a nity	Data-a nity	Data-a nity

	-	Own/alien blocks	Own/alien blocks	Own/alien blocks
	-	Write-through	Write-through	Write-back


Software cache	-	-	Alien blocks cached	Alien blocks cached
	-	-	Write-invalidate	Write-invalidate


Beneﬁts	-	Reduce number of writes	Reduce number of writes	Reduce number of reads
		to GPU (own blocks)	to GPU (alien blocks)	from GPU

Table 5.6: Summary of the techniques and beneﬁts introduced by the successive improvements in the runtime.

Version 2: Static data layout combined with write-through. The number of writes of own blocks is reduced as own blocks already transferred to a GPU are not transferred there again by future instructions.

Version 3: Software-cache of alien blocks. The number of writes of alien blocks is reduced by keeping them in a cache managed in software.

Version 4: Software-cache of alien blocks with write-back. The number of reads of own blocks is reduced by only maintaining the coherence between main memory and GPU memories when it is strictly necessary.

As an additional advantage, the improvements introduced in the runtime and the design of the runtime itself are independent from the underlying architecture. Assume a system consisting of a machine with a central memory pool with multiple hardware accelerators connected to it, each one with an independent memory space. With this setup, the techniques exposed above and their beneﬁts can implemented with minor modiﬁcations.

5.4.Experimental results

5.4.1.Impact of the block size

The block size is a critical factor to attain high performance in algorithms-by-blocks. Independently of the techniques used to reduce data transfers, to improve data a nity, etc. an incorrect election of this parameter lead to sub-optimal performance. The ﬁrst set of experiments aims at determining the optimal value for the block size, as well as its e ect on the ﬁnal performance of the parallel implementation.

In our case, the optimal value for the block size is a trade-o between a number of factors:

Potential parallelism: a smaller block size is translated into a ﬁner granularity. A larger block dimension leads to less concurrency. Given the pool of execution units available in the system, the size of the block greatly inﬂuences the degree of parallel execution on the sub-problems, and reduces idle times.

Data transfer e ciency: a small block size is translated into a larger number of data transfers of reduced dimension. The latency of the PCI-Express bus plays a critical role in this case, and data transfers of small blocks have a signiﬁcant performance penalty. Conversely, a

147

CHAPTER 5. MATRIX COMPUTATIONS ON MULTI-GPU SYSTEMS

larger block dimension translates into higher e ective bandwidth, and thus beneﬁts the ﬁnal performance of the implementation.

Inner BLAS kernels performance: as shown in Chapter 3, the BLAS implementations on the GPU deliver higher performances when operating with relatively large matrices, as those derived from the utilization of a big block size.

Taking into account the combination of these factors, the performance of parallel implementations will be a ected by the chosen block size from two perspectives:

1.If the block size is too small, data transfers become a serious bottleneck in the computation, and BLAS executions are ine cient on the GPU.

2.If the block size is too large, the degree of concurrency decreases and resources are wasted. In addition, data transfers through the bus are e cient, and so are BLAS executions.

In summary, a thorough analysis of the performance of the runtime system for a given linear algebra algorithm-by-blocks should show an increasing performance as the block size is increased (and thus, BLAS executions and data transfers are more e cient), attaining a peak point as potential parallelism in the algorithm reaches the optimal point. From this dimension on, parallelism decreases, and thus parallel e ciency should drop.

Consider for example the performance results in Figure 5.7. The ﬁgure shows the performance of the right-looking variant of the Cholesky factorization in GFLOPS (Y axis) for the most tuned version of the runtime implementation for several matrix sizes (Z axis), with block size (X axis) increasing from a relatively small size (b = 128) to a larger one (b = 1600). In this example, the four GPUs available on TESLA2 were used, following a 2-D data layout. Similar results have been observed for the rest of the implementations of the runtime and data a nity options. The ﬁgure o ers information on the optimal block size for each matrix size.

As predicted, the behavior of the performance curves for each matrix size is quite regular and predictable. In each case, performance is increased as the block size grows until it reaches an optimal value. From this point on, a drop in the performance rate occurs. These results demonstrate the assumptions made on the BLAS and data transfer performance, and potential parallelism. In fact, note how the optimal block size (shown in the ﬁgure as a green line) is also increased as matrix size grows. Thus, taking into account only the potential parallelism factor, the optimal block size partially depends on the total matrix dimension.

For the rest of the experimental results in this chapter, performance values for the optimal block size are shown. A complete analysis of the performance attained using a wide range of block sizes has been carried out for each operation and runtime version.

5.4.2.Number of data transfers

Once the optimal block size is found and ﬁxed for a given matrix size, the beneﬁts of the di erent implementations of the runtime are mostly dictated by the number of data transfers necessary to perform a given operation. In fact, the reduction in the number of data transfers is the main goal of the successive improvements in the runtime implementation.

To demonstrate the beneﬁts of the incremental improvements introduced from a basic runtime version, the next experiment evaluates the performance of the implementation of the algorithm- by-blocks for the Cholesky factorization (right-looking variant) on the four GPUs of TESLA2, for a ﬁxed block size (b = 768), using the four versions of the runtime, and a 2-D data layout when

148

5.4. EXPERIMENTAL RESULTS

Cholesky performance varying the block size on TESLA2

GFLOPS

750

600

450

300

150

0
20000
16000
Matrix	12000
Matrix	8000
	size

4000

Optimal block size

		1152	1408
	640	896
	640	size
0	384	Block
0	128

Figure 5.7: Performance of the Cholesky factorization using the 4 GPUs in TESLA2, varying the block size. Runtime version 4.

Number of transfers

Number of transfers to GPU (Cholesky fact.)

		Number of transfers to GPU memory - Version 1
10000		Number of transfers to GPU memory - Version 2
10000		Number of transfers to GPU memory - Version 3
		Number of transfers to GPU memory - Version 3
		Number of transfers to GPU memory - Version 4
8000
6000
4000
2000
0
0	2000	4000	6000	8000	10000	12000	14000	16000	18000	20000

Matrix size

Number of transfers

Number of transfers to main memory (Cholesky fact.)

		Number of transfers to main memory - Version 1
3500		Number of transfers to main memory - Version 2
3500		Number of transfers to main memory - Version 3
		Number of transfers to main memory - Version 4
3000
2500
2000
1500
1000
500
0
0	2000	4000	6000	8000	10000	12000	14000	16000	18000	20000

Matrix size

Figure 5.8: Number of data transfers for the Cholesky factorization using 4 GPUs on TESLA2. Block size is b = 768.

149

CHAPTER 5. MATRIX COMPUTATIONS ON MULTI-GPU SYSTEMS

appropriate (versions 2 to 4). In this case, instead of raw performance attained by each one version, we analyze the number of data transfers to and from the GPU memory. We consider the number of writes (transfers to GPU memory) and reads (transfers to main memory) for each runtime version. Thus, the aim of the experiment is to evaluate the impact on the number of data transfers, and to demonstrate how each improvement in the runtime a ects the writes or the reads of data (or even both).

Figure 5.8 reports the amount of data transfers for the experiment described above. The lefthand side plot shows the amount of writes for each implementation of the runtime; the plot on the right-hand side shows the corresponding number of reads.

Let us focus ﬁrst on the left plot. The number of writes is dramatically reduced by the application of two di erent techniques:

Data-a nity and the concept of own and alien blocks in version 2 of the runtime greatly reduces the amount of necessary data transfers from main memory to GPU memory for those blocks considered as property of the assigned GPU. As an example, consider the reduction in the number of data writes for a matrix size of n = 20,480. In this case, data transfers from main to GPU memory are reduced from 10,179 in the basic runtime implementation to 4,590 in the implementation introducing data-a nity, saving thus more than half of the total data transfers.

The usage of a cache of alien blocks in the GPU further decreases the number of data writes, as non-proprietary blocks that are read once by a given GPU do not have to be transferred again if used in a future instruction (unless they are invalidated by the proprietary of the block). This improvement is shared by versions 3 and 4 of the runtime, as the way caches work in both versions is common from the point of view of the data writes. For a matrix size n = 20,480, the number of writes is reduced from 10,179 to 1,106.

To sum up, the introduction of data-a nity and a software cache can reduce the amount of transfers from main memory to GPU memory in a factor 9× for the tested matrix sizes.

An inspection of data reads reveals that the beneﬁts are similar. In this case, no transfer reduction is introduced in the three ﬁrst versions of the runtime. However, the introduction of write-back as the coherency policy between main memory and GPU memory in version 4 reduces the amount of reads dramatically. For instance, for a matrix size n = 20,480, the amount of reads from main to GPU memory is reduced from 3,627 in versions 1-3 to 377 in version 4 of the runtime.

5.4.3.Performance and scalability

The progressive reduction of the amount of data transfers derived from the improvements introduced in the run-time system yields direct gains in the performance of the implemented linear algebra routines. To evaluate how this reduction impacts the ﬁnal performance of the parallel executions, we show the rates attained for the Cholesky factorization using the four GPUs of TESLA2. For this experiment, all BLAS routines are mapped to NVIDIA CUBLAS calls. The factorization of the diagonal blocks is performed in the multi-core processor, following the hybrid approach introduced in Chapter 4, as the optimal block size is small and this particular operation is more suited for this type of processor.

We propose an evaluation of the performance and scalability of the three algorithmic variants for the Cholesky factorization analogous to those shown in Chapter 4 (see Figure 4.1). We evaluate how the successive improvements introduced in the transfers management a ect the performance of each algorithmic variant.

150

<<< < Предыдущая 21 22 23 24 25 26 27 28 29 30 31 3233 / 4733 34 35 36 37 38 39 40 41 42 43 44 45 > Следующая >>>

Соседние файлы в предмете [НЕСОРТИРОВАННОЕ]

#
22.03.20161.06 Mб224MATER_3.doc
#
18.11.2019295.42 Кб0MATLAB-1.doc
#
19.11.2019203.78 Кб0MATLAB-2.doc
#
09.02.20153.49 Mб22MATLAB-3.doc
#
09.02.2015344.3 Кб10Matrices.pdf
#
22.03.20162.18 Mб14MatrixCUDAFranDissertation.pdf
#
21.09.2019139.22 Кб2matved.docx
#
24.04.201933.9 Mб2maximum.docx
#
09.02.2015360.31 Кб63MA_1_пособие.pdf
#
09.02.201534.57 Mб8MA_Kudriav1.pdf
#
09.02.201526.97 Mб11MA_Kudriav2.pdf