

 

                   Version 1  Version 2                 Version 3                  Version 4
-----------------------------------------------------------------------------------------------------
Techniques         -          Data-affinity             Data-affinity              Data-affinity
                              Own/alien blocks          Own/alien blocks           Own/alien blocks
                              Write-through             Write-through              Write-back
-----------------------------------------------------------------------------------------------------
Software cache     -          -                         Alien blocks cached        Alien blocks cached
                                                        Write-invalidate           Write-invalidate
-----------------------------------------------------------------------------------------------------
Benefits           -          Reduce number of writes   Reduce number of writes    Reduce number of reads
                              to GPU (own blocks)       to GPU (alien blocks)      from GPU

Table 5.6: Summary of the techniques and benefits introduced by the successive improvements in the runtime.

Version 2: Static data layout combined with write-through. The number of writes of own blocks is reduced as own blocks already transferred to a GPU are not transferred there again by future instructions.

Version 3: Software-cache of alien blocks. The number of writes of alien blocks is reduced by keeping them in a cache managed in software.

Version 4: Software-cache of alien blocks with write-back. The number of reads of own blocks is reduced by maintaining coherence between main memory and the GPU memories only when strictly necessary.

As an additional advantage, the improvements introduced in the runtime and the design of the runtime itself are independent of the underlying architecture. Assume a system consisting of a machine with a central memory pool and multiple hardware accelerators connected to it, each one with an independent memory space. With this setup, the techniques described above and their benefits can be implemented with minor modifications.
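To make the interplay of these techniques concrete, the following C sketch outlines the kind of per-GPU bookkeeping that data-affinity, a software cache of alien blocks with write-invalidate, and a write-back policy require. It is only an illustration under simplified assumptions (a fixed-capacity cache, a caller-supplied transfer routine, illustrative names throughout); it is not the actual runtime implementation.

/* Minimal sketch of the block-cache bookkeeping described above.
 * All names are illustrative, not the runtime's actual API. One cache
 * descriptor exists per GPU; a global owner map (data-affinity) decides
 * which GPU owns each block. */

#include <stdbool.h>

#define MAX_CACHED 256

typedef struct {
    int   block_id;     /* which matrix block is cached                 */
    void *dev_ptr;      /* copy of the block in this GPU's memory       */
    bool  valid;        /* false after a write-invalidate               */
} cache_entry;

typedef struct {
    cache_entry entries[MAX_CACHED];
    int n;              /* number of entries currently in use           */
} gpu_cache;

/* Read an alien block; transfer it only on a miss (versions 3 and 4). */
void *read_alien_block(gpu_cache *c, int block_id,
                       void *(*transfer_to_gpu)(int))
{
    for (int i = 0; i < c->n; i++)
        if (c->entries[i].block_id == block_id && c->entries[i].valid)
            return c->entries[i].dev_ptr;      /* hit: no PCIe transfer */

    /* miss: copy the block from main memory and cache it */
    void *d = transfer_to_gpu(block_id);
    if (c->n < MAX_CACHED)
        c->entries[c->n++] = (cache_entry){ block_id, d, true };
    return d;
}

/* The owner updates a block: stale copies in the other caches are
 * invalidated (write-invalidate). */
void write_own_block(gpu_cache caches[], int ngpus, int owner, int block_id)
{
    for (int g = 0; g < ngpus; g++) {
        if (g == owner) continue;
        for (int i = 0; i < caches[g].n; i++)
            if (caches[g].entries[i].block_id == block_id)
                caches[g].entries[i].valid = false;
    }
    /* Write-back (version 4): the updated block stays in the owner's GPU
     * memory and is copied back to main memory only when the host or
     * another device actually needs it, not after every update. */
}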

5.4. Experimental results

5.4.1. Impact of the block size

The block size is a critical factor to attain high performance in algorithms-by-blocks. Independently of the techniques used to reduce data transfers, improve data affinity, etc., an incorrect choice of this parameter leads to sub-optimal performance. The first set of experiments aims at determining the optimal value for the block size, as well as its effect on the final performance of the parallel implementation.

In our case, the optimal value for the block size is a trade-off among a number of factors:

Potential parallelism: a smaller block size translates into a finer granularity and thus more concurrent sub-problems, whereas a larger block dimension leads to less concurrency. Given the pool of execution units available in the system, the block size greatly influences the degree of parallel execution on the sub-problems and the amount of idle time.

Data transfer efficiency: a small block size translates into a larger number of data transfers of reduced dimension. The latency of the PCI-Express bus plays a critical role in this case, and transfers of small blocks incur a significant performance penalty. Conversely, a larger block dimension translates into higher effective bandwidth, and thus benefits the final performance of the implementation.

Inner BLAS kernels performance: as shown in Chapter 3, the BLAS implementations on the GPU deliver higher performance when operating on relatively large matrices, such as those derived from the use of a large block size.

Taking into account the combination of these factors, the performance of the parallel implementations will be affected by the chosen block size from two perspectives:

1. If the block size is too small, data transfers become a serious bottleneck in the computation, and BLAS executions are inefficient on the GPU.

2. If the block size is too large, the degree of concurrency decreases and resources are wasted, even though data transfers through the bus and BLAS executions on the GPU are efficient.

In summary, a thorough analysis of the performance of the runtime system for a given linear algebra algorithm-by-blocks should show increasing performance as the block size grows (and thus, BLAS executions and data transfers become more efficient), attaining a peak when the potential parallelism in the algorithm reaches its optimal point. From this dimension on, parallelism decreases, and thus parallel efficiency should drop.

Consider, for example, the performance results in Figure 5.7. The figure shows the performance of the right-looking variant of the Cholesky factorization in GFLOPS (Y axis) for the most tuned version of the runtime implementation for several matrix sizes (Z axis), with the block size (X axis) increasing from a relatively small size (b = 128) to a larger one (b = 1600). In this example, the four GPUs available on TESLA2 were used, following a 2-D data layout. Similar results have been observed for the rest of the runtime implementations and data-affinity options. The figure offers information on the optimal block size for each matrix size.

As predicted, the behavior of the performance curves for each matrix size is quite regular and predictable. In each case, performance increases as the block size grows until it reaches an optimal value; from this point on, the performance rate drops. These results confirm the assumptions made about BLAS and data transfer performance and about potential parallelism. In fact, note how the optimal block size (shown in the figure as a green line) also increases as the matrix size grows. Thus, taking into account the potential parallelism factor alone, the optimal block size partially depends on the total matrix dimension.

For the rest of the experimental results in this chapter, performance values for the optimal block size are shown. A complete analysis of the performance attained using a wide range of block sizes has been carried out for each operation and runtime version.
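As an illustration of how such an analysis can be organized, the following C sketch sweeps a set of block sizes for each matrix dimension and reports the best-performing one. The routine run_cholesky_by_blocks is a hypothetical stand-in for invoking the runtime (stubbed out here so the sketch is self-contained), the candidate sizes mirror those of Figure 5.7, and the GFLOPS figure uses the standard n^3/3 flop count of the Cholesky factorization.

#include <stdio.h>

/* Hypothetical stand-in: in the real experiments this would execute the
 * algorithm-by-blocks on the GPUs and return the elapsed time in seconds.
 * Stubbed with a dummy value so the sketch compiles and runs. */
static double run_cholesky_by_blocks(int n, int b)
{
    (void)n; (void)b;
    return 1.0;
}

int main(void)
{
    const int sizes[]  = { 4096, 8192, 12288, 16384, 20480 };
    const int blocks[] = { 128, 384, 640, 896, 1152, 1408, 1600 };

    for (size_t i = 0; i < sizeof sizes / sizeof sizes[0]; i++) {
        int    best_b      = 0;
        double best_gflops = 0.0;
        for (size_t j = 0; j < sizeof blocks / sizeof blocks[0]; j++) {
            int    n = sizes[i], b = blocks[j];
            double t = run_cholesky_by_blocks(n, b);
            double gflops = ((double)n * n * n / 3.0) / t / 1e9;
            if (gflops > best_gflops) { best_gflops = gflops; best_b = b; }
        }
        printf("n = %5d: optimal block size b = %4d (%.1f GFLOPS)\n",
               sizes[i], best_b, best_gflops);
    }
    return 0;
}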

5.4.2. Number of data transfers

Once the optimal block size is found and fixed for a given matrix size, the benefits of the different implementations of the runtime are mostly dictated by the number of data transfers necessary to perform a given operation. In fact, reducing the number of data transfers is the main goal of the successive improvements in the runtime implementation.

To demonstrate the benefits of the incremental improvements introduced from a basic runtime version, the next experiment evaluates the performance of the implementation of the algorithm-by-blocks for the Cholesky factorization (right-looking variant) on the four GPUs of TESLA2, for a fixed block size (b = 768), using the four versions of the runtime, and a 2-D data layout when appropriate (versions 2 to 4).


[Plot: Cholesky performance varying the block size on TESLA2 — GFLOPS against matrix size and block size, with the optimal block size for each matrix size highlighted.]

 

 

Figure 5.7: Performance of the Cholesky factorization using the 4 GPUs in TESLA2, varying the block size. Runtime version 4.

[Plots: number of transfers to GPU memory (left) and number of transfers to main memory (right) for the Cholesky factorization, runtime versions 1 to 4, against matrix size.]

Figure 5.8: Number of data transfers for the Cholesky factorization using 4 GPUs on TESLA2. Block size is b = 768.


In this case, instead of the raw performance attained by each version, we analyze the number of data transfers to and from the GPU memory. We consider the number of writes (transfers to GPU memory) and reads (transfers to main memory) for each runtime version. Thus, the aim of the experiment is to evaluate the impact on the number of data transfers, and to demonstrate how each improvement in the runtime affects the writes or the reads of data (or even both).
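One simple way to gather such counters, assuming every transfer issued by the runtime is funnelled through a single routine, is to wrap cudaMemcpy and tally the two directions separately. The wrapper and counter names below are illustrative and do not correspond to the instrumentation actually used in the runtime.

#include <cuda_runtime.h>
#include <stdio.h>

static long long writes_to_gpu  = 0;   /* host -> device block copies */
static long long reads_from_gpu = 0;   /* device -> host block copies */

/* Thin wrapper: count the transfer, then perform the actual copy. */
cudaError_t counted_memcpy(void *dst, const void *src, size_t bytes,
                           enum cudaMemcpyKind kind)
{
    if (kind == cudaMemcpyHostToDevice)      writes_to_gpu++;
    else if (kind == cudaMemcpyDeviceToHost) reads_from_gpu++;
    return cudaMemcpy(dst, src, bytes, kind);
}

void report_transfers(void)
{
    printf("transfers to GPU memory:  %lld\n", writes_to_gpu);
    printf("transfers to main memory: %lld\n", reads_from_gpu);
}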

Figure 5.8 reports the number of data transfers for the experiment described above. The left-hand side plot shows the number of writes for each implementation of the runtime; the plot on the right-hand side shows the corresponding number of reads.

Let us focus first on the left plot. The number of writes is dramatically reduced by the application of two different techniques:

Data-affinity and the concept of own and alien blocks in version 2 of the runtime greatly reduce the number of data transfers from main memory to GPU memory for those blocks considered the property of the assigned GPU. As an example, consider the reduction in the number of data writes for a matrix size of n = 20,480. In this case, data transfers from main to GPU memory are reduced from 10,179 in the basic runtime implementation to 4,590 in the implementation introducing data-affinity, thus saving more than half of the total data transfers.

The usage of a cache of alien blocks in the GPU further decreases the number of data writes, as non-proprietary blocks that are read once by a given GPU do not have to be transferred again if used by a future instruction (unless they are invalidated by the owner of the block). This improvement is shared by versions 3 and 4 of the runtime, as the cache behaves identically in both versions from the point of view of data writes. For a matrix size n = 20,480, the number of writes is reduced from 10,179 to 1,106.

To sum up, the introduction of data-affinity and a software cache can reduce the number of transfers from main memory to GPU memory by a factor of 9× for the tested matrix sizes.

An inspection of the data reads reveals similar benefits. In this case, no transfer reduction is introduced in the first three versions of the runtime. However, the introduction of write-back as the coherence policy between main memory and GPU memory in version 4 reduces the number of reads dramatically. For instance, for a matrix size n = 20,480, the number of reads from GPU memory to main memory is reduced from 3,627 in versions 1-3 to 377 in version 4 of the runtime.

5.4.3. Performance and scalability

The progressive reduction of the number of data transfers derived from the improvements introduced in the runtime system yields direct gains in the performance of the implemented linear algebra routines. To evaluate how this reduction impacts the final performance of the parallel executions, we show the rates attained for the Cholesky factorization using the four GPUs of TESLA2. For this experiment, all BLAS routines are mapped to NVIDIA CUBLAS calls. The factorization of the diagonal blocks is performed on the multi-core processor, following the hybrid approach introduced in Chapter 4, as the optimal block size is small and this particular operation is better suited to this type of processor.
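A minimal sketch of this hybrid dispatch, under the assumption of column-major storage, the lower-triangular case, and blocks already resident in the appropriate memory space, is the following: the diagonal block is factorized on the CPU via LAPACK, while the symmetric rank-b update of the trailing submatrix in the right-looking variant is issued through CUBLAS. Only the LAPACKE and CUBLAS calls are real library routines; the surrounding function names are illustrative, and the triangular solve and the task scheduling performed by the runtime are omitted.

#include <cublas_v2.h>
#include <lapacke.h>

/* Factorize the b x b diagonal block A_kk on the multi-core CPU. */
static int chol_diag_on_cpu(double *Akk, int b)
{
    return LAPACKE_dpotrf(LAPACK_COL_MAJOR, 'L', b, Akk, b);
}

/* Trailing update A_ii := A_ii - A_ik * A_ik^T on the GPU that owns A_ii
 * (d_Aik and d_Aii are device pointers to b x b blocks). */
static void syrk_on_gpu(cublasHandle_t handle,
                        const double *d_Aik, double *d_Aii, int b)
{
    const double alpha = -1.0, beta = 1.0;
    cublasDsyrk(handle, CUBLAS_FILL_MODE_LOWER, CUBLAS_OP_N,
                b, b, &alpha, d_Aik, b, &beta, d_Aii, b);
}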

We propose an evaluation of the performance and scalability of the three algorithmic variants of the Cholesky factorization analogous to that shown in Chapter 4 (see Figure 4.1). We evaluate how the successive improvements introduced in the management of data transfers affect the performance of each algorithmic variant.

