

consider the necessary changes in the Cholesky factorization routine shown in Figure 6.11. A hypothetical driver would include the appropriate adapted code to allocate, manage and free objects bound to the GPU memory space.
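As an illustration, such a driver could be sketched as follows. This is only a sketch: the routines prefixed with PLAG_ are hypothetical names standing in for the adapted allocation, management and deallocation code mentioned above, and the standard PLAPACK set-up calls are shown schematically (their exact signatures may differ).

/* Hypothetical driver sketch: routines prefixed with PLAG_ are made-up
 * names used only to illustrate the kind of adapted code referred to in
 * the text (objects bound to the GPU memory space); they are not part of
 * PLAPACK or of the library developed in this thesis.                   */
#include <mpi.h>
#include "PLA.h"                        /* PLAPACK header (assumed)       */

void chol_driver_on_gpus( int n, int nb_distr, MPI_Comm comm )
{
  PLA_Template templ = NULL;
  PLA_Obj      A     = NULL;

  PLA_Init( comm );                         /* PLAPACK initialization     */
  PLA_Temp_create( nb_distr, 0, &templ );   /* distribution template      */

  /* Hypothetical: create a distributed matrix whose local blocks are
     allocated directly in the memory of the GPU bound to each process.   */
  PLAG_Matrix_create_on_device( MPI_FLOAT, n, n, templ, &A );

  /* ... fill A with the application data (host-to-device transfers) ...  */

  /* Hypothetical device-aware factorization: the computation proceeds
     with the data resident in the GPU memory spaces.                     */
  PLAG_Chol( PLA_LOWER_TRIANGULAR, A );

  PLAG_Obj_free( &A );                  /* hypothetical: frees GPU storage */
  PLA_Temp_free( &templ );
  PLA_Finalize();
}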

6.6. Experimental results

The goal of the experiments in this section is twofold. First, to report the raw performance that can be attained with the accelerated version of the PLAPACK library for two common linear algebra operations: the matrix-matrix multiplication and the Cholesky factorization. Second, to illustrate the scalability of the proposed solution by executing the tests on a moderate number of GPUs.

LONGHORN is a hybrid CPU/GPU cluster designed for remote visualization and data analysis. The system consists of 256 dual-socket nodes, with a total of 2,048 compute cores (Intel Xeon Nehalem QuadCore), 512 GPUs (128 NVIDIA Quadro Plex S4s, each containing 4 NVIDIA Quadro FX5800 boards), 13.5 TBytes of distributed memory and a 210 TBytes global file system. The detailed specifications of the cluster are given in Table 1.2.

In our experiments we only employ 16 compute nodes of LONGHORN. The results for 1, 2, 4, 8 and 16 GPUs use one of the GPUs in each node, with one MPI process per compute node. The results on 32 GPUs (there are two GPUs per node) were obtained using two MPI processes per node. This setup also illustrates the ability of the accelerated version of PLAPACK to deal with systems equipped with more than one accelerator per node (e.g., configurations in which the nodes are connected to NVIDIA Tesla S1070 servers, such as the TESLA2 machine evaluated in Chapter 5). Only the results obtained with optimal values of the distribution and algorithmic block sizes are shown here.
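As an illustration of this setup, the following minimal sketch (not taken from the thesis code) shows one common way of binding each MPI process to one of the GPUs of its node. It assumes an MPI-3 implementation, so that MPI_Comm_split_type with MPI_COMM_TYPE_SHARED groups the ranks running on the same node.

/* Minimal sketch: bind each MPI rank to one of the GPUs in its node.
 * Assumes an MPI-3 library, so that MPI_COMM_TYPE_SHARED groups the
 * ranks placed on the same node.                                      */
#include <mpi.h>
#include <cuda_runtime.h>
#include <stdio.h>

int main( int argc, char **argv )
{
  MPI_Comm node_comm;
  int      node_rank, num_devices;

  MPI_Init( &argc, &argv );

  /* Ranks that share a node end up in the same communicator. */
  MPI_Comm_split_type( MPI_COMM_WORLD, MPI_COMM_TYPE_SHARED, 0,
                       MPI_INFO_NULL, &node_comm );
  MPI_Comm_rank( node_comm, &node_rank );

  /* Map the node-local rank to one of the GPUs of the node
     (one or two GPUs per node in the experiments above).    */
  cudaGetDeviceCount( &num_devices );
  cudaSetDevice( node_rank % num_devices );

  printf( "node-local rank %d -> GPU %d\n",
          node_rank, node_rank % num_devices );

  /* ... library initialization and computation would follow here ... */

  MPI_Comm_free( &node_comm );
  MPI_Finalize();
  return 0;
}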

The matrix-matrix multiplication is frequently abused as a showcase of the highest attainable performance of a given target architecture. Following this trend, we have developed an implementation of this operation based on the PLAPACK routine PLA_Gemm. The performance results for the matrix-matrix multiplication are shown in Figure 6.12. The highest performance achieved with the adapted version of the matrix multiplication routine is slightly over 8 TFLOPS when using 32 GPUs for matrices of dimension 61,440 × 61,440.
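Assuming the standard 2mnk flop count for GEMM (here m = n = k), the reported 8 TFLOPS at n = 61,440 correspond to a wall-clock time of roughly

\[
t \approx \frac{2\,n^{3}}{8 \times 10^{12}\ \mathrm{flops/s}}
  = \frac{2\,(61{,}440)^{3}}{8 \times 10^{12}} \approx 58\ \mathrm{s}.
\]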

A similar experiment has been carried out for the Cholesky factorization. In our implementation of the PLAPACK routine PLA_Chol, the factorization of the diagonal blocks is computed on the CPU using the general-purpose cores, while all remaining operations are performed on the GPUs. This hybrid strategy has been successfully applied in previous chapters and studies [22, 23, 142]. Figure 6.13 reports the performance of the Cholesky routine on LONGHORN, which delivers 4.4 TFLOPS for matrices of dimension 102,400 × 102,400.
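The following sketch illustrates the hybrid strategy on a single GPU; it is not the distributed PLAPACK code. The diagonal block is copied to the host and factorized with LAPACK, while the panel solve and the trailing update are performed on the GPU with cuBLAS. It assumes single precision, a lower-triangular factor, column-major storage, a matrix dA already resident in GPU memory (device-centric storage), and a Fortran LAPACK providing the spotrf_ symbol.

/* Minimal single-GPU sketch of the hybrid CPU/GPU Cholesky strategy. */
#include <stdlib.h>
#include <cuda_runtime.h>
#include <cublas_v2.h>

extern void spotrf_( const char *uplo, const int *n, float *A,
                     const int *lda, int *info );

int hybrid_chol_lower( cublasHandle_t handle, int n,
                       float *dA, int ldda, int nb )
{
  float *work = (float *) malloc( (size_t) nb * nb * sizeof(float) );
  const float one = 1.0f, minus_one = -1.0f;
  int info = 0;

  for ( int k = 0; k < n && info == 0; k += nb ) {
    int kb = ( n - k < nb ) ? ( n - k ) : nb;

    /* 1. Copy the kb x kb diagonal block to the host and factorize it
          there with the general-purpose cores.                         */
    cudaMemcpy2D( work, kb * sizeof(float),
                  dA + k + (size_t) k * ldda, ldda * sizeof(float),
                  kb * sizeof(float), kb, cudaMemcpyDeviceToHost );
    spotrf_( "L", &kb, work, &kb, &info );
    cudaMemcpy2D( dA + k + (size_t) k * ldda, ldda * sizeof(float),
                  work, kb * sizeof(float),
                  kb * sizeof(float), kb, cudaMemcpyHostToDevice );

    if ( info == 0 && k + kb < n ) {
      int m = n - k - kb;
      /* 2. Triangular solve for the panel below the diagonal block (GPU). */
      cublasStrsm( handle, CUBLAS_SIDE_RIGHT, CUBLAS_FILL_MODE_LOWER,
                   CUBLAS_OP_T, CUBLAS_DIAG_NON_UNIT, m, kb, &one,
                   dA + k + (size_t) k * ldda, ldda,
                   dA + (k + kb) + (size_t) k * ldda, ldda );
      /* 3. Symmetric rank-kb update of the trailing submatrix (GPU). */
      cublasSsyrk( handle, CUBLAS_FILL_MODE_LOWER, CUBLAS_OP_N, m, kb,
                   &minus_one, dA + (k + kb) + (size_t) k * ldda, ldda,
                   &one, dA + (k + kb) + (size_t) (k + kb) * ldda, ldda );
    }
  }
  free( work );
  return info;   /* 0 on success, as in LAPACK xPOTRF */
}

In the distributed routine each of these three operations maps onto its PLAPACK counterpart, but the division of labor between the general-purpose cores and the GPUs is the same.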

A quick comparison between the top performance of the matrix-matrix multiplication using the accelerated version of PLAPACK (8 TFLOPS) and the (theoretical) peak performance of the CPUs of the entire system (41.40 TFLOPS in single-precision arithmetic) reveals the advantages of exploiting the capabilities of the GPUs: using only 6% of the graphics processors available in the cluster (32 out of 512 GPUs), it is possible to attain 20% of the aggregate peak performance of all the CPUs in the machine.
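The percentages quoted above follow directly from the reported figures:

\[
\frac{32}{512} = 6.25\,\% \approx 6\,\%, \qquad
\frac{8\ \mathrm{TFLOPS}}{41.40\ \mathrm{TFLOPS}} \approx 19.3\,\% \approx 20\,\%.
\]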

Figure 6.14 illustrates the scalability of the device-centric implementation of the matrix-matrix multiplication routine. Compared with the performance of the "serial" tuned implementation of the matrix-matrix product routine in NVIDIA CUBLAS (running on a single GPU), our routine achieves a 22× speedup on 32 GPUs, which demonstrates the scalability of the solution.


Figure 6.12: Performance of the device-centric implementation of GEMM on 32 GPUs of LONGHORN. (GFLOPS versus matrix size, m = n = k; one curve per configuration with 1, 2, 4, 8, 16 and 32 Quadro FX5800 GPUs.)

Figure 6.13: Performance of the device-centric implementation of the Cholesky factorization on 32 GPUs of LONGHORN. (GFLOPS versus matrix size; one curve per configuration with 1, 2, 4, 8, 16 and 32 Quadro FX5800 GPUs.)


Figure 6.14: Speed-up of the device-centric GEMM implementation on LONGHORN. (Speed-up versus matrix size, m = n = k; one curve per configuration with 1, 2, 4, 8, 16 and 32 Quadro FX5800 GPUs.)

Two main reasons account for the performance penalty: the InfiniBand network and the PCI-Express bus (the latter mainly when 32 GPUs/16 nodes are employed, as the bus is shared by the two GPUs in each node).
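Viewed as parallel efficiency relative to the single-GPU CUBLAS implementation, the 22× speedup on 32 GPUs amounts to

\[
E = \frac{22}{32} \approx 0.69,
\]

that is, roughly 69%, with the remaining loss attributable to the communication costs just mentioned.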

The plots in Figure 6.15 compare the performance of the original PLAPACK implementation and the GPU-accelerated version of the library for the matrix-matrix multiplication (left-side plot) and the Cholesky factorization (right-side plot). Results are shown for 16 nodes of LONGHORN only. Thus, the plots report the performance of the PLAPACK implementation using 128 Intel Nehalem cores (8 cores per node) versus that of the accelerated library using 16 and 32 GPUs (that is, one or two GPUs per node). For the matrix-matrix multiplication, the highest performance for PLAPACK is 2.3 TFLOPS, while the accelerated version of the library attains 3.6 TFLOPS with 16 GPUs and 6.4 TFLOPS with 32 GPUs. The speedups obtained by the accelerated routines are 1.6× and 2.8×, respectively. For the Cholesky factorization, the PLAPACK routine attains a peak performance of 1.6 TFLOPS, compared with the 2.6 and 4.5 TFLOPS achieved by the accelerated versions on 16 and 32 GPUs; the corresponding speedups are 1.7× and 2.8×, respectively.

Note how the performance of the CPU-based version of PLAPACK is higher for matrices of small dimension. This behavior is usual in GPGPU algorithms, and has already been observed in other works [22, 142] and in previous chapters. In response to this, hybrid algorithms combining CPU and GPU execution are proposed in those works. In the PLAPACK library, however, the implementation of hybrid algorithms would require a deep modification of the internals of the library. Alternatively, other approaches to multi-GPU computing can be integrated into the library to better exploit the heterogeneous resources in each node of the cluster: independent runtime systems such as StarPU [13], or proposals by the author of this thesis, such as the runtime introduced in Chapter 5 or GPUSs [16], could automatically deal with the heterogeneous nature of each node (multiple GPUs and general-purpose multi-core processors).

The benefits of keeping the data in the GPU memory during the whole computation are captured in Figure 6.16. The results report the performance of the host-centric approach, in which data are stored in the main memory of each node and transferred to the GPU only when strictly necessary, and the device-centric approach, in which data reside (most of the time) in the GPU memory. The advantages of the second approach, in terms of higher performance, are evident in the figure.
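The difference between the two approaches can be sketched, for a single local GEMM performed with CUBLAS, as follows. This is a minimal single-GPU illustration, not the PLAPACK-level implementation: the host-centric variant pays for the PCI-Express transfers of the three operands on every call, whereas the device-centric variant operates on matrices that were placed in the GPU memory once, at the beginning of the computation.

/* Illustration of the two storage approaches for a local GEMM,
 * C := C + A*B, in single precision with n x n column-major operands. */
#include <stdlib.h>
#include <cuda_runtime.h>
#include <cublas_v2.h>

/* Host-centric: the operands live in host memory, so every call pays for
 * the PCI-Express transfers of A, B and C in both directions.            */
void gemm_host_centric( cublasHandle_t h, int n,
                        const float *A, const float *B, float *C )
{
  size_t bytes = (size_t) n * n * sizeof(float);
  float *dA, *dB, *dC;
  const float one = 1.0f;

  cudaMalloc( (void **) &dA, bytes );
  cudaMalloc( (void **) &dB, bytes );
  cudaMalloc( (void **) &dC, bytes );
  cudaMemcpy( dA, A, bytes, cudaMemcpyHostToDevice );
  cudaMemcpy( dB, B, bytes, cudaMemcpyHostToDevice );
  cudaMemcpy( dC, C, bytes, cudaMemcpyHostToDevice );

  cublasSgemm( h, CUBLAS_OP_N, CUBLAS_OP_N, n, n, n,
               &one, dA, n, dB, n, &one, dC, n );

  cudaMemcpy( C, dC, bytes, cudaMemcpyDeviceToHost );
  cudaFree( dA );  cudaFree( dB );  cudaFree( dC );
}

/* Device-centric: the operands were placed in the GPU memory once, at the
 * beginning of the computation, so the call involves no transfers at all. */
void gemm_device_centric( cublasHandle_t h, int n,
                          const float *dA, const float *dB, float *dC )
{
  const float one = 1.0f;
  cublasSgemm( h, CUBLAS_OP_N, CUBLAS_OP_N, n, n, n,
               &one, dA, n, dB, n, &one, dC, n );
}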


Figure 6.15: Performance of the device-centric implementation of GEMM (left) and the Cholesky factorization (right) compared with that of PLAPACK on 128 cores of LONGHORN. (GFLOPS versus matrix size; curves for 32 Quadro FX5800, 16 Quadro FX5800 and 128 Intel Xeon Nehalem cores.)

Figure 6.16: Cholesky factorization and GEMM on 16 nodes of LONGHORN (32 GPUs), using the host-centric and device-centric storage approaches for the accelerated implementation. (GFLOPS versus matrix size, m = n = k; curves for device-centric and host-centric GEMM and Cholesky.)
