
7.2. Related publications

with clear indications that a better performance/power ratio can be achieved using more specialized hardware (accelerators), such as SSE-based units or GPUs, clearly deviating from the easy-to-understand shared-memory homogeneous architectures. This paper investigates whether OpenMP could still survive in this new scenario and proposes a possible way to extend the current specification to reasonably integrate heterogeneity while preserving simplicity and portability. The paper leverages a previous proposal that extended tasking with dependencies. The runtime is in charge of data movement, task scheduling based on these data dependencies, and the appropriate selection of the target accelerator depending on the system configuration and resource availability.

AYGUADÉ, E., BADIA, R., BELLENS, P., CABRERA, D., DURAN, A., FERRER, R., GONZÁLEZ, M., IGUAL, F., JIMÉNEZ-GONZÁLEZ, D., LABARTA, J., MARTINELL, L., MARTORELL, X., MAYO, R., PÉREZ, J., PLANAS, J., AND QUINTANA-ORTÍ, E. Extending OpenMP to survive the heterogeneous multi-core era. International Journal of Parallel Programming 38 (2010), 440–459.

This paper advances the state of the art in programming models for exploiting task-level parallelism on heterogeneous many-core systems, presenting a number of extensions to the OpenMP language inspired by the StarSs programming model. The proposed extensions allow the programmer to easily write portable code for a number of different platforms, relieving him/her from developing the specific code to off-load tasks to the accelerators and to synchronize those tasks. Our results obtained from the StarSs instantiations for SMPs, the Cell, and GPUs report reasonable parallel performance. However, the real impact of our approach lies in the productivity gains it yields for the programmer.
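
As an illustration of the flavor of these extensions, the following sketch annotates a tiled matrix-matrix product. The clause spellings (target device, input/inout, copy_deps) are illustrative of the StarSs/OmpSs family of proposals rather than ratified OpenMP syntax, and the tiled data layout is a hypothetical assumption:

```c
/* Sketch only: clause names follow the StarSs/OmpSs-style proposals
 * discussed above, not standard OpenMP. */

/* The runtime infers dependencies from the input/inout clauses and may
 * off-load the task to the accelerator named in the target clause,
 * moving the operand tiles as needed. */
#pragma omp target device(cuda) copy_deps
#pragma omp task input([n*n]A, [n*n]B) inout([n*n]C)
void gemm_tile(float *A, float *B, float *C, int n);

/* Hypothetical tiled layout: nt x nt tiles of order n each. */
void blocked_gemm(float *A[], float *B[], float *C[], int nt, int n)
{
    for (int i = 0; i < nt; i++)
        for (int j = 0; j < nt; j++)
            for (int k = 0; k < nt; k++)
                gemm_tile(A[i*nt + k], B[k*nt + j], C[i*nt + j], n);
    #pragma omp taskwait   /* wait for all generated tasks */
}
```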

Chapter 6. Matrix computations on clusters of GPUs

In [65], we present the port of the PLAPACK infrastructure to clusters of GPUs:

FOGUE, M., IGUAL, F. D., QUINTANA-ORTÍ, E. S., AND VAN DE GEIJN, R. A. Retargeting PLAPACK to clusters with hardware accelerators. In International Conference on High Performance Computing and Simulation (HPCS 2010) (2010), pp. 444–451.

Hardware accelerators are becoming a highly appealing approach to boost the raw performance as well as the price-performance and power-performance ratios of current clusters. In this paper we present a strategy to retarget PLAPACK, a library initially designed for clusters of nodes equipped with general-purpose processors and a single address space per node, to clusters equipped with graphics processors (GPUs). In our approach, data are kept in the device memory and only retrieved to main memory when they have to be communicated to a different node. Here we benefit from the object-based orientation of PLAPACK, which allows all communication between host and device to be embedded within a pair of routines, providing a clean abstraction that enables an efficient and direct port of all the contents of the library. Our experiments in a cluster consisting of 16 nodes with two NVIDIA Quadro FX5800 GPUs each show the performance of our approach.
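
A minimal sketch of this design point, assuming hypothetical routine names and plain blocking MPI (PLAPACK's actual interface differs): all host-device traffic is confined to one send/receive pair, so the rest of the library can operate on device-resident data:

```c
#include <cuda_runtime.h>
#include <mpi.h>

/* Bring a device-resident block to the host only when it must travel. */
static void send_block(const double *d_block, double *h_buf,
                       size_t n, int dest, MPI_Comm comm)
{
    cudaMemcpy(h_buf, d_block, n * sizeof(double), cudaMemcpyDeviceToHost);
    MPI_Send(h_buf, (int)n, MPI_DOUBLE, dest, 0, comm);
}

/* Receive into a host buffer and push straight back to device memory. */
static void recv_block(double *d_block, double *h_buf,
                       size_t n, int src, MPI_Comm comm)
{
    MPI_Recv(h_buf, (int)n, MPI_DOUBLE, src, 0, comm, MPI_STATUS_IGNORE);
    cudaMemcpy(d_block, h_buf, n * sizeof(double), cudaMemcpyHostToDevice);
}
```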


7.2.2. Publications indirectly related to the thesis topics

In relation to dense linear algebra implementations on systems with one or multiple GPUs, parallel research has been carried out on out-of-core computations using hardware accelerators.



In these publications, we explore the possibility of solving large dense linear systems stored on disk, accelerating the in-core calculations with graphics processors. Those local routines are based on the BLAS implementations proposed in Chapter 3. The work in [40] presents a MATLAB/OCTAVE interface to accelerate out-of-core linear algebra computations using hardware accelerators. In [60] we propose a novel strategy to efficiently virtualize graphics processors on high-performance clusters:

QUINTANA-ORTÍ, G., IGUAL, F., MARQUÉS, M., QUINTANA-ORTÍ, E. S., AND VAN DE GEIJN, R. Programming OOC matrix algorithms-by-tiles on multithreaded architectures. ACM Trans. Math. Softw. (Submitted).

CASTILLO, M., IGUAL, F. D., MARQUÉS, M., MAYO, R., QUINTANA-ORTÍ, E. S., QUINTANA-ORTÍ, G., RUBIO, R., AND VAN DE GEIJN, R. A. Out-of-core solution of linear systems on graphics processors. International Journal of Parallel, Emergent and Distributed Systems 24, 6 (2009), 521–538.

DUATO, J., IGUAL, F. D., MAYO, R., PEÑA, A. J., QUINTANA-ORTÍ, E. S., AND SILLA, F. An efficient implementation of GPU virtualization in high performance clusters. In Euro-Par Workshops (2009), pp. 385–394.


7.2.3. Other publications

Image processing is a discipline in which GPUs have historically delivered near-optimal performance. As an orthogonal research line, several publications in this field were produced during the development of this thesis. These publications take a lower-level approach, presenting fine-grained optimizations and ad-hoc improvements for current GPUs applied to biomedical image processing. We list some of the most important publications in this area:

IGUAL, F., MAYO, R., HARTLEY, T., CATALYUREK, U., RUIZ, A., AND UJALDON, M. Color and texture analysis on emerging parallel architectures. Journal of High Performance Computing Applications (2010). (Published online).

HARTLEY, T. D., CATALYUREK, U., RUIZ, A., IGUAL, F., MAYO, R., AND UJALDON, M. Biomedical image analysis on a cooperative cluster of GPUs and multicores. In Proceedings of the 22nd Annual International Conference on Supercomputing (New York, NY, USA, 2008), ICS '08, ACM, pp. 15–25.


IGUAL, F., MAYO, R., HARTLEY, T., CATALYUREK, U., RUIZ, A., AND UJALDON, M. Exploring the GPU for enhancing parallelism on color and texture analysis. In From Multicores and GPUs to Petascale, 14th International Conference on Parallel Computing (ParCo 2009) (2010), vol. 19 of Advances in Parallel Computing, IOS Press, pp. 299–306.


IGUAL, F., MAYO, R., HARTLEY, T., CATALYUREK, U., RUIZ, A., AND UJALDON, M. Optimizing co-occurrence matrices on graphics processors using sparse representations. In 9th International Workshop on State-of-the-Art in Scientific and Parallel Computing (PARA 2008). (To appear in Lecture Notes in Computer Science).

7.3. Software efforts and technological transfer

The insights and efforts in the framework of this thesis have been translated into software products and collaborations with companies. The software efforts include the release of libflame [149] as an open-source library at the disposal of the scientific community.


Operation                                                      LAPACK name  libflame name  Types
--------------------------------------------------------------------------------------------------
Level-3 BLAS
General matrix-matrix multiply                                 ?gemm        Gemm           sdcz
Hermitian matrix-matrix multiply                               ?hemm        Hemm           sdcz
Hermitian rank-k update                                        ?herk        Herk           sdcz
Hermitian rank-2k update                                       ?her2k       Her2k          sdcz
Symmetric matrix-matrix multiply                               ?symm        Symm           sdcz
Symmetric rank-k update                                        ?syrk        Syrk           sdcz
Symmetric rank-2k update                                       ?syr2k       Syr2k          sdcz
Triangular matrix multiply                                     ?trmm        Trmm           sdcz
Triangular solve with multiple right-hand sides                ?trsm        Trsm           sdcz
LAPACK-level
Cholesky factorization                                         ?potrf       Chol           sdcz
LU factorization without pivoting                              N/A          LU nopiv       sdcz
LU factorization with partial pivoting                         ?getrf       LU piv         sdcz
LU factorization with incremental pivoting                     N/A          LU incpiv      sdcz
QR factorization (via UT Householder transforms)               ?geqrf       QR UT          sdcz
QR factorization (via incremental UT Householder trans.)       N/A          QR UT inc      sdcz
LQ factorization (via UT Householder transforms)               ?gelqf       LQ UT          sdcz
Up-and-downdate Cholesky/QR factor                             N/A          UDdate UT      sdcz
Up-and-downdate Cholesky/QR factor                             N/A          UDdate UT inc  sdcz
  (via incremental UT Householder-like transforms)
Triangular matrix inversion                                    ?trtri       Trinv          sdcz
Triangular transpose matrix multiply                           ?lauum       Ttmm           sdcz
Symmetric/Hermitian positive definite inversion                ?potri       SPDinv         sdcz
Triangular Sylvester equation solve                            ?trsyl       Sylv           sdcz
Reduction from a symmetric/Hermitian definite generalized      [sd]sygst,   Eig gest       sdcz
  eigenproblem to standard form                                [cz]hegst
Reduction to upper Hessenberg form                             ?gehrd       Hess UT        sdcz
Reduction to tridiagonal form                                  [sd]sytrd,   Tridiag UT     sdcz
                                                               [cz]hetrd
Reduction to bidiagonal form                                   ?gebrd       Bidiag UT      sdcz

Table 7.1: Dense linear algebra operations supported by libflame. The library has full support for all four floating-point data types (single- and double-precision real, single- and double-precision complex); ? expands to one of {sdcz}.

As of its release date, the library was the only dense linear algebra software product with multi-GPU capabilities and wide functionality. It implements a large subset of the LAPACK functionality and a major part of the techniques illustrated in Chapter 5. A detailed description of the functionality of the libflame library can be found in Table 7.1.
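
As a usage illustration, the sketch below drives the Chol entry of Table 7.1 through the FLAME/C interface; object creation follows the libflame manual, and the initialization of the matrix entries is elided:

```c
#include "FLAME.h"

int main(void)
{
    FLA_Obj A;
    int     n = 1000;

    FLA_Init();
    /* Create an n x n double-precision object with default strides. */
    FLA_Obj_create(FLA_DOUBLE, n, n, 0, 0, &A);
    /* ... fill A with a symmetric positive definite matrix ... */
    FLA_Chol(FLA_LOWER_TRIANGULAR, A);   /* the Chol entry of Table 7.1 */
    FLA_Obj_free(&A);
    FLA_Finalize();
    return 0;
}
```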

The interest raised by this type of run-time system has translated into collaborations with, and awards from, well-known companies:

MICROSOFT RESEARCH showed interest in the three main research lines presented in this thesis (developments for single GPUs, multiple GPUs, and clusters of GPUs). As part of the agreement with the company, codes for BLAS on one and multiple GPUs, LAPACK-level routines on one and multiple GPUs, and similar routines for distributed-memory platforms will be integrated into libflame with the support of MICROSOFT. A commercial license of the library has been acquired by MICROSOFT as part of the agreement.

NVIDIA granted the HPC&A group the NVIDIA Professor Partnership Award 2008 for its work on multi-GPU systems. In addition, much of the graphics hardware used for the performance evaluations in this document was generously donated by the company in the framework of this collaboration.

PETAPATH, manufacturer of ClearSpeed boards [50], signed an agreement with the HPC&A group to demonstrate the adaptability of the multi-GPU system developed in this thesis to heterogeneous systems with other types of hardware accelerators. A prototype of the GPUSs runtime system was also developed and tested on this type of platform.


7.4. Open research lines

GPU computing is a relatively young discipline, and thus many research lines remain open at the conclusion of this thesis. Some of them can be adapted from existing ideas in other arenas; others are new; the latter group of ideas is likely to evolve together with graphics architectures and programming paradigms.

The following list details some of the open research lines related to this thesis:

The NVIDIA CUBLAS versions used for the evaluation and development of the ideas in this thesis do not support overlapping calculations on the GPU with data transfers. With the advent of newer versions that support this feature, introducing overlapping techniques on single-GPU systems, multi-GPU systems, and clusters of GPUs opens a new research line aimed at hiding the bus latency; a stream-based sketch of this overlap is given after this list. In particular, the runtime-based approach for systems with multiple GPUs will require a full redesign in order to exploit these overlapping capabilities.

Although the port of PLAPACK to clusters of GPUs combines improved programmability for message-passing architectures with remarkable performance, PLAPACK will soon be superseded by Elemental [113]. Fortunately, many of the design decisions in the Elemental framework are similar to those adopted in the early development of PLAPACK. A port of the Elemental framework to clusters of GPUs is also planned.

In the adaptation of message-passing libraries, inter-node parallelism and the reduction of data transfers between processes are accomplished through an appropriate choice of algorithm. In the case described in this thesis, one process is spawned per GPU (a rank-to-GPU binding sketch is given after this list). However, when more than one GPU per node is available, this approach can lead to redundant data transfers between memory spaces. An alternative approach is to employ one process per node, delegating the management of the multiple GPUs inside the node to a run-time system such as that described in Chapter 5.

The improvements described in Chapter 5 pursue the goal of reducing data transfers, without taking the scheduling policies into account. An alternative, yet compatible, approach is based on modifying the scheduling policies so that tasks are assigned to the most affine computing resource, a technique usually referred to as cache affinity [44]; a schematic affinity policy is sketched after this list. These techniques have already been implemented in the public release of libflame, but further research is still on the roadmap.

While GPUs offer a near-optimal GFLOPS/price ratio, the main disadvantage of this hardware is its power consumption. Energy-aware GPU computing is a field to be explored in the near future. Run-time systems provide a powerful tool to monitor and manage the configuration of the different computing units (in this case GPUs) according to their execution status, or to redesign the scheduling policies so that power consumption is taken into account; a minimal NVML-based monitoring sketch is given after this list.
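
A minimal sketch of the compute/transfer overlap discussed above, assuming a hypothetical double-buffered pipeline in which an asynchronous copy stages tile t+1 while cuBLAS consumes tile t; the tile bookkeeping is illustrative, and the host tiles must be page-locked (cudaMallocHost) for the copy to proceed asynchronously:

```c
#include <cuda_runtime.h>
#include <cublas_v2.h>

/* bytes = n * n * sizeof(double); d_buf holds two device tile buffers. */
void pipelined_gemm(cublasHandle_t handle, const double *h_tiles,
                    double *d_buf[2], double *d_B, double *d_C,
                    int nt, int n, size_t bytes)
{
    cudaStream_t copy_s, comp_s;
    cudaStreamCreate(&copy_s);
    cudaStreamCreate(&comp_s);
    const double one = 1.0;

    cudaMemcpyAsync(d_buf[0], h_tiles, bytes, cudaMemcpyHostToDevice, copy_s);
    for (int t = 0; t < nt; t++) {
        cudaStreamSynchronize(copy_s);           /* tile t is resident  */
        if (t + 1 < nt)                          /* prefetch tile t + 1 */
            cudaMemcpyAsync(d_buf[(t + 1) % 2],
                            h_tiles + (size_t)(t + 1) * n * n,
                            bytes, cudaMemcpyHostToDevice, copy_s);
        cublasSetStream(handle, comp_s);         /* compute on tile t   */
        cublasDgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N, n, n, n,
                    &one, d_buf[t % 2], n, d_B, n, &one, d_C, n);
        cudaStreamSynchronize(comp_s);
    }
    cudaStreamDestroy(copy_s);
    cudaStreamDestroy(comp_s);
}
```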
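
For the one-process-per-GPU mapping discussed above, one common recipe binds each node-local MPI rank to a distinct device; the use of MPI-3's MPI_Comm_split_type to obtain the node-local rank is an assumption about the available MPI version:

```c
#include <cuda_runtime.h>
#include <mpi.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    MPI_Comm node_comm;
    int local_rank, num_gpus;
    /* Group the ranks that share a node, then number them locally. */
    MPI_Comm_split_type(MPI_COMM_WORLD, MPI_COMM_TYPE_SHARED, 0,
                        MPI_INFO_NULL, &node_comm);
    MPI_Comm_rank(node_comm, &local_rank);
    cudaGetDeviceCount(&num_gpus);

    /* One process per GPU: local rank i drives device i. */
    cudaSetDevice(local_rank % num_gpus);

    /* ... per-GPU work: local BLAS calls, MPI exchanges between nodes ... */

    MPI_Comm_free(&node_comm);
    MPI_Finalize();
    return 0;
}
```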
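
The cache-affinity policy mentioned above can be sketched as a scoring loop that prefers the accelerator already holding most of a task's operands, breaking ties by load; the structures and weights are illustrative, not libflame's internals:

```c
#define MAX_BLOCKS 4096

typedef struct { int owner[MAX_BLOCKS]; } residency_t; /* block id -> GPU   */
typedef struct { int blocks[3]; int nblocks; } task_t;  /* operand block ids */

int pick_gpu(const task_t *t, const residency_t *r, int ngpus,
             const int *queue_len)
{
    int best = 0, best_score = -(1 << 30);
    for (int g = 0; g < ngpus; g++) {
        int score = 0;
        for (int b = 0; b < t->nblocks; b++)
            if (r->owner[t->blocks[b]] == g)
                score++;                        /* operand resident on g */
        score = score * 100 - queue_len[g];     /* affinity, then load   */
        if (score > best_score) { best_score = score; best = g; }
    }
    return best;
}
```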
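
Finally, for the energy-aware line, a run-time could sample per-GPU power draw through NVML and feed it to the scheduler; a minimal monitoring sketch, with error handling elided:

```c
#include <nvml.h>
#include <stdio.h>

unsigned int gpu_power_milliwatts(unsigned int index)
{
    nvmlDevice_t dev;
    unsigned int mw = 0;
    nvmlDeviceGetHandleByIndex(index, &dev);
    nvmlDeviceGetPowerUsage(dev, &mw);   /* current draw in milliwatts */
    return mw;
}

int main(void)
{
    nvmlInit();
    printf("GPU 0 draws %u mW\n", gpu_power_milliwatts(0));
    nvmlShutdown();
    return 0;
}
```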

Other improvements and research lines will ultimately be dictated by the technological evolution of graphics hardware. To name three possible improvement scenarios, related to each of the parts of this thesis:

Simultaneous execution of kernels can boost the performance of block-oriented BLAS-3 algorithms by executing, where possible, the operations on different blocks simultaneously on a single-GPU system. This feature is already included in the latest NVIDIA GPUs; a stream-per-block sketch is given after this list.


A scenario in which direct GPU-GPU communication on multi-GPU systems becomes possible in the near future. In this case, provided the PCIExpress bus ceases to be the main bottleneck in the system, other strategies can be considered to improve performance. More intelligent scheduling policies, in which tasks are mapped to the most affine accelerator (considering which data are necessary for the execution of the task and where those data are located), have already been investigated in the framework of the libflame development [42]; a peer-to-peer copy sketch is given after this list.

Future technological improvements include direct communication between GPU memories via interconnection networks (namely, Infiniband). Adapting those new technologies to our developments would yield higher performance at no cost at the programmability level.

Current hardware trends include the integration of the GPU as an on-chip co-processor next to the general-purpose unit. NVIDIA has recently revealed the integration of ARM processors and graphics processors, and AMD has developed similar products in the framework of the FUSION project. If these novel architectures are successful, many of the techniques and methodologies proposed in this thesis will likely need further adaptation. However, we believe that many of the ideas and techniques investigated here would have a relevant impact on the performance of dense linear algebra implementations on these novel architectures without dramatic conceptual modifications.
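
A minimal sketch of the first scenario above: independent block operations issued in separate CUDA streams so that the hardware may execute them concurrently. The kernel is a stand-in for a real BLAS-3 block operation:

```c
#include <cuda_runtime.h>

__global__ void scale_block(float *blk, float alpha, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) blk[i] *= alpha;
}

void scale_blocks_concurrently(float *d_blocks[], int nblocks, int n)
{
    cudaStream_t s[8];
    for (int b = 0; b < nblocks && b < 8; b++) {
        cudaStreamCreate(&s[b]);
        /* Kernels in distinct streams may overlap on Fermi-class GPUs. */
        scale_block<<<(n + 255) / 256, 256, 0, s[b]>>>(d_blocks[b], 2.0f, n);
    }
    cudaDeviceSynchronize();
    for (int b = 0; b < nblocks && b < 8; b++) cudaStreamDestroy(s[b]);
}
```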
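
For the second scenario, a sketch of a direct device-to-device copy using the peer-to-peer facilities introduced in CUDA 4.0; cudaMemcpyPeer transparently stages through the host when peer access is unavailable:

```c
#include <cuda_runtime.h>

void copy_between_gpus(float *d_dst, int dst_dev,
                       const float *d_src, int src_dev, size_t bytes)
{
    int can = 0;
    cudaDeviceCanAccessPeer(&can, dst_dev, src_dev);
    if (can) {
        cudaSetDevice(dst_dev);
        cudaDeviceEnablePeerAccess(src_dev, 0);   /* flags must be 0 */
    }
    /* Direct GPU-GPU transfer when P2P is on; host staging otherwise. */
    cudaMemcpyPeer(d_dst, dst_dev, d_src, src_dev, bytes);
}
```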

