
Elena Cagliero, Ezio Venturino

First model: fourth equilibrium

[Three time-series panels: A, T and U versus time, t = 0 to 300.]

Figure 4: Equilibrium E4 is obtained here for the parameters σ = 0.4, r = 0.5, μ = 0.2, q = 0.5, w = 0.5, m = 0.3, f = 0.2, g = 0.1, K = 10.

coexistence equilibrium is also possible, with both populations and endemic disease.

The only alternative that is impossible in these circumstances is for predators to thrive on infected prey alone. This is a consequence of the model assumptions, in that infected prey are assumed to be too weak to sustain themselves.

References

[1]V. Ajraldi, M. Pittavino, E. Venturino, Modelling herd behavior in population systems, Nonlinear Analysis Real World Applications, 12 (2011) 2319-2338.

[2]V. Ajraldi, E. Venturino, Mimicking spatial effects in predator-prey models with group defense, Proceedings of the 2009 International Conference on Computational

© CMMSE

Page 263 of 1573

ISBN:978-84-615-5392-1


Ecoepidemics with group defense

First model: limit cycle (2D)

[Three time-series panels: A, T and U versus time, t = 0 to 900.]

Figure 5: Two-dimensional limit cycle around E4 obtained for the parameter values σ = 0.5, r = 0.5, μ = 0.4, q = 0.2, w = 0.5, m = 0.2, g = 0.1, f = 0.3, for which K = K̄ ≡ 3m²/g.

and Mathematical Methods in Science and Engineering, J. Vigo Aguiar, P. Alonso, S. Oharu, E. Venturino, B. Wade (Editors), Gijón, Asturias, Spain, June 30th - July 3rd (2009) 57-66.

[3]E. Beltrami, T. O. Carroll, Modelling the role of viral disease in recurrent phytoplankton blooms, J. Math. Biol. 32 (1994) 857-863.

[4]P. A. Braza, Predator-prey dynamics with square root functional responses, Nonlinear Analysis Real World Applications 13(4) (2012) 1837-1843.

[5]C. Cosner, D. L. DeAngelis, J. S. Ault, D. B. Olson, Effects of spatial grouping on the functional response of predators, Theoretical Population Biology 56 (1999) 65-75.



First model: limit cycle (3D)

[Three-dimensional phase-space plot: U versus T and A.]

Figure 6: Phase space representation of the limit cycle of Figure 5 for the same parameter values.


[6]H. I. Freedman, G. Wolkowicz, Predator-prey systems with group defence: the paradox of enrichment revisited, Bull. Math. Biol. 48 (1986) 493-508.

[7]K. P. Hadeler, H. I. Freedman, Predator-prey populations with parasitic infection, J. of Math. Biology 27 (1989) 609-631.

[8]H. Malchow, S. Petrovskii, E. Venturino, Spatiotemporal patterns in Ecology and Epidemiology, CRC, Boca Raton, 2008.

[9]E. Venturino, Epidemics in predator-prey models: disease among the prey, in O. Arino, D. Axelrod, M. Kimmel, M. Langlais: Mathematical Population Dynamics:



Analysis of Heterogeneity, Vol. one: Theory of Epidemics, Wuertz Publishing Ltd, Winnipeg, Canada, p. 381-393, 1995.

[10]E. Venturino, Epidemics in predator-prey models: disease in the predators, IMA Journal of Mathematics Applied in Medicine and Biology 19 (2002) 185-205.

[11]E. Venturino, Ecoepidemiology 15 years later: a review, Numerical Analysis and Applied Mathematics, T. Simos (Editor), Proceedings of ICNAAM 2007, AIP 936 (2007) 31-34.

[12]E. Venturino, A minimal model for ecoepidemics with group defense, J. of Biological Systems 19(4) (2011) 763-785.


Proceedings of the 12th International Conference on Computational and Mathematical Methods in Science and Engineering, CMMSE 2012, July 2-5, 2012.

Applications of quantum thermal baths in vibrational spectroscopy

Florent Calvo1

1 CNRS, LASIM, University of Lyon, France

emails: fcalvo@lasim.univ-lyon1.fr

Abstract

Quantum nuclear effects are important for weakly bound or light atoms and at low temperatures. They are manifested by residual energy stored as zero-point vibrations, tunneling, and possible exchange effects in bosonic systems. Unfortunately, the vibrational Schrödinger equation is difficult to solve except for very small systems. Quantum thermal baths (QTBs) have been proposed in recent years as an alternative to the convenient, but still computationally expensive, schemes based on path integrals. The QTB method relies on propagating a stochastic Langevin equation with a correlated (colored) noise designed to produce a power spectrum which satisfies the quantum fluctuation-dissipation theorem. As such, its numerical cost is close to that of a standard classical Langevin equation, making it a very promising technique for large-scale atomic and molecular systems.

In the present contribution, we discuss the application of the QTB approach for gas-phase systems and their equilibrium properties and vibrational spectroscopy. The implementation of the method is described, with an emphasis on specific computational aspects involved in the generation of the colored noise. The performance of the method is assessed by comparison with dedicated path-integral molecular dynamics simulations for simple ionic clusters, as well as polycyclic aromatic hydrocarbons.
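The essential ingredient of the QTB, a colored noise whose power spectrum follows the quantum fluctuation-dissipation theorem, can be illustrated with a minimal sketch. The function names and the Fourier-filtering route below are illustrative assumptions, not taken from this work: white Gaussian noise is reshaped in frequency space so that its spectrum follows θ(ω, T) = (ħω/2) coth(ħω/2kBT), which tends to kBT in the classical limit and to the zero-point energy ħω/2 at low temperature.

```python
import numpy as np

def quantum_theta(omega, kT, hbar=1.0):
    # theta(omega, T) = (hbar*|omega|/2) * coth(hbar*|omega|/(2*kT));
    # reduces to kT as omega -> 0 and to hbar*|omega|/2 as T -> 0.
    x = hbar * np.abs(omega) / (2.0 * kT)
    return np.where(x > 1e-12,
                    (hbar * np.abs(omega) / 2.0) / np.tanh(np.maximum(x, 1e-12)),
                    kT)

def colored_noise(n_steps, dt, kT, hbar=1.0, rng=None):
    # Filter white Gaussian noise in Fourier space so that its power
    # spectrum is proportional to quantum_theta at each frequency.
    rng = np.random.default_rng() if rng is None else rng
    white = rng.standard_normal(n_steps)
    omega = 2.0 * np.pi * np.fft.rfftfreq(n_steps, d=dt)  # angular frequencies
    shaped = np.fft.rfft(white) * np.sqrt(quantum_theta(omega, kT, hbar))
    return np.fft.irfft(shaped, n=n_steps)
```

In an actual QTB simulation this noise drives a Langevin equation on the nuclei; the literature cited by this abstract discusses the proper normalization and on-the-fly generation of the filter.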

Key words: Molecular dynamics, quantum nuclear effects, Langevin equation


Proceedings of the 12th International Conference on Computational and Mathematical Methods in Science and Engineering, CMMSE 2012, July 2-5, 2012.

Application of Auto-Tuning Techniques to High-Level Linear Algebra Shared-Memory Subroutines

Jesús Cámara1, Javier Cuenca2, Domingo Giménez1 and Antonio M. Vidal3

1 Department of Informatics and Systems, University of Murcia, 30071, Murcia, Spain

2 Department of Engineering and Technology of Computers, University of Murcia, 30071, Murcia, Spain

3 Department of Computer Systems and Computation, Polytechnic University of Valencia, 46022, Valencia, Spain

emails: jcm23547@um.es, jcuenca@um.es, domingo@um.es, avidal@dsic.upv.es

Abstract

The introduction of auto-tuning techniques in linear algebra shared-memory routines is analysed. Our goal is to study how auto-tuning techniques can be included in routines which call lower-level routines, using information from the installation of the low-level routines to take decisions at run time that reduce the total execution time. The study is carried out with Cholesky factorization routines: the potrf LAPACK routine with internal calls to multithreaded MKL routines, and a modified LAPACK version with a two-level implementation of its internal level-3 BLAS subroutines. For this modified LAPACK version, the experiments show that the number of threads to use at each parallelism level and the block size used internally by the potrf routine can be selected to automatically obtain satisfactory execution times.

Key words: autotuning, multithreading, linear algebra, OpenMP, MKL

1 Introduction

The appearance of multicore and cc-NUMA systems has led to the development of optimized software for these systems. Software optimization techniques are used in parallel routines to decide how to execute them to obtain the lowest execution time. Decisions are taken at run time as a result of the work performed at installation time, modelling the execution time of


Auto-Tuning Techniques on High-Level Subroutines

the routines or by applying some empirical study of the behaviour of the routines. Moreover, these decisions may vary depending on the type of computational system used. Some of these decisions are: selecting the appropriate number of threads to use at each level of parallelism, deciding how to assign processes to processors, or selecting the correct block size in the case of blocked algorithms. In this work, the previous ideas of installing multithreaded basic linear algebra routines in large cc-NUMA systems [1] are combined and extended. In [1], auto-tuning is carried out by applying installation techniques to the BLAS-3 matrix multiplication routine (dgemm), which constitutes the basic subroutine for many computational routines. In this paper the same ideas are applied to the higher-level routine potrf of the LAPACK library: the Cholesky factorization of a symmetric positive definite matrix. The same methodology has been applied to other high-level routines (dgesv, zsysv, dgetrf, etc.) with similar results. For that purpose, the matrix multiplication routine used in potrf is replaced by a parallel version (called dgemm2L) specially adapted for large multicore systems.

The rest of the paper is organized in the following way. Section 2 summarizes the autotuning methodology for linear algebra routines in large NUMA systems. Section 3 analyzes the behaviour of the Cholesky factorization when the auto-tuning methodology is applied to the dgemm routine, and the results are compared with those of the reference version of the LAPACK potrf routine when the number of threads and the block size are automatically selected. Finally, the conclusions and future research lines are shown in Section 4.

2 The auto-tuning methodology

The implementations of shared-memory linear algebra routines are normally not very scalable. This produces a degradation of the performance in large cc-NUMA systems. To improve the scalability of the routines, the auto-tuning methodology explained in [1] for the dgemm routine can be extended to higher-level routines. The goal of this methodology is to select the most appropriate number of threads to use at each level of parallelism.

This paper shows how this methodology can be applied to high-level linear algebra routines, using the Cholesky factorization as proof of concept. The Cholesky factorization is normally used to solve a linear system of the type AX = B, with A a symmetric positive definite matrix. To compute it, we can use the multithreaded MKL version of the potrf routine or the reference version included in the LAPACK library. Algorithm 1 shows the scheme used in the LAPACK potrf routine to compute the Cholesky factorization.

The installation of the routine in the system is made by executing the routine for each matrix size specified in the installation set 1 by varying the number of OpenMP and MKL threads at each level of parallelism, from one to the number of available cores in the system, and using a combination of threads not exceeding the maximum number of cores. Once the routine has been installed, the number of threads with which the lowest time is obtained

1 An installation set consists of some problem sizes used to install the routine in the system.


Jesús Cámara, Javier Cuenca, Domingo Giménez, Antonio M. Vidal

*     Compute the Cholesky factorization A = L * L**T.
      DO 20 J = 1, N, NB
*        Update and factorize the current diagonal block and test
*        for non-positive-definiteness.
         JB = MIN( NB, N-J+1 )
         CALL dsyrk( ... )
         CALL dpotf2( ... )
         IF( J+JB .LE. N ) THEN
*           Compute the current block column.
            CALL dgemm( ... )
            CALL dtrsm( ... )
         END IF
   20 CONTINUE

Algorithm 1: Scheme of the LAPACK Cholesky (potrf) routine.
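The structure of Algorithm 1 can be mirrored in a short NumPy sketch. This is an illustrative reimplementation of the same blocked scheme, with NumPy calls standing in for the dsyrk, dpotf2, dgemm and dtrsm kernels; it is not the LAPACK code itself.

```python
import numpy as np

def blocked_cholesky(A, nb=64):
    # Computes the lower-triangular L with A = L @ L.T, following the
    # blocked loop of LAPACK's potrf (Algorithm 1).
    A = np.asarray(A, dtype=float)
    n = A.shape[0]
    L = np.zeros_like(A)
    for j in range(0, n, nb):
        jb = min(nb, n - j)                       # JB = MIN(NB, N-J+1)
        # dsyrk: update the diagonal block with previously computed columns.
        S = A[j:j+jb, j:j+jb] - L[j:j+jb, :j] @ L[j:j+jb, :j].T
        # dpotf2: unblocked factorization of the small diagonal block
        # (raises LinAlgError if A is not positive definite).
        L[j:j+jb, j:j+jb] = np.linalg.cholesky(S)
        if j + jb < n:
            # dgemm: update the block column below the diagonal block.
            C = A[j+jb:, j:j+jb] - L[j+jb:, :j] @ L[j:j+jb, :j].T
            # dtrsm: triangular solve X @ L_jj.T = C for the block column.
            L[j+jb:, j:j+jb] = np.linalg.solve(L[j:j+jb, j:j+jb], C.T).T
    return L
```

Most of the floating-point work is concentrated in the dgemm-like update of the block column, which is why the auto-tuning effort focuses on that kernel.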

for each problem size is stored, and, at execution time, for a particular problem size, the number of threads to be used to solve the problem is selected by using the information stored during the installation phase.
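The install-then-select cycle just described can be sketched as follows. The helper names are hypothetical and a nearest-size lookup stands in for the interpolation used in the paper; a real installation would set the OpenMP and MKL thread counts and call the actual routine.

```python
import time
from itertools import product

def install(routine, installation_set, max_cores):
    # Installation phase: for each problem size, time the routine over all
    # (outer, inner) thread combinations whose product does not exceed the
    # number of cores, and store the fastest combination.
    best = {}
    for n in installation_set:
        timings = {}
        for omp, mkl in product(range(1, max_cores + 1), repeat=2):
            if omp * mkl > max_cores:
                continue
            t0 = time.perf_counter()
            routine(n, omp, mkl)
            timings[(omp, mkl)] = time.perf_counter() - t0
        best[n] = min(timings, key=timings.get)
    return best

def select_threads(best, n):
    # Run-time decision: reuse the combination stored for the nearest
    # installed problem size.
    nearest = min(best, key=lambda m: abs(m - n))
    return best[nearest]
```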

In the Cholesky factorization, the auto-tuning methodology can be applied to its internal dgemm routine, which is used to perform all the matrix multiplications involved in the computation of the elements of the blocks (whose size is automatically determined by the LAPACK ILAENV function) of the lower triangular part of matrix A. Therefore, it is necessary to work directly with the reference potrf routine. The dgemm routine is then replaced by a parallel implementation that uses two levels of parallelism (dgemm2L) and the auto-tuning process is performed in order to select the most appropriate number of threads at each level of parallelism. The other routines internally used in the Cholesky factorization are called using their corresponding multithreaded MKL implementation.

Experiments have been carried out on a platform called Saturno, a shared-memory system with four hexa-cores (24 cores), and the installation set used is: {256, 768, 1280, 1792, 2304, 2816, 3328, 3840, 4352}. Different problem sizes are used for validation (the validation set). At run time, the decisions for the problem sizes in the validation set are taken by applying an interpolation process to the information stored during the installation phase. Table 1 shows the execution times (in seconds) obtained with the auto-tuning methodology and the lowest execution times obtained experimentally by a perfect oracle. The number of OpenMP and MKL threads used at each level of parallelism is also shown. The number of threads most frequently used is 24 (the total number of cores of the system), but with different combinations (3-8, 4-6, 6-4). The times obtained with the auto-tuning methodology are normally close to the optimum, and the total number of threads used is also similar.

c CMMSE

Page 270 of 1573

ISBN:978-84-615-5392-1

Auto-Tuning Techniques on High-Level Subroutines

In larger systems (such as those considered in [1]), the differences in execution times would be higher.

 

 

               Optimum                       Auto-Tuning
   N      OMP      MKL      Time        OMP      MKL      Time
       (threads) (threads)  (sec)    (threads) (threads)  (sec)
  512      1       16     0.001219      1       14     0.001447
 1024      4        6     0.003989      3        8     0.004230
 1536      4        6     0.007623      6        4     0.008004
 2048      2       12     0.013438      4        6     0.014118
 2560      7        3     0.032550      6        4     0.084170
 3072      7        3     0.050523      6        4     0.083533
 3584      3        8     0.078012      4        6     0.078633
 4096      3        8     0.124691      4        6     0.127465

Table 1: Execution times (in seconds) obtained with the application of the auto-tuning methodology (Auto-Tuning) to the dgemm routine of potrf and lowest experimental execution time (Optimum), and number of OpenMP and MKL threads used in these executions.

3 Application to Cholesky factorization routines

In order to analyse the improvement achieved with this methodology when applied to linear algebra routines which call lower-level routines, a comparative study of the execution times obtained by different implementations of the Cholesky potrf routine has been carried out. Table 2 shows the results obtained for the reference LAPACK routine, a potrf LAPACK routine which internally calls multithreaded MKL routines (dsyrk, dpotf2 and dtrsm), and the modified LAPACK routine where dgemm is replaced by the auto-tuned dgemm2L routine. The last column shows the speed-up achieved by the modified LAPACK routine with auto-tuning with respect to the lowest execution time obtained with the LAPACK routine with multithreaded MKL. The results of applying the auto-tuning methodology are satisfactory, but for some problem sizes a loss of performance occurs due to the interpolation applied to select the number of threads.

3.1 Selecting the block size

The Cholesky factorization of LAPACK (Algorithm 1) is computed by blocks. The size and form of these blocks vary depending on the value internally selected by the LAPACK ILAENV function. In this function, that value is selected using information about the problem size, but not about the number of threads used. Therefore, we can reduce the execution time even more by selecting the optimum block size for each value of the installation set. To



   N       LAPACK      LAPACK+MKL     LAPACK+Auto-Tuning   Speed-Up
  512     0.043130    0.003948 (9)     0.003793 (1,14)       1.04
 1024     0.332920    0.012877 (12)    0.011624 (3,8)        1.10
 1536     1.104757    0.024598 (24)    0.024420 (6,4)        1.00
 2048     2.614030    0.075525 (24)    0.076562 (4,6)        0.99
 2560     5.075289    0.109087 (24)    0.165639 (6,4)        0.66
 3072     8.787374    0.202955 (21)    0.237618 (6,4)        0.85
 3584    13.934977    0.279215 (21)    0.323004 (4,6)        0.86
 4096    20.894776    0.390708 (21)    0.383885 (4,6)        1.02

Table 2: Execution times (in seconds) obtained with different versions of the potrf routine: the reference LAPACK routine (LAPACK), the LAPACK routine with multithreaded MKL kernels (LAPACK+MKL) and the LAPACK routine with the auto-tuning methodology (LAPACK+Auto-Tuning). In brackets, the number of threads with which the execution times are obtained. The last column shows the speed-up achieved by the auto-tuning version with respect to LAPACK+MKL.

apply this idea to multithreaded routines, two parameters must be selected: the number of threads and the block size. The number of threads has been selected for the dgemm routine by applying the auto-tuning methodology. Now, for the selection of the block size it is necessary to work directly with the LAPACK potrf routine, so that the block size selected by the ILAENV function can be modified in order to select the best block size.

Table 3 compares, for different matrix sizes, the execution time obtained for the potrf routine when the block size is internally selected by ILAENV and the execution time when the block size is selected with the auto-tuning technique. All the experiments have been done in Saturno using the same installation set: {256, 768, 1280, 1792, 2304, 2816, 3328, 3840, 4352}, with block sizes in powers of 2 from 32 to 512 and a number of cores from 1 to 24. When the potrf routine uses the ILAENV function to select the block size, the same value is used for several matrix sizes regardless of the number of threads. When the number of threads and the block size are selected with the auto-tuning methodology, lower execution times are obtained and the speed-up achieved is higher than that obtained in Table 2 by selecting only the number of OpenMP and MKL threads. For small matrix sizes the use of larger blocks is preferable, but for larger sizes a value lower than that selected by the ILAENV function (which does not consider the number of threads, only the matrix size) is more appropriate. The improvement achieved by selecting the appropriate block size is between 6% and 30% for most matrix sizes. Therefore, if we include the block size parameter in the auto-tuning methodology, better results are obtained for the potrf routine. Similar results are obtained for other routines, and the advantage of the auto-tuning is more apparent in larger systems.
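Adding the block size turns the installation into a two-parameter sweep. Below is a minimal sketch under the same caveats as before (hypothetical names, exhaustive search over the grid of thread counts and power-of-two block sizes):

```python
import time
from itertools import product

def install_threads_and_block(routine, installation_set, thread_options, block_sizes):
    # Two-parameter installation: time the routine for every
    # (threads, block size) pair and keep the fastest pair per problem size.
    best = {}
    for n in installation_set:
        timings = {}
        for threads, nb in product(thread_options, block_sizes):
            t0 = time.perf_counter()
            routine(n, threads, nb)
            timings[(threads, nb)] = time.perf_counter() - t0
        best[n] = min(timings, key=timings.get)
    return best
```

With {32, ..., 512} as block sizes and up to 24 cores this grid stays small enough for an exhaustive sweep; in larger systems the interpolation over the installation set keeps run-time selection cheap.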

