Quantum Monte Carlo with QMC=Chem: Basic concepts

Michel Caffarel and Anthony Scemama
CNRS and Université de Toulouse

Presentation of the basic notions required for the tutorial. More information on our scientific website: http://qmcchem.ups-tlse.fr
Variational Monte Carlo (VMC)

In the tutorial: method = "VMC"

• N electrons with positions r1, ..., rN
• Given a trial approximate wavefunction ΨT(r1, ..., rN), we want to compute the variational energy

  E = ⟨ΨT|H|ΨT⟩ / ⟨ΨT|ΨT⟩

or any property

  ⟨A⟩ = ⟨ΨT|A|ΨT⟩ / ⟨ΨT|ΨT⟩
Variational Monte Carlo (VMC)

In the tutorial the trial wavefunction used is either a Hartree-Fock or a CASSCF wavefunction. The molecule treated will be N2: 14 electrons and 7 doubly-occupied molecular orbitals φ1, ..., φ7 in the S = 0 ground state.
N2 molecule

φ1 = 1σ(1s)    φ2 = 1σ*(1s)    (core orbitals, not shown on the diagram)
φ3 = 2σ(2s)    φ4 = 2σ*(2s)
φ5 = 1π(2px)   φ6 = 1π(2py)    φ7 = 3σ(2pz)
Variational Monte Carlo (VMC)

In quantum chemistry ΨSCF is usually expressed using space (r) and spin variables, σ = α (or ↑) and σ = β (or ↓), as a single determinant built from the 14 spin-orbitals φ1α, φ1β, ..., φ7α, φ7β:

  ΨSCF(r1, σ1, ..., rN, σN) = Det| φ1α  φ1β  ...  φ7α  φ7β |
Variational Monte Carlo (VMC)

In QMC methods we use an equivalent spin-free formalism (the averages are the same): the wavefunction ΨT depends only on the spatial positions of the electrons. Here electrons 1 to 7 have been chosen as α electrons and electrons 8 to 14 as β electrons (an arbitrary choice). The determinant then factorizes into an α and a β determinant:

  ΨT^SCF(r1, ..., rN) = Det| φi(rj) |(i=1..7, j=1..7)  ×  Det| φi(rj) |(i=1..7, j=8..14)
Variational Monte Carlo (VMC)

The CASSCF wavefunction is based on a determinantal expansion over all possible excitations within a set of Nact active orbitals chosen among the M molecular orbitals (M = total size of the basis set):

  ΨT^CAS-SCF = Σ(all excitations K = (k1^α, ..., k7^α, k1^β, ..., k7^β))
               cK  Det| φ(ki^α)(rj) |(j=1..7)  ×  Det| φ(ki^β)(rj) |(j=8..14)
Variational Monte Carlo (VMC)

In this tutorial the HF and CASSCF wave functions are obtained from GAMESS.
Variational Monte Carlo (VMC)

How to compute the variational energy E = ⟨ΨT|H|ΨT⟩ / ⟨ΨT|ΨT⟩ ?

Standard ab initio wavefunction approaches:
• ΨT = expansion over a sum of determinants
• Each determinant is an antisymmetrized product of molecular orbitals φi
• Each molecular orbital is expressed as a sum over primitive Gaussian functions: φi = Σj cij χj

⇒ E is obtained as a huge sum of elementary one-electron and two-electron integrals
Variational Monte Carlo (VMC)

VMC = computing the very same variational energy using Monte Carlo techniques.

Big advantage: no need to calculate the numerous one-electron and two-electron integrals.
Variational Monte Carlo (VMC)

VMC = generating in a probabilistic way billions of "electronic configurations" R = (r1, ..., rN), distributed in the 3N-dimensional space according to the probability density

  Π(R) = ΨT²(r1, ..., rN) / ∫ dr1...drN ΨT²(r1, ..., rN)

The average energy can be expressed as

  E = ⟨ΨT|H|ΨT⟩ / ⟨ΨT|ΨT⟩
    = ∫ dr1...drN ΨT²(r1, ..., rN) [HΨT/ΨT] / ∫ dr1...drN ΨT²(r1, ..., rN)
    = ∫ dR Π(R) [HΨT/ΨT](R)
Variational Monte Carlo (VMC)

Defining the local energy EL as

  EL(r1, ..., rN) ≡ HΨT/ΨT

we have

  E = lim(K→+∞) (1/K) Σ(k=1..K) EL(R^(k))

where K = number of electronic configurations generated by Monte Carlo.
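As an aside (not on the slides), the zero-variance property of the local energy is easy to see on a toy problem. The sketch below uses a 1D harmonic oscillator, H = -1/2 d²/dx² + 1/2 x², with a Gaussian trial function ψT(x) = exp(-α x²); both the model and the parameter values are illustrative assumptions:

```python
import numpy as np

# Local energy E_L = H psi_T / psi_T for H = -1/2 d^2/dx^2 + 1/2 x^2
# with psi_T(x) = exp(-alpha x^2).
# Analytically, E_L(x) = alpha + x^2 (1/2 - 2 alpha^2): when alpha = 1/2
# the trial function is exact and E_L is constant (zero variance).

def local_energy(x, alpha):
    # psi''/psi = 4 alpha^2 x^2 - 2 alpha for a Gaussian trial function
    lap_over_psi = 4.0 * alpha**2 * x**2 - 2.0 * alpha
    return -0.5 * lap_over_psi + 0.5 * x**2

x = np.linspace(-3.0, 3.0, 101)

el_exact = local_energy(x, alpha=0.5)    # flat: E_L = 1/2 everywhere
el_approx = local_energy(x, alpha=0.4)   # fluctuates with x

print(el_exact.std(), el_approx.std())
```

The variance of EL over the sampled configurations thus measures the quality of ΨT directly.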
Metropolis algorithm

How to generate electronic configurations according to Π? Answer: with a Metropolis algorithm (many variants exist).

In a Metropolis algorithm there are two steps:
1) Move the electrons with a certain probability
2) Accept or reject this move according to some probability
Metropolis algorithm

A first type of Metropolis algorithm: sampling="Brownian"

1) Electronic configurations are moved according to

  p(R → R′) = 1/(2πτ)^(3N/2) exp[ −(R′ − R − b(R)τ)² / 2τ ]

where b = drift vector = ∇ΨT/ΨT and τ = time step.

2) Configurations are accepted/rejected with probability q:

  q = Min[ 1, Π(R′) p(R′ → R) / Π(R) p(R → R′) ]

A random number χ is drawn uniformly in (0,1). If q > χ the move is accepted; if not, it is rejected.
Metropolis algorithm

In practice, drawing new configurations R′ according to p(R → R′) is very simple: the probability is a product of 3N independent 1D probability densities

  p(R → R′) = Π(i=1..N)  (1/√(2πτ)) exp[ −(x′i − xi − bx,i(R)τ)² / 2τ ]
                       × (1/√(2πτ)) exp[ −(y′i − yi − by,i(R)τ)² / 2τ ]
                       × (1/√(2πτ)) exp[ −(z′i − zi − bz,i(R)τ)² / 2τ ]
Metropolis algorithm

Simply draw 3N independent Gaussian numbers η with zero mean and variance 1:

  (x′i − xi − bx,i(R)τ)/√τ = ηx,i
  (y′i − yi − by,i(R)τ)/√τ = ηy,i
  (z′i − zi − bz,i(R)τ)/√τ = ηz,i

or, equivalently,

  x′i = xi + bx,i(R)τ + √τ ηx,i
  y′i = yi + by,i(R)τ + √τ ηy,i
  z′i = zi + bz,i(R)τ + √τ ηz,i
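Putting the last three slides together, here is a minimal Python sketch of one "Brownian" Metropolis step: a drifted Gaussian move followed by accept/reject. The trial function is a deliberately simple isotropic Gaussian (an assumption for illustration; a real ΨT would be a Slater or Slater-Jastrow form), and the parameter values are arbitrary:

```python
import numpy as np

rng = np.random.default_rng(0)

ALPHA, TAU = 0.5, 0.1   # illustrative trial-function exponent and time step

def ln_psi(R):           # ln psi_T for psi_T(R) = exp(-alpha |R|^2)
    return -ALPHA * np.sum(R**2)

def drift(R):            # b = grad(psi_T)/psi_T
    return -2.0 * ALPHA * R

def ln_gauss(Rp, R):     # ln p(R -> R'), up to an irrelevant constant
    d = Rp - R - drift(R) * TAU
    return -np.sum(d**2) / (2.0 * TAU)

def metropolis_step(R):
    eta = rng.standard_normal(R.shape)              # 3N Gaussian numbers
    Rp = R + drift(R) * TAU + np.sqrt(TAU) * eta    # drifted move
    # q = min(1, Pi(R') p(R'->R) / (Pi(R) p(R->R'))) with Pi = psi_T^2
    ln_q = 2.0 * (ln_psi(Rp) - ln_psi(R)) + ln_gauss(R, Rp) - ln_gauss(Rp, R)
    if np.log(rng.uniform()) < min(0.0, ln_q):
        return Rp, True
    return R, False

R = rng.standard_normal(12)   # e.g. 4 electrons -> 12 coordinates
n_acc = 0
for _ in range(2000):
    R, accepted = metropolis_step(R)
    n_acc += accepted
print("acceptance ratio:", n_acc / 2000)
```

With a small time step the acceptance ratio is close to 1; increasing τ moves the walkers faster but lowers the acceptance.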
Metropolis algorithm

Second type: sampling="Langevin". Use of the true Langevin equation

  dR(t) = P(t)/m dt
  dP(t) = −∇V(R(t)) dt − γ P(t) dt + σ dW(t)

instead of

  dR(t) = b(R(t)) dt + σ dW(t)

See A. Scemama, T. Lelièvre, G. Stoltz, E. Cancès, and M. Caffarel, J. Chem. Phys. 125, 114105 (2006).
Summary about VMC

To make a VMC calculation we need, at each Monte Carlo step, to:
• Compute the values of ΨT, b = ∇ΨT/ΨT, and EL = HΨT/ΨT = −(1/2)∇²ΨT/ΨT + V(R)
• Draw Gaussian and uniform random numbers

Very easy to do: no one- or two-electron integrals to compute.

Important consequence: any form for the trial wavefunction can be used.

In the tutorial: HF and CASSCF wavefunctions, so that we can compare the standard approach and the Monte Carlo one. In practice, more sophisticated wavefunctions, whose averages cannot be calculated with the standard approach, are used.
Slater-Jastrow trial wavefunction

Most popular form:

  ΨT = e^J(r1,...,rN) Σ(k=1..Ndet) ck Detk^α({φi}) Detk^β({φi})

with (typically)

  J(r1, ..., rN) = Σ(i<j) ve-e(rij) + Σ(i,α) ve-n(riα) + Σ(i<j) Σ(α) ve-e-n(rij, riα, rjα)

Electron-electron cusp conditions: when electrons i and j are very close,

  ve-e(rij) → (1/2) rij (spin-unlike)  or  (1/4) rij (spin-like)
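The slide only fixes the short-distance behaviour of ve-e. A common concrete choice (an assumption here, the slides do not specify the functional form) is the Padé form ve-e(r) = a r / (1 + b r), whose slope at r = 0 equals a, so the cusp condition is enforced by the choice of a alone:

```python
# Pade electron-electron Jastrow term (illustrative form, not from the
# slides): v_ee(r) = a r / (1 + b r). Its derivative at r = 0 is a, so
# a = 1/2 enforces the spin-unlike cusp and a = 1/4 the spin-like one,
# for any value of the variational parameter b.

def v_ee(r, a, b=1.0):
    return a * r / (1.0 + b * r)

r = 1e-8  # very small electron-electron separation
slope_unlike = v_ee(r, a=0.5) / r
slope_like = v_ee(r, a=0.25) / r
print(slope_unlike, slope_like)
```

At large r the term saturates to a/b, so b controls the range of the correlation hole while the cusp stays exact.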
Other trial wavefunctions

• Jastrow-Geminal wavefunction (Casula, Sorella and coll.)
• Wavefunctions with backflow (Drummond, Rios, Needs, Mitas, and coll.)
• Pfaffian wavefunction (Mitas and coll.)
• Multi-Jastrow wavefunction (Bouabça, M.C. and coll.)
• VB-type wavefunctions (Braida, M.C., Goddard and coll.)
• etc.
Fixed-Node Diffusion Monte Carlo (DMC)

DMC = VMC (move + acceptance/rejection step) + birth-death (branching) process.

Birth-death process: after the VMC step, the electronic configuration is duplicated into M copies (M = 0, 1, 2, ...):

  M = Integer_part( exp[−τ(EL − Eref)] + u )

where u is a uniform random number in (0,1), so that the average number of copies is equal to exp[−τ(EL − Eref)].
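The branching rule above can be checked directly (the energies and time step below are arbitrary illustrative values):

```python
import math
import random

# DMC birth-death rule from the slide: number of copies of a walker is
#   M = int( exp[-tau (E_L - E_ref)] + u ),  u uniform in (0,1).

def n_copies(e_local, e_ref, tau, u):
    weight = math.exp(-tau * (e_local - e_ref))
    return int(weight + u)

# Deterministic checks: weight 1 with u = 0.3 gives 1 copy;
# weight slightly below 1 with u = 0 kills the walker (0 copies).
print(n_copies(e_local=-10.0, e_ref=-10.0, tau=0.01, u=0.3))
print(n_copies(e_local=-9.0, e_ref=-10.0, tau=0.01, u=0.0))

# Averaged over u, the mean number of copies equals the weight itself:
random.seed(0)
copies = [n_copies(-10.5, -10.0, 0.1, random.random()) for _ in range(200000)]
print(sum(copies) / len(copies), "vs", math.exp(0.05))
```

Walkers with EL below Eref multiply, walkers with EL above Eref die out, which is what steers the population toward the low-energy regions.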
Population of walkers in DMC

Because the number of electronic configurations can vary, to perform a DMC calculation we need to introduce a population of electronic configurations (or walkers). In QMC=Chem this number is denoted walk_num. It can be shown that by adjusting Eref ≃ E0 the size of the population of walkers can be kept constant on average. Note that in VMC it is not necessary to introduce a population of walkers, but in general we also use walk_num walkers.
Fixed-Node Diffusion Monte Carlo (DMC)

Using the DMC rules, the density obtained is

  Π(R) = ΨT(r1,...,rN) Φ0^FN(r1,...,rN) / ∫ dr1...drN ΨT(r1,...,rN) Φ0^FN(r1,...,rN)

where Φ0^FN is the solution of the Schrödinger equation

  H Φ0^FN(r1,...,rN) = E0^FN Φ0^FN(r1,...,rN)

with the fixed-node (FN) constraint: Φ0^FN(r1,...,rN) = 0 whenever ΨT(r1,...,rN) = 0.
Fixed-Node Diffusion Monte Carlo (DMC)

The estimate of the exact energy is obtained, as in VMC, by averaging the local energy:

  ⟨⟨EL⟩⟩ = ∫ dr1...drN ΨT Φ0^FN [HΨT/ΨT] / ∫ dr1...drN ΨT Φ0^FN
         = ∫ dr1...drN Φ0^FN HΨT / ∫ dr1...drN Φ0^FN ΨT
         = E0^FN

• In general the fixed-node error is small (a few percent of the correlation energy)
• Variational property: E0^FN ≥ E0
Fixed-Node Diffusion Monte Carlo (DMC)

The expectation value of an observable A is biased in DMC:

  ⟨A⟩ = ∫ dr1...drN ΨT Φ0 A / ∫ dr1...drN ΨT Φ0  ≠  ∫ dr1...drN Φ0² A / ∫ dr1...drN Φ0²

A simple way of improving the average:

  ⟨A⟩ ≈ 2⟨A⟩DMC − ⟨A⟩VMC

Indeed, writing δφ = Φ0 − ΨT:

  ⟨Φ0|A|Φ0⟩ = ⟨ΨT + δφ|A|ΨT + δφ⟩
            = ⟨ΨT|A|ΨT⟩ + 2⟨δφ|A|ΨT⟩ + O(δφ²)
            = 2⟨Φ0|A|ΨT⟩ − ⟨ΨT|A|ΨT⟩ + O(δφ²)
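The second-order cancellation can be verified numerically on a 1D toy model (everything below is an illustrative assumption: φ0 is an exact Gaussian ground state, ψT = φ0 + λ δφ an imperfect trial function, and A = x²):

```python
import numpy as np

# Check that 2<A>_mixed - <A>_VMC has an O(dphi^2) error while the
# mixed ("DMC") and VMC estimators each have an O(dphi) error.

x = np.linspace(-6.0, 6.0, 4001)
phi0 = np.exp(-x**2 / 2.0)            # exact ground state (toy model)
dphi = x**2 * np.exp(-x**2 / 2.0)     # some symmetric error component
lam = 0.05                            # size of the trial-function error
psi_t = phi0 + lam * dphi
A = x**2

def ratio(f, g):
    # <f|A|g>/<f|g> on a uniform grid (the grid spacing cancels)
    return np.sum(f * A * g) / np.sum(f * g)

exact = ratio(phi0, phi0)             # = 1/2 for this model
mixed = ratio(psi_t, phi0)            # biased mixed (DMC-like) estimator
vmc = ratio(psi_t, psi_t)             # VMC estimator
extrap = 2.0 * mixed - vmc            # error reduced to O(lam^2)

print(exact, mixed, vmc, extrap)
```

With λ = 0.05 the extrapolated value is an order of magnitude closer to the exact one than either raw estimator.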
QMC is fully parallelizable

Do not forget to show the video!
Benchmarks

• S. Manten and A. Luchow, J. Chem. Phys. 115, 5362 (2001).
• J. C. Grossman, J. Chem. Phys. 117, 1434 (2002).
• Nemec et al., J. Chem. Phys. 132, 034111 (2010).

G1 set of Pople and collab. (1990) = 55 molecules. Atomization energies. Mean Absolute Deviation εMAD:

• FN-DMC: εMAD = 2.9 kcal/mol
• LDA: εMAD ∼ 40 kcal/mol
• B3LYP and B3PW91: εMAD ∼ 2.5 kcal/mol
• CCSD(T)/aug-cc-pVQZ: εMAD ∼ 2.8 kcal/mol
QMC simulations for Amyloid β

[Figure: Aβ(28-35) β-strand structure simulated with QMC]
[Figure: Aβ(28-35) α-helix structure simulated with QMC]
Petascale QMC simulations

• Simulations done on the CURIE supercomputer at CEA, France, December 2011. Key people: Anthony Scemama (LCPQ) and Bull engineers
• QMC simulations involving 122 atoms and 432 electrons (largest chemical system to date)
• Use of about 80 000 cores during about 24 h, ∼ 960 Tflops/s (real performance)
• 32.5% of the peak performance of the Intel Sandy Bridge processor, thanks to optimization tools (Exascale Computing Research Lab.)
Practical aspects: convergence of the energy

As we have seen, in VMC and DMC the energy is estimated as the average of the local energy over K electronic configurations:

  E ≃ (1/K) Σ(k=1..K) EL(R^(k))

How do we get an estimate of the statistical error on E?
Practical aspects: convergence of the energy

1) Generate a certain number Nb of "blocks", each defined as a set of walk_num walkers making num_step moves.

2) Each block k contains about Nk = walk_num × num_step values of the local energy, leading to

  Ek = (1/Nk) Σ(i=1..Nk) EL(R^(i))

3) If num_step is sufficiently large, the blocks are statistically independent and Gaussian-distributed (central-limit theorem for Markovian processes).
Practical aspects: convergence of the energy

4) The energy is then estimated as the average over the Nb blocks

  E = (1/Nb) Σ(k=1..Nb) Ek

and the error as

  δE = σ/√Nb

where σ is an estimate of the standard deviation of the Gaussian distribution:

  σ = √[ (1/(Nb−1)) Σ(k=1..Nb) (Ek − E)² ]
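The block-averaging recipe above fits in a few lines (the synthetic local energies below stand in for the output of a real run; the block sizes are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(42)

# Synthetic "local energies": n_blocks blocks, each containing
# walk_num * num_step values, drawn around a mean of -109 hartree.
num_step, walk_num, n_blocks = 100, 30, 50
e_local = rng.normal(-109.0, 0.5, size=(n_blocks, walk_num * num_step))

block_means = e_local.mean(axis=1)       # E_k: one average per block
energy = block_means.mean()              # E = average over the blocks
sigma = block_means.std(ddof=1)          # std deviation of the E_k
error = sigma / np.sqrt(n_blocks)        # delta E = sigma / sqrt(Nb)

print(f"E = {energy:.4f} +/- {error:.4f}")
```

Note that σ here is the spread of the *block means*, not of the raw local energies; that is what makes the estimate valid even though successive Monte Carlo steps are correlated.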
Practical aspects: convergence of the energy

The final result is written in the form E ± δE.

• Exact result within one standard deviation (E ± δE): probability = 68.2%
• Exact result within two standard deviations (E ± 2δE): probability = 95.4%
Practical aspects: convergence of the energy

It is important to check that:
1) The transient regime is over, i.e. the stationary regime has been reached. For that, the evolution of E as a function of the number of blocks must be examined.
2) The distribution of the blocks is indeed Gaussian.
Other practical aspects

1) Choice of the value of the time step
2) Treatment of the electron-nucleus cusp
Implementation of parallelism in QMC=Chem

Anthony Scemama, Michel Caffarel
Laboratoire de Chimie et Physique Quantiques, IRSAMC (Toulouse)
Parallelization of VMC

In VMC, all the trajectories are completely independent:

• Pack together a pool of N_walk walkers
• Cut the trajectories into smaller pieces of equal size (N_step)
• Each CPU computes a block: N_walk walkers executing N_step steps

[Figure: blocks of N_step × N_walk samples distributed over N_proc CPUs, plotted against CPU time]
Naive implementation:

• Parallelize with MPI
• At the end of each block, call MPI_all_reduce to update the running averages
• If too much memory is used, eventually add an OpenMP layer

This approach is not optimal*:

• At every synchronization, all processes wait until the slowest has finished; perfect parallel speedup is impossible to obtain
• The calculation can't start until all resources are available
• If one compute node crashes, the whole simulation crashes
• If more resources become available, it is impossible to attach more CPUs to a running calculation

* "Manager–worker-based model for the parallelization of quantum Monte Carlo on heterogeneous and homogeneous networks", M. T. Feldmann, J. C. Cummings, D. R. Kent, R. P. Muller, W. A. Goddard III, J. Comput. Chem. 29, 8–16 (2008).
Our approach†:

• Manager/worker model: all CPUs are desynchronized and start immediately
• The length of a block is not fixed: termination is immediate
• Use as little memory per core as possible (< 100 MiB / core)
• Implement a client/server model (in Python):
  • allows client nodes to crash
  • allows clients to be added/removed dynamically
• Avoid the traditional input/output file model: all data is stored in a database, and the data is post-processed
• Possibility to use computing grids (EGI‡)

† "Quantum Monte Carlo for large chemical systems: Implementing efficient strategies for petascale platforms and beyond", A. Scemama, M. Caffarel, E. Oseret and W. Jalby, J. Comput. Chem. 34, 938–951 (2013).
‡ "Large-scale quantum Monte Carlo electronic structure calculations on the EGEE grid", A. Monari, A. Scemama and M. Caffarel, Remote Instrumentation for eScience and Related Aspects, 195–207, Springer (2012).
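The key point of the desynchronized model is that blocks of *different* lengths can be folded into the running average as they arrive, with no global barrier. A minimal sketch (hypothetical code, not QMC=Chem itself) using threads in place of compute nodes:

```python
import concurrent.futures as cf
import random

# Each "worker" returns a block of its own size; the "manager" combines
# (mean, size) pairs in arrival order, weighting each block by its size.

def worker_block(seed):
    rng = random.Random(seed)
    n = rng.randint(500, 1500)                    # block length is not fixed
    samples = [rng.gauss(-109.0, 0.5) for _ in range(n)]
    return sum(samples) / n, n                    # (block mean, block size)

total, count = 0.0, 0
with cf.ThreadPoolExecutor(max_workers=4) as pool:
    futures = [pool.submit(worker_block, s) for s in range(20)]
    for fut in cf.as_completed(futures):          # arrival order, no barrier
        mean_k, n_k = fut.result()
        total += mean_k * n_k                     # weight by block size
        count += n_k
        running_average = total / count

print(f"running average after all blocks: {running_average:.3f}")
```

Because the combination is just a weighted sum, a worker that dies simply contributes nothing, and a late-joining worker contributes whenever its first block lands: exactly the fault-tolerance properties listed above.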
[Diagram: the master compute node hosts the Data Server (main worker thread, network thread, I/O thread, database) and the Manager, with a Forwarder and its Workers; each slave compute node runs a Forwarder and several Workers.]

• All I/O and network communications are non-blocking
• Worker: single-core Fortran executable piped to a forwarder
• A worker stops cleanly when it receives the SIGTERM signal
Fault tolerance

• No access to the filesystem: scripts, binary and input data are broadcast to the client nodes and stored in /dev/shm. Local disks can crash.
• Blocks have a Gaussian distribution: losing blocks doesn't change the average. Any worker can be removed.
• Every forwarder can always reach the data server. Any node can be removed.
• If the data server is lost, it is always possible to continue the calculation in another run.
Parallelization of DMC

• In the standard DMC algorithm, the walkers are no longer independent.
• Communications are expected to kill the ideal speedup.
• The Pure Diffusion MC algorithm§ allows one to obtain the DMC energy with reweighting instead of branching: no more communication.
• But PDMC introduces too many fluctuations in the total energies.
• We use an algorithm that combines branching and reweighting.¶ Small populations can be used, and multiple independent runs can be done.

§ "Development of a pure diffusion quantum Monte Carlo method using a full generalized Feynman–Kac formula", M. Caffarel, J. Chem. Phys. 88, 1088 (1988).
¶ "Diffusion Monte Carlo methods with a fixed number of walkers", R. Assaraf, M. Caffarel, A. Khelif, Phys. Rev. E 61, 4566 (2000).
Why a database?

• Input and output data are tightly linked
• An output file can be generated on demand
• Easy connection to a GUI
• An API simplifies scripting to manipulate the results
• Checkpoint/restart is trivial
• Additional quantities can be computed even after the calculation has finished; no need to re-run
• Combining results obtained in different data centers is trivial
• Multiple independent runs can write to the same database: dynamic number of CPUs
• The name of the database is an MD5 key corresponding to the critical input data
Initial conditions

• Different initial walker positions are needed
• At the end of each block, the final positions are sent to the forwarder
• Each forwarder keeps a sample of the populations of its workers
• From time to time, the forwarder sends its walkers to its parent in the tree
• The data server receives a sample of the population of all the walkers and merges it with its own population
• Periodically, the population is saved to disk
• When a new run is started, each worker gets N_walk walkers drawn randomly from this population
Termination

When the manager wants to terminate the calculation (it catches SIGTERM, or the termination condition is reached):
• It sends a termination signal to the leaves of the tree
• The leaves send a SIGTERM to their workers
• Each forwarder gets the data of the last blocks from its workers
• When the workers have terminated, the forwarder sends the data to its parent together with the termination signal
• When the data server receives the termination signal, the calculation is finished
Parallel speedup

[Figure: measured parallel speedup; estimation checked on 100 nodes / 1 hour, accuracy 0.9992]

The parallel speedup is almost ideal. Single-core optimization is crucial: every percent gained on one core is gained on the whole parallel simulation.
Easy and efficient programming with IRPF90

Anthony Scemama, Michel Caffarel
Laboratoire de Chimie et Physique Quantiques, IRSAMC (Toulouse)
Introduction

• A program is a function of its input data: output = program(input)
• A program can be represented as a tree where:
  • the vertices are the variables
  • the edges represent the relation 'depends on'
• The root of the tree is the output of the program
• The leaves are the input data
u(x, y) = x + y + 1
v(x, y) = x + y + 2
w(x)    = x + 3
t(x, y) = x + y + 4

This production tree computes t( u(d1, d2), v( u(d3, d4), w(d5) ) ).
Usual programming

  program exemple_1
    implicit none
    integer :: d1,d2,d3,d4,d5   ! Input data
    integer :: u1, u2, w, v     ! Temporary variables
    integer :: t                ! Output data

    call read_data(d1,d2,d3,d4,d5)
    call compute_u(d1,d2,u1)
    call compute_u(d3,d4,u2)
    call compute_w(d5,w)
    call compute_v(u2,w,v)
    call compute_t(u1,v,t)

    print *, 't=', t
  end program
Alternative way with functions

  program exemple_2
    implicit none
    integer :: d1,d2,d3,d4,d5   ! Input data
    integer :: u1, u2, w, v, t  ! Variables
    integer :: compute_u, compute_v, compute_w, compute_t

    call read_data(d1,d2,d3,d4,d5)
    u1 = compute_u(d1,d2)
    u2 = compute_u(d3,d4)
    w  = compute_w(d5)
    v  = compute_v(u2,w)
    t  = compute_t(u1,v)

    print *, 't=', t
  end program
Single line with functions

  program exemple_3
    implicit none
    integer :: d1,d2,d3,d4,d5   ! Input data
    integer :: u, v, w, t

    call read_data(d1,d2,d3,d4,d5)
    print *, 't=', &
      t( u(d1,d2), v( u(d3,d4), w(d5) ) )
  end program

Now the sequence of execution is handled by the compiler.
Same example with IRPF90

  program exemple_4
    implicit none
    print *, 't=', t
  end program

That's it!

• Using t triggers the exploration of the production tree
• Completely equivalent to the previous example, but the parameters of the function t are not expressed
• IRP: Implicit Reference to Parameters
Definition of the nodes of the tree

For each node we write a provider: a subroutine whose role is to build the variable and to guarantee that it is built properly.

file: uvwt.irp.f

  BEGIN_PROVIDER [ integer, t ]
    t = u1+v+4
  END_PROVIDER

  BEGIN_PROVIDER [ integer, w ]
    w = d5+3
  END_PROVIDER

  BEGIN_PROVIDER [ integer, v ]
    v = u2+w+2
  END_PROVIDER

  BEGIN_PROVIDER [ integer, u1 ]
    u1 = d1+d2+1
  END_PROVIDER

  BEGIN_PROVIDER [ integer, u2 ]
    u2 = d3+d4+1
  END_PROVIDER
file: input.irp.f

   BEGIN_PROVIDER [ integer, d1 ]
  &BEGIN_PROVIDER [ integer, d2 ]
  &BEGIN_PROVIDER [ integer, d3 ]
  &BEGIN_PROVIDER [ integer, d4 ]
  &BEGIN_PROVIDER [ integer, d5 ]
    read(*,*) d1
    read(*,*) d2
    read(*,*) d3
    read(*,*) d4
    read(*,*) d5
  END_PROVIDER
When you write a provider for x, you only have to focus on:
• How do I build x?
• What are the variables that I need to build x?
• Am I sure that x is built correctly when I exit the provider?

Using this method:
• You don't have to know the execution sequence
• If you need a variable (node), you are sure that it has been built properly when you use it
• You will never break other parts of the program
• Many people can work simultaneously on the same program with minimal effort
• If a node has already been built, it will not be built again; the provider simply returns the correct value
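The IRP idea is independent of Fortran: it is lazy, memoized evaluation over a dependence tree. A minimal Python analogy (a sketch of the concept only, not how IRPF90 is implemented) for the u/v/w/t example above:

```python
# Each "provider" declares how to build one node; the cache walks the
# dependence tree on demand and builds every node exactly once.

class IRP:
    def __init__(self):
        self._cache = {}
        self._providers = {}

    def provider(self, fn):
        self._providers[fn.__name__] = fn
        return fn

    def __getattr__(self, name):            # "using t triggers the tree"
        if name not in self._cache:
            self._cache[name] = self._providers[name](self)
        return self._cache[name]

irp = IRP()

@irp.provider
def u1(p): return p.d1 + p.d2 + 1

@irp.provider
def u2(p): return p.d3 + p.d4 + 1

@irp.provider
def w(p): return p.d5 + 3

@irp.provider
def v(p): return p.u2 + p.w + 2

@irp.provider
def t(p): return p.u1 + p.v + 4

# The input data are the leaves of the tree
irp._cache.update(d1=1, d2=2, d3=3, d4=4, d5=5)

print("t =", irp.t)   # the execution sequence is derived automatically
```

As in IRPF90, the author of each provider only states what the node needs; the order in which u1, u2, w and v are built is never written anywhere.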
Fortran code generation

• Run irpf90 in the current directory
• irpf90 reads all the *.irp.f files
• All the providers are identified
• All the corresponding variables (IRP entities) are searched for in the code
• The dependence tree is built
• Providers are transformed into subroutines (subroutine provide_*)
• Calls to provide_* are inserted in the code
• Each file *.irp.f generates a module containing the IRP entities, and a Fortran file containing the subroutines/functions
• As the dependence tree is built, the dependences between the files are known and the makefile is generated automatically
Generated code example

  ! -*- F90 -*-
  ! This file was generated with the irpf90 tool.
  ! DO NOT MODIFY IT BY HAND

  program irp_program
    call irp_example1
    call irp_finalize_742559343()
  end program

  subroutine irp_example1
    use uvwt_mod
    implicit none
    character*(12) :: irp_here = 'irp_example1'
    if (.not.t_is_built) then
      call provide_t
    endif
    print *, 't = ', t
  end

  module uvwt_mod
    integer :: u1
    logical :: u1_is_built = .False.
    integer :: u2
    logical :: u2_is_built = .False.
    integer :: t
    logical :: t_is_built = .False.
    integer :: w
    logical :: w_is_built = .False.
    integer :: v
    logical :: v_is_built = .False.
  end module uvwt_mod

  subroutine provide_u1
    use uvwt_mod
    use input_mod
    implicit none
    character*(10) :: irp_here = 'provide_u1'
    integer :: irp_err
    logical :: irp_dimensions_OK
    if (.not.d1_is_built) then
      call provide_d1
    endif
    if (.not.u1_is_built) then
      call bld_u1
      u1_is_built = .True.
    endif
  end subroutine provide_u1

  subroutine bld_u1
    use uvwt_mod
    use input_mod
    character*(2) :: irp_here = 'u1'
    u1 = d1+d2+1
  end subroutine bld_u1

  subroutine provide_t
    use uvwt_mod
    implicit none
    character*(9) :: irp_here = 'provide_t'
    integer :: irp_err
    logical :: irp_dimensions_OK
    if (.not.u1_is_built) then
      call provide_u1
    endif
    if (.not.v_is_built) then
      call provide_v
    endif
    if (.not.t_is_built) then
      call bld_t
      t_is_built = .True.
    endif
  end subroutine provide_t

  subroutine bld_t
    use uvwt_mod
    character*(1) :: irp_here = 't'
    t = u1+v+4
  end subroutine bld_t

(provide_u2/bld_u2, provide_w/bld_w and provide_v/bld_v are generated in exactly the same way)

Code execution with debug mode on:

  $ ./irp_example1
  0 : -> provide_t
  0 :   -> provide_u1
  0 :     -> provide_d1
  0 :       -> d1
  1 2 3 4 5
  0 :       <- d1
  0 :     <- provide_d1
  0 :     -> u1
  0 :     <- u1
  0 :   <- provide_u1
  0 :   -> provide_v
  0 :     -> provide_w
  0 :       -> w
  0 :       <- w
  0 :     <- provide_w
  0 :     -> provide_u2
  0 :       -> u2
  0 :       <- u2
  0 :     <- provide_u2
  0 :     -> v
  0 :     <- v
  0 :   <- provide_v
  0 :   -> t
  0 :   <- t
  0 : <- provide_t
  0 : -> irp_example1
  t = 26
  0 : <- irp_example1
Using subroutines/functions

  BEGIN_PROVIDER [ integer, u1 ]
    integer :: fu
    u1 = fu(d1,d2)
  END_PROVIDER

  BEGIN_PROVIDER [ integer, u2 ]
    integer :: fu
    u2 = fu(d3,d4)
  END_PROVIDER

  integer function fu(x,y)
    integer :: x,y
    fu = x+y+1
  end function
Providing arrays

An array is considered built when all its elements are built. Its dimensions can be provided variables, constants and intervals (a:b).

  BEGIN_PROVIDER [ integer, fact_max ]
    fact_max = 10
  END_PROVIDER

  BEGIN_PROVIDER [ double precision, fact, (0:fact_max) ]
    integer :: i
    fact(0) = 1.d0
    do i=1,fact_max
      fact(i) = fact(i-1)*dble(i)
    enddo
  END_PROVIDER

  program test
    print *, fact(5)
  end

  $ ./test
  0 : -> provide_fact
  0 :   -> provide_fact_max
  0 :     -> fact_max
  0 :     <- fact_max
  0 :   <- provide_fact_max
  0 :   -> fact
  0 :   <- fact
  0 : <- provide_fact
  0 : -> test
  120.00000000000000
  0 : <- test

The allocation behaves as follows:
• If the array is not already allocated, it is allocated
• If the array is already allocated, check whether the dimensions have changed:
  • If the dimensions have not changed, OK
  • Else deallocate the array and reallocate it with the correct dimensions
• All allocations/deallocations are checked with stat=irp_err

  subroutine provide_fact
    use fact_mod
    implicit none
    character*(12) :: irp_here = 'provide_fact'
    integer :: irp_err
    logical :: irp_dimensions_OK
    if (.not.fact_max_is_built) then
      call provide_fact_max
    endif
    if (allocated(fact)) then
      irp_dimensions_OK = .True.
      irp_dimensions_OK = irp_dimensions_OK.AND.(SIZE(fact,1)==(fact_max-(-1)))
      if (.not.irp_dimensions_OK) then
        deallocate(fact,stat=irp_err)
        if (irp_err /= 0) then
          print *, irp_here//': Deallocation failed: fact'
          print *, '  size: (0:fact_max)'
        endif
        if ((fact_max-(-1)>0)) then
          allocate(fact(0:fact_max),stat=irp_err)
          if (irp_err /= 0) then
            print *, irp_here//': Allocation failed: fact'
            print *, '  size: (0:fact_max)'
          endif
        endif
      endif
    else
      if ((fact_max-(-1)>0)) then
        allocate(fact(0:fact_max),stat=irp_err)
        if (irp_err /= 0) then
          print *, irp_here//': Allocation failed: fact'
          print *, '  size: (0:fact_max)'
        endif
      endif
    endif
    if (.not.fact_is_built) then
      call bld_fact
      fact_is_built = .True.
    endif
  end subroutine provide_fact

  subroutine bld_fact
    use fact_mod
    character*(4) :: irp_here = 'fact'
    integer :: i
    fact(0) = 1.d0
    do i=1,fact_max
      fact(i) = fact(i-1)*dble(i)
    enddo
  end subroutine bld_fact

(provide_fact_max/bld_fact_max are generated as in the previous example)
Modifying a variable outside of its provider

In iterative processes, a variable sometimes needs to be modified outside of its provider. In that case, IRPF90 has to be informed of the change with the TOUCH keyword. Example: computing a numerical derivative

  BEGIN_PROVIDER [ real, dPsi ]
    x += 0.5*delta_x
    TOUCH x
    dPsi = Psi
    x -= delta_x
    TOUCH x
    dPsi = (dPsi - Psi)/delta_x
    x += 0.5*delta_x
    SOFT_TOUCH x
  END_PROVIDER
Generated code:

  subroutine provide_dpsi
    use y_mod
    use x_mod
    implicit none
    character*(12) :: irp_here = 'provide_dpsi'
    integer :: irp_err
    logical :: irp_dimensions_OK
    if (.not.x_is_built) then
      call provide_x
    endif
    if (.not.psi_is_built) then
      call provide_psi
    endif
    if (.not.delta_x_is_built) then
      call provide_delta_x
    endif
    if (.not.dpsi_is_built) then
      call bld_dpsi
      dpsi_is_built = .True.
    endif
  end subroutine provide_dpsi

  subroutine bld_dpsi
    use y_mod
    use x_mod
    character*(4) :: irp_here = 'dpsi'
    x = x + ( 0.5*delta_x)
    ! >>> TOUCH x
    call touch_x
    ! <<< END TOUCH
    if (.not.x_is_built) then
      call provide_x
    endif
    if (.not.psi_is_built) then
      call provide_psi
    endif
    if (.not.delta_x_is_built) then
      call provide_delta_x
    endif
    dPsi = Psi
    x = x - ( delta_x)
    ! >>> TOUCH x
    call touch_x
    ! <<< END TOUCH
    if (.not.x_is_built) then
      call provide_x
    endif
    if (.not.psi_is_built) then
      call provide_psi
    endif
    if (.not.delta_x_is_built) then
      call provide_delta_x
    endif
    dPsi = (dPsi - Psi)/delta_x
    x = x + ( 0.5*delta_x)
    ! >>> TOUCH x
    call touch_x
    ! <<< END TOUCH (Soft)
  end subroutine bld_dpsi

How this works: calling touch_x marks every entity that depends on x as not built, so the next reference to Psi triggers a rebuild with the updated value of x.
Templates

When pieces of code are very similar, it is possible to use a template:

  BEGIN_TEMPLATE
  subroutine insertion_$Xsort (x,iorder,isize)
    implicit none
    $type,intent(inout)   :: x(isize)
    integer,intent(inout) :: iorder(isize)
    integer,intent(in)    :: isize
    $type :: xtmp
    integer :: i, i0, j, jmax
    do i=1,isize
      xtmp = x(i)
      i0 = iorder(i)
      j = i-1
      do j=i-1,1,-1
        if ( x(j) > xtmp ) then
          x(j+1) = x(j)
          iorder(j+1) = iorder(j)
        else
          exit
        endif
      enddo
      x(j+1) = xtmp
      iorder(j+1) = i0
    enddo
  end
  SUBST [ X, type ]
    ; real ;;
  d ; double precision ;;
  i ; integer ;;
  END_TEMPLATE

Generated code:

  subroutine insertion_sort (x,iorder,isize)
    implicit none
    character*(14) :: irp_here='insertion_sort'
    real,intent(inout)    :: x(isize)
    integer,intent(inout) :: iorder(isize)
    integer,intent(in)    :: isize
    real :: xtmp
    integer :: i, i0, j, jmax
    do i=1,isize
      xtmp = x(i)
      i0 = iorder(i)
      j = i-1
      do j=i-1,1,-1
        if ( x(j) > xtmp ) then
          x(j+1) = x(j)
          iorder(j+1) = iorder(j)
        else
          exit
        endif
      enddo
      x(j+1) = xtmp
      iorder(j+1) = i0
    enddo
  end

(insertion_dsort and insertion_isort are generated identically, with double precision and integer in place of real)
Metaprogramming

Shell scripts can be inserted in the IRPF90 code, and the output of the script will be inserted in the generated Fortran. For example:

  program test
  BEGIN_SHELL [ /bin/bash ]
  echo print *, \'Compiled by $(whoami) on $(date)\'
  END_SHELL
  end

Generated code:

  ! -*- F90 -*-
  !
  ! This file was generated with the irpf90 tool.
  ! DO NOT MODIFY IT BY HAND
  !
  program irp_program
   call test
   call irp_finalize_491024427()
  end program
  subroutine test
   character*(4) :: irp_here = 'test'
   print *, 'Compiled by scemama on Mon Jul 8 11:28:16 CEST 2013'
  end

Example: Computing powers of x

  BEGIN_SHELL [ /usr/bin/python ]
  POWER_MAX = 20

  def compute_x_prod(n,d):
    if n == 0:
      d[0] = None
      return d
    if n == 1:
      d[1] = None
      return d
    if n in d:
      return d
    m = n/2
    d = compute_x_prod(m,d)
    d[n] = None
    d[2*m] = None
    return d

  def print_subroutine(n):
    keys = compute_x_prod(n,{}).keys()
    keys.sort()
    output = []
    print "real function power_%d(x1)"%n
    print " real, intent(in) :: x1"
    for i in range(1,len(keys)):
      output.append( "x%d"%keys[i] )
    if output != []:
      print " real :: "+', '.join(output)
    for i in range(1,len(keys)):
      ki = keys[i]
      ki1 = keys[i-1]
      if ki == 2*ki1:
        print " x%d"%ki + " = x%d * x%d"%(ki1,ki1)
      else:
        print " x%d"%ki + " = x%d * x1"%(ki1)
    print " power_%d = x%d"%(n,n)
    print "end"

  for i in range(POWER_MAX):
    print_subroutine (i+1)
    print ''
  END_SHELL
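The addition-chain logic of `compute_x_prod` can be checked in isolation with a Python 3 port (the script in the slide is Python 2, so `/` becomes `//` and `print` becomes a function; `power_chain` and `evaluate` are helper names added here for illustration):

```python
POWER_MAX = 20

def compute_x_prod(n, d):
    # Collect the intermediate powers needed to build x**n by
    # repeated squaring, plus one extra multiply when n is odd.
    if n == 0:
        d[0] = None
        return d
    if n == 1:
        d[1] = None
        return d
    if n in d:
        return d
    m = n // 2              # integer division (Python 2's n/2)
    d = compute_x_prod(m, d)
    d[n] = None             # x**n itself
    d[2 * m] = None         # x**(2m), the square of x**m
    return d

def power_chain(n):
    """Ordered list of powers that the Fortran power_<n> will compute."""
    return sorted(compute_x_prod(n, {}).keys())

def evaluate(n, x):
    # Replay the generated multiplication sequence: each new power is
    # either the square of the previous one or previous * x.
    keys = power_chain(n)
    val = {1: x}
    for i in range(1, len(keys)):
        ki, ki1 = keys[i], keys[i - 1]
        val[ki] = val[ki1] * val[ki1] if ki == 2 * ki1 else val[ki1] * x
    return val[n]
```

Replaying the chain verifies that every generated `power_n` really computes x**n with at most one multiplication per intermediate power.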
Generated code (the `! power.irp.f_shell_44: <line>` provenance comments are stripped):

  ! -*- F90 -*-
  !
  ! This file was generated with the irpf90 tool.
  ! DO NOT MODIFY IT BY HAND
  !
  real function power_1(x1)
   character*(7) :: irp_here = 'power_1'
   real, intent(in) :: x1
   power_1 = x1
  end
  real function power_2(x1)
   character*(7) :: irp_here = 'power_2'
   real, intent(in) :: x1
   real :: x2
   x2 = x1 * x1
   power_2 = x2
  end
  real function power_3(x1)
   character*(7) :: irp_here = 'power_3'
   real, intent(in) :: x1
   real :: x2, x3
   x2 = x1 * x1
   x3 = x2 * x1
   power_3 = x3
  end
  real function power_4(x1)
   character*(7) :: irp_here = 'power_4'
   real, intent(in) :: x1
   real :: x2, x4
   x2 = x1 * x1
   x4 = x2 * x2
   power_4 = x4
  end
  real function power_5(x1)
   character*(7) :: irp_here = 'power_5'
   real, intent(in) :: x1
   real :: x2, x4, x5
   x2 = x1 * x1
   x4 = x2 * x2
   x5 = x4 * x1
   power_5 = x5
  end
  real function power_6(x1)
   character*(7) :: irp_here = 'power_6'
   real, intent(in) :: x1
   real :: x2, x3, x6
   x2 = x1 * x1
   x3 = x2 * x1
   x6 = x3 * x3
   power_6 = x6
  end
  real function power_7(x1)
   character*(7) :: irp_here = 'power_7'
   real, intent(in) :: x1
   real :: x2, x3, x6, x7
   x2 = x1 * x1
   x3 = x2 * x1
   x6 = x3 * x3
   x7 = x6 * x1
   power_7 = x7
  end
  real function power_8(x1)
   character*(7) :: irp_here = 'power_8'
   real, intent(in) :: x1
   real :: x2, x4, x8
   x2 = x1 * x1
   x4 = x2 * x2
   x8 = x4 * x4
   power_8 = x8
  end

power_9 through power_19 follow the same pattern (each one squares intermediate powers and multiplies by x1 once for odd exponents), down to:

  real function power_20(x1)
   character*(8) :: irp_here = 'power_20'
   real, intent(in) :: x1
   real :: x2, x4, x5, x10, x20
   x2 = x1 * x1
   x4 = x2 * x2
   x5 = x4 * x1
   x10 = x5 * x5
   x20 = x10 * x10
   power_20 = x20
  end
IRPF90 for HPC

Using the --align option, IRPF90 can introduce compiler directives for the Intel Fortran compiler, such that all the arrays are aligned. The $IRP_ALIGN variable corresponds to this alignment. For example,

  integer function align_double(i)
    integer, intent(in) :: i
    integer :: j
    j = mod(i,max($IRP_ALIGN,4)/4)
    if (j==0) then
      align_double = i
    else
      align_double = i+4-j
    endif
  end
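The intent of this kind of padding can be sketched with a small Python helper (a hypothetical illustration, not IRPF90's own code): round a leading dimension up so that each column of an array starts on an aligned boundary.

```python
def aligned_leading_dim(n, alignment_bytes=32, element_bytes=8):
    # Number of elements per aligned chunk, e.g. 32 bytes /
    # 8-byte double precision = 4 elements per chunk.
    chunk = max(alignment_bytes // element_bytes, 1)
    r = n % chunk
    # Round n up to the next multiple of the chunk size.
    return n if r == 0 else n + chunk - r
```

With the defaults this reproduces the behaviour seen in the example below: a 19-element leading dimension of double-precision values is padded to 20.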
  BEGIN_PROVIDER [ integer, n ]
 &BEGIN_PROVIDER [ integer, n_aligned ]
    integer :: align_double
    n = 19
    n_aligned = align_double(19)
  END_PROVIDER

  BEGIN_PROVIDER [ double precision, Matrix, (n_aligned,n) ]
    Matrix = 0.d0
  END_PROVIDER

  program test
    print *, size(Matrix,1), size(Matrix,2)
  end

Generated code without alignment (provenance comments stripped):

  ! -*- F90 -*-
  module matrix_mod
    double precision, allocatable :: matrix (:,:)
    logical :: matrix_is_built = .False.
    integer :: n_aligned
    integer :: n
    logical :: n_is_built = .False.
  end module matrix_mod

  subroutine provide_matrix
    use matrix_mod
    implicit none
    character*(14) :: irp_here = 'provide_matrix'
    integer :: irp_err
    logical :: irp_dimensions_OK
    if (.not.n_is_built) then
      call provide_n
    endif
    if (allocated (matrix) ) then
      irp_dimensions_OK = .True.
      irp_dimensions_OK = irp_dimensions_OK.AND.(SIZE(matrix,1)==(n_aligned))
      irp_dimensions_OK = irp_dimensions_OK.AND.(SIZE(matrix,2)==(n))
      if (.not.irp_dimensions_OK) then
        deallocate(matrix,stat=irp_err)
        if (irp_err /= 0) then
          print *, irp_here//': Deallocation failed: matrix'
          print *, ' size: (n_aligned,n)'
        endif
        if ((n_aligned>0).and.(n>0)) then
          allocate(matrix(n_aligned,n),stat=irp_err)
          if (irp_err /= 0) then
            print *, irp_here//': Allocation failed: matrix'
            print *, ' size: (n_aligned,n)'
          endif
        endif
      endif
    else
      if ((n_aligned>0).and.(n>0)) then
        allocate(matrix(n_aligned,n),stat=irp_err)
        if (irp_err /= 0) then
          print *, irp_here//': Allocation failed: matrix'
          print *, ' size: (n_aligned,n)'
        endif
      endif
    endif
    if (.not.matrix_is_built) then
      call bld_matrix
      matrix_is_built = .True.
    endif
  end subroutine provide_matrix

  subroutine bld_matrix
    use matrix_mod
    character*(6) :: irp_here = 'matrix'
    Matrix = 0.d0
  end subroutine bld_matrix

  subroutine provide_n
    use matrix_mod
    implicit none
    character*(9) :: irp_here = 'provide_n'
    integer :: irp_err
    logical :: irp_dimensions_OK
    if (.not.n_is_built) then
      call bld_n
      n_is_built = .True.
    endif
  end subroutine provide_n

  subroutine bld_n
    use matrix_mod
    character*(1) :: irp_here = 'n'
    integer :: align_double
    n = 19
    n_aligned = align_double(19)
  end subroutine bld_n

  integer function align_double(i)
    character*(12) :: irp_here = 'align_double'
    integer, intent(in) :: i
    integer :: j
    j = mod(i,max(1,4)/4)
    if (j==0) then
      align_double = i
    else
      align_double = i+4-j
    endif
  end

Output:

  $ ./test
          19          19
Generated code with an alignment of 32 bytes (only the differences with the unaligned version are shown):

  module matrix_mod
    double precision, allocatable :: matrix (:,:)
    !DIR$ ATTRIBUTES ALIGN: 32 :: matrix
    logical :: matrix_is_built = .False.
    integer :: n_aligned
    integer :: n
    logical :: n_is_built = .False.
  end module matrix_mod

The subroutines provide_matrix, bld_matrix, provide_n and bld_n are identical; in align_double, $IRP_ALIGN is now replaced by 32:

  integer function align_double(i)
    character*(12) :: irp_here = 'align_double'
    integer, intent(in) :: i
    integer :: j
    j = mod(i,max(32,4)/4)
    if (j==0) then
      align_double = i
    else
      align_double = i+4-j
    endif
  end

Output:

  $ ./test
          20          19
To remove all the compiler directives introduced by the programmer, it is possible to use irpf90 --nodirectives.
More about IRPF90
• ArXiv: http://arxiv.org/abs/0909.5012
• Web site: http://irpf90.ups-tlse.fr
Single core optimization in QMC=Chem
Anthony Scemama, Michel Caffarel
Laboratoire de Chimie et Physique Quantiques, IRSAMC (Toulouse)
Hardware considerations

Intel(R) Xeon(R) CPU E3-1220 @ 3.10GHz (3.4 GHz turbo, 8 MiB shared L3, 256 KiB L2, 32 KiB L1)
• The AVX instruction set performs vector operations on 256 bits: 8 single precision (SP) elements or 4 double precision (DP) elements
• The vector ADD and MUL operations have a throughput of 1 per cycle (pipelining)
• One vector ADD, one vector MUL and one integer ADD (loop counter) can be performed simultaneously
• Therefore, the peak performance of an Intel Sandy/Ivy Bridge core is 16 floating point operations (flops) per cycle (SP) or 8 flops/cycle (DP)
• One E3-1220 core has a peak performance of 54.4 Gflops/s (SP) or 27.2 Gflops/s (DP)
• In the peak regime, one flop takes on average 0.018 ns (SP) or 0.037 ns (DP)
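The peak numbers above follow directly from clock frequency × flops per cycle; a quick arithmetic check:

```python
# Peak performance of one Sandy/Ivy Bridge core, from the numbers above.
turbo_ghz = 3.4          # E3-1220 turbo clock
sp_flops_per_cycle = 16  # one 8-wide SP ADD + one 8-wide SP MUL per cycle
dp_flops_per_cycle = 8   # one 4-wide DP ADD + one 4-wide DP MUL per cycle

peak_sp = turbo_ghz * sp_flops_per_cycle  # Gflops/s, single precision
peak_dp = turbo_ghz * dp_flops_per_cycle  # Gflops/s, double precision
ns_per_flop_sp = 1.0 / peak_sp            # ns per flop at SP peak
ns_per_flop_dp = 1.0 / peak_dp            # ns per flop at DP peak
```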
On modern architectures, reducing the number of flops does not systematically reduce the execution time: memory access is much more critical. Understanding how the data arrives at the CPU helps to write efficient code.*

* "What Every Programmer Should Know About Memory", U. Drepper (2007), http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.91.957
Measures obtained with LMbench† (1 cycle = 0.29 ns, 1 peak flop SP = 0.018 ns)

  Integer (ns)          bit     ADD     MUL     DIV     MOD
  32 bit                0.3     0.04    0.9      6.7     7.7
  64 bit                0.3     0.04    0.9     13.2    12.9

  Floating Point (ns)           ADD     MUL     DIV
  32 bit                        0.9     1.5     4.4
  64 bit                        0.9     1.5     6.8

  Data read (ns)        Random          Prefetched
  L1                     1.18           1.18
  L2                     3.5            1.6
  L3                    13              1.7
  Memory                75-80

† http://www.bitmover.com/lmbench/
[Figure: memory read latency in nanoseconds (0 to 85) as a function of the data set size in KiB (10 to 1e+06, log scale), for random reads and for strided reads with strides of 16, 128 and 256.]
• Random access to memory is very slow: 79 ns = 270 CPU cycles = 4300 flops (peak SP)
• Strided access to memory with a stride smaller than 4096 bytes (1 page) triggers the hardware prefetchers, reducing the memory latencies. Smaller strides are better, and give latencies comparable to L2 latencies.
• Recomputing data may be faster than fetching it randomly in memory.

Other important numbers:
  Mutex lock/unlock        ~100 ns
  Infiniband               ~1 200 ns
  Ethernet                 ~50 000 ns
  Disk seek (SSD)          ~50 000 ns
  Disk seek (15k rpm)      ~2 000 000 ns
Example : squared distance matrix

  do j=1,n
    do i=1,j
      dist1(i,j) = X(i,1)*X(j,1) + X(i,2)*X(j,2) + X(i,3)*X(j,3)
    end do
  end do
  do j=1,n
    do i=j+1,n
      dist1(i,j) = dist1(j,i)
    end do
  end do

  t( n=133 )  = 13.0 µs, 3.0 GFlops/s
  t( n=4125 ) = 95.4 ms, 0.44 GFlops/s
  do j=1,n
    do i=1,j
      dist2(i,j) = X(i,1)*X(j,1) + X(i,2)*X(j,2) + X(i,3)*X(j,3)
      dist2(j,i) = dist2(i,j)
    end do
  end do

  t( n=133 )  = 11.5 µs : 1.13x speedup, 3.5 GFlops/s
  t( n=4125 ) = 90.4 ms : 1.05x speedup, 0.47 GFlops/s

  do j=1,n
    do i=1,n
      dist3(i,j) = X(i,1)*X(j,1) + X(i,2)*X(j,2) + X(i,3)*X(j,3)
    end do
  end do

2x more flops!

  t( n=133 )  = 10.3 µs : 1.12x speedup, 8.2 GFlops/s
  t( n=4125 ) = 15.7 ms : 5.75x speedup, 5.4 GFlops/s
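The variants trade flops for a regular, vectorizable access pattern, but they must all produce the same matrix. A small pure-Python check (the timings above are for the Fortran versions; this only verifies the equivalence):

```python
import random

random.seed(0)
n = 7
X = [[random.random() for _ in range(3)] for _ in range(n)]

def entry(i, j):
    # Same expression as the Fortran loop body.
    return sum(X[i][k] * X[j][k] for k in range(3))

# dist1: fill the upper triangle, then mirror it (first variant).
dist1 = [[0.0] * n for _ in range(n)]
for j in range(n):
    for i in range(j + 1):
        dist1[i][j] = entry(i, j)
for j in range(n):
    for i in range(j + 1, n):
        dist1[i][j] = dist1[j][i]

# dist3: fill the full matrix directly (2x the flops, regular pattern).
dist3 = [[entry(i, j) for j in range(n)] for i in range(n)]
```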
Vector operations

AVX single instruction / multiple data (SIMD) instructions operate on 256-bit floating point registers.

Example : vector ADD in double precision. [Figure in the original slides.]

Requirements:
• The elements of each SIMD vector must be contiguous in memory
• The first element of each SIMD vector must be aligned on a 32-byte boundary
Automatic vectorization

The compiler can automatically generate vector instructions when possible. An auto-vectorized loop generates 3 loops:
• Peel loop (scalar) : first elements, until the 32-byte boundary is met
• Vector loop : vectorized version, until the last full vector of 4 elements
• Tail loop (scalar) : last elements
Intel-specific compiler directives

To remove the peel loop, you can tell the compiler to align the arrays on a 32-byte boundary using:

  double precision, allocatable :: A(:), B(:)
  !DIR$ ATTRIBUTES ALIGN : 32 :: A, B

Then, before using the arrays in a loop, you can tell the compiler that the arrays are aligned. Be careful: if one array is not aligned, this may cause a segmentation fault.

  !DIR$ VECTOR ALIGNED
  do i=1,n
    A(i) = A(i) + B(i)
  end do

To remove the tail loop, you can allocate A such that its dimension is a multiple of 4 elements:
  n_4 = mod(n,4)
  if (n_4 == 0) then
    n_4 = n
  else
    n_4 = n - n_4 + 4
  endif
  allocate ( A(n_4), B(n_4) )

and rewrite the loop as follows:

  do i=1,n,4
    !DIR$ VECTOR ALIGNED
    !DIR$ VECTOR ALWAYS
    do k=0,3
      A(i+k) = A(i+k) + B(i+k)
    end do
  end do
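The n_4 computation simply rounds n up to the next multiple of 4 so that the chunked loop never runs past the data; a sketch of the same idea in Python (`padded_sum` is a hypothetical helper for illustration):

```python
def padded_sum(a):
    # Pad a copy of `a` to a multiple of 4 elements, then process it
    # in fixed-size 4-element chunks, like the Fortran loop over A, B.
    n = len(a)
    n_4 = n if n % 4 == 0 else n - n % 4 + 4   # same formula as n_4
    buf = a + [0.0] * (n_4 - n)                # padding elements are 0
    total = 0.0
    for i in range(0, n_4, 4):
        for k in range(4):   # fixed-size body: no scalar tail needed
            total += buf[i + k]
    return total
```

Because every chunk is full, the compiler-level analogue of the inner loop needs no peel or tail code.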
In that case, the compiler knows that each innermost loop cycle can be transformed safely into only vector instructions, and it will not produce the tail and peel loops with the branching. For small arrays, the gain can be significant.

For multidimensional arrays, if the 1st dimension is a multiple of 4 elements, all the columns are aligned:

  double precision, allocatable :: A(:,:)
  !DIR$ ATTRIBUTES ALIGN : 32 :: A
  allocate( A(n_4,m) )
  do j=1,m
    do i=1,n,4
      !DIR$ VECTOR ALIGNED
      !DIR$ VECTOR ALWAYS
      do k=0,3
        A(i+k,j) = A(i+k,j) * B(i+k,j)
      end do
    end do
  end do
Warning : In practice, using multiples of 4 elements is not always the best choice. Using multiples of 8 or 16 elements can be better because the innermost loop may be unrolled by the compiler to improve the efficiency of the pipeline.
Example : squared distance matrix

  do j=1,n
    do i=1,n,8
      !DIR$ VECTOR ALIGNED
      !DIR$ VECTOR ALWAYS
      do k=0,7
        dist4(i+k,j) = X(i+k,1)*X(j,1) + X(i+k,2)*X(j,2) + X(i+k,3)*X(j,3)
      end do
    end do
  end do

  t( n=133 )  = 7.2 µs : 1.44x speedup, 12.1 GFlops/s
  t( n=4125 ) = 15.5 ms : 1.01x speedup, 7.5 GFlops/s
Hot spots of QMC algorithms

At every Monte Carlo step, the following quantities have to be computed:
• the Slater matrices (and their inverses)
• the gradients and the Laplacian of the trial wave function
• the local energy
[The defining formulas appeared as images in the original slides.]
Calculation of the Slater matrices

The Slater matrices have to be computed, as well as their gradients and Laplacians. It is therefore necessary to compute the values, gradients and Laplacians of the Molecular Orbitals (MOs) at the electron positions.

• C is the matrix of MO coefficients (constant)
• A1 : MO values ; B1 : AO values
• A2, A3, A4 : MO gradients (x,y,z) ; B2, B3, B4 : AO gradients (x,y,z)
• A5 : MO Laplacians ; B5 : AO Laplacians

We need to compute Ai = C x Bi efficiently:
• Single precision is sufficient
• AOs are not orthonormal and are centered on the nuclei
• All Bi have null elements where |r - Ri|^2 is large : only the non-zero elements are computed
• All Bi are sparse, with non-zero elements at the same indices
• C is constant and dense
• The size of Ai is small (~ Nelec / 2)
• We have implemented a very efficient dense x sparse matrix product for small matrices
Dense Matrix x Sparse Vector Product

To improve cache locality and reduce memory, we:
• compute one column of all Bi and store them sparse
• make the product of C with these vectors and store all Ai

The sparse vectors are represented as: [figure in the original slides: an array of the non-zero values and an array of their column indices, with the number of non-zero elements stored in indices(0)].
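A minimal sketch of the dense-matrix × sparse-vector product in pure Python (`dense_times_sparse` is an illustrative name; `indices[0]` holds the number of non-zeros, as in the kernel on the following pages):

```python
def dense_times_sparse(C, B_vals, indices):
    # C: dense LDA x M matrix (list of rows).
    # B_vals[k] and indices[k], for k = 1 .. indices[0], hold the
    # non-zero values of the sparse vector B and their column indices
    # (slot 0 of `indices` stores the number of non-zeros).
    LDA = len(C)
    A = [0.0] * LDA
    for k in range(1, indices[0] + 1):
        col, d = indices[k], B_vals[k]
        for j in range(LDA):      # one full, contiguous column of C
            A[j] += C[j][col] * d
    return A
```

Accumulating one dense column of C per sparse entry is what gives the cache-friendly access pattern described above.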
In QMC=Chem, all arrays are aligned on a 32-byte boundary by IRPF90. The leading dimension is always a multiple of 8 elements.

  ! $IRP_ALIGN = 32
  ! $IRP_ALIGN/4-1 = 7

  ! Initialize output vectors
  !
  !DIR$ VECTOR ALIGNED
  do j=1,LDA,max(1,$IRP_ALIGN/4)
    !DIR$ VECTOR ALIGNED
    A1(j:j+$IRP_ALIGN/4-1) = 0.
    !DIR$ VECTOR ALIGNED
    A2(j:j+$IRP_ALIGN/4-1) = 0.
    !DIR$ VECTOR ALIGNED
    A3(j:j+$IRP_ALIGN/4-1) = 0.
    !DIR$ VECTOR ALIGNED
    A4(j:j+$IRP_ALIGN/4-1) = 0.
    !DIR$ VECTOR ALIGNED
    A5(j:j+$IRP_ALIGN/4-1) = 0.
  enddo

  ! Unroll and jam x 4
  !
  kmax2 = indices(0)-mod(indices(0),4)
  do kao=1,kmax2,4

    ! Fetch column indices
    !
    k_vec(1) = indices(kao  )
    k_vec(2) = indices(kao+1)
    k_vec(3) = indices(kao+2)
    k_vec(4) = indices(kao+3)

    ! Fetch column factors (1,2)
    !
    d11 = B1(kao  )
    d21 = B1(kao+1)
    d31 = B1(kao+2)
    d41 = B1(kao+3)
    d12 = B2(kao  )
    d22 = B2(kao+1)
    d32 = B2(kao+2)
    d42 = B2(kao+3)

    ! A += C x B (1,2)
    !
    !DIR$ VECTOR ALIGNED
    !DIR$ LOOP COUNT (256)
    do j=1,LDA,max(1,$IRP_ALIGN/4)
      !DIR$ VECTOR ALIGNED
      A1(j:j+$IRP_ALIGN/4-1) = A1(j:j+$IRP_ALIGN/4-1) + &
          C(j:j+$IRP_ALIGN/4-1,k_vec(1))*d11 + &
          C(j:j+$IRP_ALIGN/4-1,k_vec(2))*d21 + &
          C(j:j+$IRP_ALIGN/4-1,k_vec(3))*d31 + &
          C(j:j+$IRP_ALIGN/4-1,k_vec(4))*d41
      !DIR$ VECTOR ALIGNED
      A2(j:j+$IRP_ALIGN/4-1) = A2(j:j+$IRP_ALIGN/4-1) + &
          C(j:j+$IRP_ALIGN/4-1,k_vec(1))*d12 + &
          C(j:j+$IRP_ALIGN/4-1,k_vec(2))*d22 + &
          C(j:j+$IRP_ALIGN/4-1,k_vec(3))*d32 + &
          C(j:j+$IRP_ALIGN/4-1,k_vec(4))*d42
    enddo

    ! Fetch column factors (3,4)
    !
    d13 = B3(kao  )
    d23 = B3(kao+1)
    d33 = B3(kao+2)
    d43 = B3(kao+3)
    d14 = B4(kao  )
    d24 = B4(kao+1)
    d34 = B4(kao+2)
    d44 = B4(kao+3)

    ! A += C x B (3,4)
    !
    !DIR$ VECTOR ALIGNED
    do j=1,LDA,max(1,$IRP_ALIGN/4)
      !DIR$ VECTOR ALIGNED
      A3(j:j+$IRP_ALIGN/4-1) = A3(j:j+$IRP_ALIGN/4-1) + &
          C(j:j+$IRP_ALIGN/4-1,k_vec(1))*d13 + &
          C(j:j+$IRP_ALIGN/4-1,k_vec(2))*d23 + &
          C(j:j+$IRP_ALIGN/4-1,k_vec(3))*d33 + &
          C(j:j+$IRP_ALIGN/4-1,k_vec(4))*d43
      !DIR$ VECTOR ALIGNED
      A4(j:j+$IRP_ALIGN/4-1) = A4(j:j+$IRP_ALIGN/4-1) + &
          C(j:j+$IRP_ALIGN/4-1,k_vec(1))*d14 + &
          C(j:j+$IRP_ALIGN/4-1,k_vec(2))*d24 + &
          C(j:j+$IRP_ALIGN/4-1,k_vec(3))*d34 + &
          C(j:j+$IRP_ALIGN/4-1,k_vec(4))*d44
    enddo

    ! Fetch column factors (5)
    !
    d15 = B5(kao  )
    d25 = B5(kao+1)
    d35 = B5(kao+2)
    d45 = B5(kao+3)

    ! A += C x B (5), unrolled 2x by compiler
    !
    !DIR$ VECTOR ALIGNED
    do j=1,LDA,max(1,$IRP_ALIGN/4)   ! Unroll 2 times
      !DIR$ VECTOR ALIGNED
      A5(j:j+$IRP_ALIGN/4-1) = A5(j:j+$IRP_ALIGN/4-1) + &
          C(j:j+$IRP_ALIGN/4-1,k_vec(1))*d15 + &
          C(j:j+$IRP_ALIGN/4-1,k_vec(2))*d25 + &
          C(j:j+$IRP_ALIGN/4-1,k_vec(3))*d35 + &
          C(j:j+$IRP_ALIGN/4-1,k_vec(4))*d45
    enddo

  enddo

  ! Tail loop of outer loop
  !
  do kao = kmax2+1, indices(0)

    ! Fetch column index
    !
    k_vec(1) = indices(kao)

    ! Fetch column factors (1-5)
    !
    d11 = B1(kao)
    d12 = B2(kao)
    d13 = B3(kao)
    d14 = B4(kao)
    d15 = B5(kao)

    ! A += C x B (1-5)
    !
    !DIR$ VECTOR ALIGNED
    do j=1,LDA,max(1,$IRP_ALIGN/4)
      !DIR$ VECTOR ALIGNED
      A1(j:j+$IRP_ALIGN/4-1) = A1(j:j+$IRP_ALIGN/4-1) + C(j:j+$IRP_ALIGN/4-1,k_vec(1))*d11
      !DIR$ VECTOR ALIGNED
      A2(j:j+$IRP_ALIGN/4-1) = A2(j:j+$IRP_ALIGN/4-1) + C(j:j+$IRP_ALIGN/4-1,k_vec(1))*d12
      !DIR$ VECTOR ALIGNED
      A3(j:j+$IRP_ALIGN/4-1) = A3(j:j+$IRP_ALIGN/4-1) + C(j:j+$IRP_ALIGN/4-1,k_vec(1))*d13
      !DIR$ VECTOR ALIGNED
      A4(j:j+$IRP_ALIGN/4-1) = A4(j:j+$IRP_ALIGN/4-1) + C(j:j+$IRP_ALIGN/4-1,k_vec(1))*d14
      !DIR$ VECTOR ALIGNED
      A5(j:j+$IRP_ALIGN/4-1) = A5(j:j+$IRP_ALIGN/4-1) + C(j:j+$IRP_ALIGN/4-1,k_vec(1))*d15
    enddo
  enddo

Innermost loops:
• Perfect ADD/MUL balance
• Do not saturate the load/store units
• Only vector operations, with no peel/tail loops
• Use 15 AVX registers; no register spilling
• If all data fits in L1, 100% of the peak is reached (16 flops/cycle)
• In practice the kernel is memory bound, so 50-60% of the peak is measured
Other QMC codes use 3D splines to avoid the computation of the AOs and the matrix products, but:
• To do the 3D interpolation, 8 values are needed (the corners of a cube)
• This represents 4 random memory accesses: ~320 nanoseconds, plus the time to compute the interpolation
• In 360 nanoseconds, we can do ~12 000 flops
• The average computation time of 1 element with our matrix product is proportional to the number of non-zero elements in Bi (< 500 flops)
• We have shown that our implementation (calculation of the Ai and matrix products) is faster than the interpolation by factors of 1.0x to 1.5x
• 3D splines have to be precomputed on a grid: this takes initialization time
• 3D splines need many GiB of RAM, so only small systems can be handled, and OpenMP parallelism is often required
Inverse Slater matrices

To compute the drift velocity (∇Ψ/Ψ) and the local energy (HΨ/Ψ), one needs the inverses of the Slater matrices:
• the gradient of a determinant D needs D⁻¹
• the Laplacian of a determinant D needs D⁻¹

A unique list of alpha and beta Slater matrices is generated, and the results are combined at the end to produce the alpha x beta determinant products. To give accurate results, double precision is required.

For each spin, the first inverse Slater matrix is fully calculated:
• up to 5x5 : hand-written O(N!) algorithm (5! < 5^3)
• larger than 5x5 : MKL library (dgetrf, dgetri)

The next determinants are calculated using the Sherman-Morrison-Woodbury formula: if only one column differs, the new inverse can be computed in O(N^2).
A CASSCF wave function with 10 000 determinant products has 100 unique alpha and 100 unique beta determinants. One of those will be computed in O(N^3), and all the others in O(N^2). For a 40-electron system (20 alpha, 20 beta), computing 10 000 determinants is therefore only ~6x longer than the single-determinant calculation.
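The single-column Sherman-Morrison update can be sketched in pure Python (small lists instead of the double-precision MKL matrices; `sm_column_update` is an illustrative name, not QMC=Chem's routine): given B = A⁻¹ and a new column u replacing column j of A, the updated inverse costs O(N²).

```python
def sm_column_update(B, u, j):
    # B = inverse of A (list of rows).  Column j of A is replaced by
    # the vector u; return the inverse of the updated matrix using
    # the Sherman-Morrison formula: A' = A + (u - A e_j) e_j^T.
    n = len(B)
    # w = B u ; since B (A e_j) = e_j, we have B (u - A e_j) = w - e_j.
    w = [sum(B[i][k] * u[k] for k in range(n)) for i in range(n)]
    denom = w[j]            # 1 + e_j^T B (u - A e_j) = w_j
    v = list(w)
    v[j] -= 1.0             # v = B (u - A e_j)
    # B' = B - v (row j of B) / denom, a rank-1 O(N^2) correction.
    return [[B[i][k] - v[i] * B[j][k] / denom for k in range(n)]
            for i in range(n)]
```

When `denom` (the determinant ratio of the new and old matrices) is close to zero, the update is numerically unstable and the inverse should be recomputed from scratch.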
Tutorial: Quantum Monte Carlo with QMC=Chem

In this tutorial, we will study the dissociation energy of N2 using a Hartree-Fock (HF) trial wave function and a complete active space (CASSCF) trial wave function. The HF wave function for dissociated N2 is computed within restricted open-shell HF with a spin multiplicity of 7. These wave functions were prepared using the GAMESS program. The dissociation energy is evaluated by calculating the energy difference of N2 at R = 1.1 Å and at R = 4 Å, where R is the interatomic distance.

In your directory, you should have:

  $ ls
  1.1.CAS/     1.1.HF/             1.1.HF.1core.sub  4.CAS/     4.HF/
  1.1.CAS.out  1.1.HF.160core.sub  1.1.HF.out        4.CAS.out  4.HF.out

• The *.out files are the GAMESS output files
• The *.sub files are the files needed to submit a job
• The directories are the corresponding QMC=Chem EZFIO database files
Running a VMC calculation

Single core run

To access the input data, run

  $ qmcchem_edit.py 1.1.HF

This command will open a temporary file containing the different parameters of the simulation. Modify them as follows:

  # Simulation #
  end_condition = "wall_time > 300"
  jastrow       = False
  method        = "VMC"
  nucl_fitcusp  = True
  num_step      = 10000
  sampling      = "Langevin"
  time_step     = 0.2
  title         = "HF, 1.1 angstroms"
  walk_num      = 20

end_condition : Stopping condition of the run. 5 minutes is fine.
jastrow : If true, use a Jastrow factor to improve the trial wave function. In this tutorial, we will not use it.
method : VMC = Variational Monte Carlo.
nucl_fitcusp : Impose the correct electron-nucleus cusp at the nuclei to avoid the divergence of the energy there. This doesn't change the energy but considerably reduces its variance.
num_step : Number of steps per block. This is usually adjusted such that the time spent to compute one block is neither too small nor too large. Typically, for very short runs 20 seconds is OK, and for usual production runs this parameter is adjusted to 10 minutes per block.
sampling : The Monte Carlo sampling algorithm. Langevin is the best for VMC.
time_step : Simulation time step. With the Langevin algorithm, 0.2 is usually a good choice.
title : Title of the run. You can put whatever you want.
walk_num : Number of walkers (independent trajectories in VMC).

When you save the file and exit the text editor, the EZFIO database is updated. You can now run the QMC calculation. First, run a single-core run:

  $ ccc_msub -A 1.1.HF.1core.sub

In QMC=Chem, there is no output file. At any time, you can see what has been computed by running:

  $ qmcchem_result.py -s 1.1.HF

The output of this command should look like this:

  # Summary #
  Number of blocks          : 26
  Number of blocks per core : 26
  Total CPU time            : 0:04:51
  CPU time / block          : 11.192(79)
  Acceptance rate           : 0.92348(10)

  e_loc                     : -108.9842(52)
  Variance of e_loc         : 25.75(30)
  Min of e_loc              : -351.467898466
  Max of e_loc              : 503.693209743

Number of blocks : Total number of blocks in the database.
Number of blocks per core : Each CPU core has individually made this number of blocks.
Total CPU time : Sum of the CPU times of all the cores.
CPU time / block : Average CPU time per block.
Acceptance rate : Average Metropolis acceptance rate.
e_loc : Average of the local energy.
Variance of e_loc : Variance (σ²) of the local energy.
Min/Max of e_loc : Minimum or maximum value of the local energy encountered during the simulation.

At the end of the run, check that the average of the local energy corresponds to the Hartree-Fock energy given by GAMESS (within the error bars). You can plot the convergence of the local energy using:

$ qmcchem_result.py -p E_loc 1.1.HF
The blue curve is the cumulative average of the local energy from the first block to the last. The red curve is the same cumulative average computed from the last block to the first. If the calculation is converged, the blocks are independent of each other and the shape of the curve should not depend on the order in which the blocks are taken. If the blue and the red convergence plots are not compatible, the QMC run is not converged.
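The forward/backward check described above can be sketched in a few lines (illustrative Python, not part of QMC=Chem; `blocks` stands for a list of per-block averages such as those stored in the EZFIO database):

```python
def cumulative_averages(blocks):
    """Running averages of the block values, taken first-to-last (the
    'blue' curve) and last-to-first (the 'red' curve). For a converged
    run with independent blocks, the two curves are statistically
    compatible and end at the same overall average."""
    def running(values):
        out, total = [], 0.0
        for i, v in enumerate(values, 1):
            total += v
            out.append(total / i)
        return out
    return running(blocks), running(list(reversed(blocks)))
```

Both curves necessarily end at the global average; what signals a non-converged run is an incompatible shape of their early segments.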
Multi-core run

Your first calculation has finished. If you want, you can add more blocks to the EZFIO database. To do this, run a calculation in parallel using

$ ccc_msub A 1.1.HF.160core.sub

Check that the error bar is significantly reduced, and that the total CPU time is about 160x larger:

$ qmcchem_result.py -s -c 1.1.HF

# Summary
Number of blocks          :  3168
Number of blocks per core :    27
Total CPU time            :  9:49:16
CPU time / block          :  11.292(19)
Acceptance rate           :  0.923571(13)
e_loc                     : -108.98489(56)
Variance of e_loc         :  26.082(58)
Min of e_loc              : -397.069319894
Max of e_loc              :  4547.98196953

Now, you have enough blocks to verify that the blocks have a Gaussian distribution:

$ qmcchem_result.py -H E_loc 1.1.HF
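A quick numerical complement to the histogram is to look at the first moments of the block averages; for converged, independent blocks the central limit theorem makes their distribution nearly Gaussian, so the skewness should be close to zero. This sketch is illustrative Python, not a QMC=Chem tool:

```python
def block_moments(blocks):
    """Mean, variance and skewness of a list of block averages.
    A skewness far from 0 suggests the blocks are not yet Gaussian,
    e.g. because of rare local-energy spikes or too-short blocks."""
    n = len(blocks)
    mean = sum(blocks) / n
    var = sum((b - mean) ** 2 for b in blocks) / n
    if var == 0.0:
        return mean, var, 0.0
    skew = sum((b - mean) ** 3 for b in blocks) / (n * var ** 1.5)
    return mean, var, skew
```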
Run VMC calculations for the 3 other trial wave functions and check that the energy corresponds to the energy given by GAMESS.
Running DMC calculations

For each EZFIO directory, modify the simulation parameters as follows:

$ qmcchem_edit.py 1.1.HF

method    = "DMC"
sampling  = "Brownian"
time_step = 0.0001
title     = "1.1 HF DMC"
walk_num  = 40
method : Choose DMC to perform a Diffusion Monte Carlo calculation.
sampling : Choose the Brownian motion for DMC. Langevin is not adapted.
time_step : With the Brownian motion, this time step is sufficiently small to give a small time-step error, and the Metropolis acceptance rate is close to 99.9%.
walk_num : We use a DMC algorithm with a fixed number of walkers and no population control bias. The counterpart is that with a small number of walkers, additional fluctuations of the local energy are introduced. It is therefore preferable to increase the number of walkers for the DMC calculation.

This set of parameters is fine for all the runs. As the effective time step is approximately 10 times smaller than in VMC, the total computational time needed to obtain an error bar comparable to the one obtained in VMC will be 10 times longer. Run a first short DMC calculation with a small number of cores (typically one node), such that the walkers move from the VMC distribution to the DMC distribution. Clear the computed data by uncommenting clear(blocks), since this first calculation is not well converged:
$ qmcchem_edit.py 1.1.HF

# Clear
#
# clear(all_blocks)
clear(blocks)
# clear(jastrow)
# clear(walkers)

Then, run a longer calculation on 10 nodes. You should obtain these energies:

Nodes   R (angstroms)   E(DMC) (a.u.)
HF      1.1             -109.4869(63)
HF      4.0             -109.1498(76)
CAS     1.1             -109.5094(66)
CAS     4.0             -109.1360(61)
Dissociation energies:

W.F.      Delta E (a.u.)
HF        0.1883
CAS       0.3246
DMC/HF    0.3371(98)
DMC/CAS   0.3734(90)
Exact     0.3632
Adding a new property

In this section, we will modify the sources of QMC=Chem to compute a new property. The 3D space is partitioned into two subspaces separated by the plane perpendicular to the N-N bond. We will compute the probability of finding 1,2,3,...,14 electrons in one subspace. The corresponding local operator is implemented as an array P(elec_num): P(m) = 1.d0, where m is the number of electrons in the subspace, and P = 0.d0 elsewhere. The average of this operator gives the probability of finding 1,2,3,...,14 electrons in the subspace.
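Before touching the Fortran sources, the logic of this operator can be prototyped in a few lines (illustrative Python; the names `halfspace_count` and `p_operator` are ours, and unlike the QMC=Chem provider shown below, this simple version does not symmetrize over the two subspaces):

```python
def halfspace_count(elec_coords):
    """Number of electrons with z > 0, i.e. on one side of the plane
    perpendicular to the N-N bond (taken here along the z axis)."""
    return sum(1 for (x, y, z) in elec_coords if z > 0.0)

def p_operator(elec_coords, n_elec=14):
    """Local operator P as an array: P[m-1] = 1.0 when m electrons are
    in the subspace, 0.0 elsewhere. Averaging P over the Monte Carlo
    samples gives the probability of finding m electrons there."""
    p = [0.0] * n_elec
    m = halfspace_count(elec_coords)
    if m > 0:
        p[m - 1] = 1.0
    return p
```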
Adding the property to the sources

First, go into the QMC=Chem source directory:

$ cd ${QMCCHEM_PATH}/src

Create a new file named properties_cecam.irp.f with the following content:

!==========================================================================!
! PROPERTIES
!==========================================================================!

BEGIN_PROVIDER [ double precision, proba_N2, (14) ]
 implicit none
 BEGIN_DOC
! Probability of finding N electrons on one N atom in N2
 END_DOC
 integer :: i, n
 n = 0
 proba_N2 = 0.d0
 do i=1,elec_num
   if (elec_coord(i,3) > 0.d0) then
     n += 1
   endif
 enddo
 if (n > 0) then
   proba_N2(n) = 0.5d0
   proba_N2(elec_num-n) = 0.5d0
 endif
END_PROVIDER

Do not remove the 3 first commented lines: they are used by an embedded shell script to detect that what follows are properties to compute. Then, build the program:

$ cd ${QMCCHEM_PATH}
$ make

Before running tests, we have to restore the VMC parameters in our EZFIO databases.
Restoring the VMC configuration

QMC=Chem keeps track of all the modifications of the EZFIO database:

$ qmcchem_log.py 1.1.HF

 # |        Date         |               MD5                | Description
 1 | 2013-07-08 14:05:39 | 97395378eaa00b194de0536dbd172153 | Edit
 2 | 2013-07-08 14:05:42 | 97395378eaa00b194de0536dbd172153 | Generate new walkers
 3 | 2013-07-08 14:05:43 | 97395378eaa00b194de0536dbd172153 | Start run
 4 | 2013-07-08 14:06:08 | 97395378eaa00b194de0536dbd172153 | Stop run
 5 | 2013-07-08 14:06:28 | f622a3fc6e35fc3a75717d43e1b84de2 | Edit
 6 | 2013-07-08 14:06:36 | 9e8b5122372c7f6e7698a8a55861131b | Edit
 7 | 2013-07-08 14:06:43 | 9e8b5122372c7f6e7698a8a55861131b | Clear all_blocks
 8 | 2013-07-08 15:41:20 | 295d77618e1507bbd3152d6e33610ddb | Start run
 9 | 2013-07-08 15:43:28 | 295d77618e1507bbd3152d6e33610ddb | Stop run
10 | 2013-07-09 11:38:40 | 93b358fb4ead5aa061b40c2807ea0e73 | Edit
11 | 2013-07-09 11:38:59 | 93b358fb4ead5aa061b40c2807ea0e73 | Start run
12 | 2013-07-09 11:44:07 | 93b358fb4ead5aa061b40c2807ea0e73 | Stop run
From this data, you can identify that the DMC run should be at step number 10, as the MD5 key has changed. To verify this, run:

$ qmcchem_log.py log 10 1.1.HF

Date        : 2013-07-09 11:38:40
MD5         : 93b358fb4ead5aa061b40c2807ea0e73
Description : Edit

Wave function
=============
N_atoms     = 2
N_electrons = 14 (7 alpha, 7 beta)
N_det       = 1
N_MOs       = 60
N_AOs       = 70
no Jastrow
nuclear cusp fitting

DMC
===
time_step = 0.0001
sampling  = Brownian
N_steps   = 10000
N_walkers = 40

Modified
========
simulation/http_server
simulation/time_step
simulation/sampling
electrons/elec_walk_num
simulation/method
simulation/title

You see that it is a DMC run, and that simulation/method has been modified since the previous step. This confirms that it is the first DMC calculation. Now, you can check out the configuration of the VMC run just before this DMC run:

$ qmcchem_log.py checkout 9 1.1.HF

Date        : 2013-07-09 23:22:29
MD5         : 295d77618e1507bbd3152d6e33610ddb
Description : Checked out 9

Wave function
=============
N_atoms     = 2
N_electrons = 14 (7 alpha, 7 beta)
N_det       = 1
N_MOs       = 60
N_AOs       = 70
no Jastrow
nuclear cusp fitting

VMC
===
time_step = 0.2
sampling  = Langevin
N_steps   = 10000
N_walkers = 20

Modified
========
electrons/elec_coord.gz
simulation/http_server
simulation/time_step
simulation/print_level
simulation/sampling
electrons/elec_walk_num
simulation/method
simulation/title

You can verify that this corresponds to the VMC configuration.
Running the code with the new property to sample

Now, when you run qmcchem_edit.py, a new item appears:

# Properties
#
...
( ) e_ref_weight
( ) proba_n2
( ) voronoi_charges
...

Activate the proba_n2 property by putting an X between the brackets:

(X) proba_n2

Then, submit VMC calculations for both the HF and the CASSCF trial wave functions. The results can be checked using:

$ qmcchem_result.py -t proba_n2 1.1.HF

# proba_n2
## Idx   Average
    1    0.000000
    2    0.000000
    3    0.000156(15)
    4    0.01629(26)
    5    0.09477(52)
    6    0.23309(38)
    7    0.15568(52)
    8    0.23309(38)
    9    0.09477(52)
   10    0.01629(26)
   11    0.000156(15)
   12    0.000000
   13    0.000000
   14    0.000000
Now, check out the corresponding DMC calculations and sample the histograms. The DMC-sampled quantities correspond to the mixed distribution $\Psi_T \Psi_0$. A first-order approximation to the properties computed with the exact distribution $\Psi_0^2$ can be obtained by the extrapolation $\langle A \rangle \simeq 2\langle A \rangle_{\rm DMC} - \langle A \rangle_{\rm VMC}$. Here are the expected probabilities P(n):

 n    HF      DMC(HF)   2DMC-VMC(HF)   CASSCF   DMC(CASSCF)   2DMC-VMC(CASSCF)
 4    0.016   0.010     0.004          0.004    0.004         0.004
 5    0.095   0.077     0.059          0.054    0.051         0.048
 6    0.233   0.240     0.247          0.244    0.244         0.244
 7    0.156   0.173     0.190          0.198    0.201         0.204
 8    0.233   0.240     0.247          0.244    0.244         0.244
 9    0.095   0.077     0.059          0.054    0.051         0.048
10    0.016   0.010     0.004          0.004    0.004         0.004
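The 2DMC-VMC columns are pure post-processing; a minimal sketch of this extrapolation (illustrative Python; the helper name and the error combination, which assumes independent VMC and DMC error bars, are ours):

```python
def extrapolate_mixed(vmc, dmc, vmc_err=0.0, dmc_err=0.0):
    """First-order extrapolation <A> ~ 2<A>_DMC - <A>_VMC of a property
    sampled over the mixed distribution, with the statistical errors
    combined in quadrature assuming independent runs."""
    value = 2.0 * dmc - vmc
    error = (4.0 * dmc_err ** 2 + vmc_err ** 2) ** 0.5
    return value, error
```

For n = 5 with HF nodes, 2 x 0.077 - 0.095 = 0.059, the value quoted in the table.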
Going from the HF wave function to the CASSCF wave function tends to increase the weight of the neutral components (the probability of finding 7 electrons), which is expected. One can also remark that part of this correction is achieved by the DMC algorithm even with HF nodes. With CASSCF nodes, the trial wave function already has much better probabilities, and the DMC algorithm has less work to do. This shows that the CASSCF nodes are much more physical than the HF nodes, and illustrates the difference observed in the total energies when going from HF nodes to CASSCF nodes.
[1] http://www.msg.ameslab.gov/gamess/
[2] http://ezfio.sourceforge.net . EZFIO is the Easy Fortran I/O library generator written with IRPF90. The data is organized using the filesystem tree, in plain-text (possibly gzipped) files.
Michel Caffarel Quantum Monte Carlo Methods in Chemistry
SpringerReference
Quantum Monte Carlo Methods in Chemistry

Synonyms and Acronyms: Fixed-node diffusion Monte Carlo (FN-DMC); Green's function Monte Carlo (GFMC); Pure diffusion Monte Carlo (PDMC); Reptation Monte Carlo (RMC); Stochastic reconfiguration Monte Carlo (SRMC); Variational Monte Carlo (VMC)
Description of the Problem

The problem considered here is to obtain accurate solutions of the time-independent Schrödinger equation for a general molecular system described as N electrons moving within the external potential of a set of fixed nuclei. This problem can be considered as the central problem of theoretical and computational chemistry. Using the atomic units adapted to the molecular scale, the Schrödinger equation to solve can be written as

  $H\,\Psi(\mathbf{r}_1,\ldots,\mathbf{r}_N) = E\,\Psi(\mathbf{r}_1,\ldots,\mathbf{r}_N)$   (1)

where H is the Hamiltonian operator given by

  $H = -\frac{1}{2}\sum_{i=1}^{N} \nabla_i^2 + V(\mathbf{r}_1,\ldots,\mathbf{r}_N)$   (2)

with $(\mathbf{r}_1,\ldots,\mathbf{r}_N)$ the spatial positions of the N electrons, $\nabla_i^2$ the Laplacian operator for electron i of coordinates $\mathbf{r}_i = (x_i, y_i, z_i)$, Ψ the wavefunction, E the total energy (a real constant), and V the potential energy function expressed as

  $V = \sum_{i<j} \frac{1}{r_{ij}} - \sum_{i=1}^{N}\sum_{\alpha} \frac{Z_\alpha}{r_{i\alpha}} + \sum_{\alpha<\beta} \frac{Z_\alpha Z_\beta}{R_{\alpha\beta}}$   (3)

In this formula $r_{ij} = |\mathbf{r}_i - \mathbf{r}_j|$ is the interelectronic distance, $Z_\alpha$ the charge of nucleus α (a positive integer), $\mathbf{R}_\alpha$ its vector position, $r_{i\alpha} = |\mathbf{r}_i - \mathbf{R}_\alpha|$, and $R_{\alpha\beta} = |\mathbf{R}_\alpha - \mathbf{R}_\beta|$. The Schrödinger equation being invariant under complex conjugation, we can restrict without loss of generality the eigensolutions to be real-valued. The boundary conditions are of Dirichlet type: eigenfunctions Ψ are imposed to vanish whenever one electron (or more) goes to infinity

  $\Psi(\mathbf{r}_1,\ldots,\mathbf{r}_N) \to 0 \quad \text{when } |\mathbf{r}_i| \to \infty$   (4)
In addition, the mathematical constraints resulting from the Pauli principle must be considered. Within a space-only formalism as employed in QMC, two types of electrons – usually referred to as the "spin-up" and "spin-down" electrons – are distinguished and the Pauli principle is expressed as follows. Among all eigenfunctions verifying (1)–(4), only those that are antisymmetric under the exchange of any pair of spin-like electrons are physically allowed. Because of the permutational invariance, the N↑ spin-up electrons can be arbitrarily chosen as those having the first labels and the mathematical conditions can be written as

  $\Psi(\ldots,\mathbf{r}_j,\ldots,\mathbf{r}_i,\ldots) = -\Psi(\ldots,\mathbf{r}_i,\ldots,\mathbf{r}_j,\ldots), \quad i, j \in \{1,\ldots,N_\uparrow\}$   (5a)
http://www.springerreference.com/index/chapterdbid/333776 © SpringerVerlag Berlin Heidelberg 2012
16 Jul 2012 13:27 1
and

  $\Psi(\ldots,\mathbf{r}_j,\ldots,\mathbf{r}_i,\ldots) = -\Psi(\ldots,\mathbf{r}_i,\ldots,\mathbf{r}_j,\ldots), \quad i, j \in \{N_\uparrow + 1,\ldots,N\}$   (5b)
for all pairs (i, j) of spin-like electrons. Equations (1)–(5b) define the mathematical problem discussed here. Although such a mathematical model results from a number of physical approximations, it contains the bulk of most chemical phenomena, and solving it with enough accuracy (= chemical accuracy) can be considered as the major problem of computational chemistry. The two standard approaches to deal with the electronic structure problem in chemistry are density functional theory (DFT) (Density Functional Theory) and the post-Hartree–Fock wavefunction approaches (Post-Hartree-Fock Methods and Excited States Modelling, Coupled-Cluster Methods). Quantum Monte Carlo (QMC), presented here, may be viewed as an alternative approach aiming at circumventing the limitations of these two well-established methods (for a detailed presentation of QMC, see, e.g., [1]). In contrast with these deterministic approaches, QMC is based on a stochastic sampling of the electronic configuration space. In recent years, a number of remarkable applications have been presented, thus establishing QMC as a high-potential approach, although a number of limitations are still present. Here, we shall present the two most popular approaches used in chemistry, namely, the variational Monte Carlo (VMC) and the fixed-node diffusion Monte Carlo (FN-DMC) methods.
The Variational Monte Carlo (VMC) Method

The variational Monte Carlo (VMC) method is the simplest and most popular quantum Monte Carlo approach. From a mathematical point of view, VMC is a standard Markov chain Monte Carlo (MCMC) method. Introducing an approximate trial wavefunction $\Psi_T$ known in an analytic form (a good approximation of the unknown wavefunction), the Metropolis-Hastings algorithm is used to generate sample points distributed in the 3N-dimensional configuration space according to the quantum-mechanical probability density π associated with $\Psi_T$

  $\pi(\mathbf{R}) = \frac{\Psi_T^2(\mathbf{R})}{\int \Psi_T^2(\mathbf{R})\, d\mathbf{R}}$   (6)

where $\mathbf{R} = (\mathbf{r}_1,\ldots,\mathbf{r}_N)$ is a compact notation representing the positions of the N electrons. Expectation values corresponding to various physical properties can be rewritten as averages over π. As an important example, the total energy defined as

  $E = \frac{\langle \Psi_T | H | \Psi_T \rangle}{\langle \Psi_T | \Psi_T \rangle}$   (7)

may be rewritten under the form

  $E = \int \pi(\mathbf{R})\, E_L(\mathbf{R})\, d\mathbf{R}$   (8)

where $E_L$ is the local energy defined as

  $E_L(\mathbf{R}) = \frac{H \Psi_T(\mathbf{R})}{\Psi_T(\mathbf{R})}$   (9)
In VMC, the total energy is thus estimated as a simple average of the local energy over a sufficiently large number K of configurations $\mathbf{R}_k$ generated with the Monte Carlo procedure

  $E \simeq \frac{1}{K} \sum_{k=1}^{K} E_L(\mathbf{R}_k)$   (10)

the estimator becoming exact as K goes to infinity, with a statistical error decreasing as $1/\sqrt{K}$. Properties other than the energy can be obtained in a similar way. In the case of the energy, it can be shown that there exists a variational principle expressed as $E_{\rm VMC}(\Psi_T) \ge E_0$ for any $\Psi_T$, the equality being obtained for the exact ground-state wavefunction of energy $E_0$. In addition, there also exists a zero-variance principle stating that the closer the trial wavefunction is to the exact solution, the smaller the fluctuations of the local energy are, the statistical error vanishing in the limit of an exact trial wavefunction. In practice, both principles – minimization of the energy and/or of the fluctuations of the local energy – are at the basis of the various approaches proposed for optimizing the parameters entering the trial wavefunction.
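The whole VMC loop – Metropolis sampling of $\Psi_T^2$ followed by averaging of the local energy, Eqs. (6)–(10) – fits in a few lines for a toy system. The sketch below (illustrative Python, not QMC=Chem) treats the hydrogen atom with $\Psi_T = e^{-\alpha r}$, for which $E_L = -\alpha^2/2 + (\alpha - 1)/r$; at α = 1 the trial wavefunction is exact and, by the zero-variance principle, every sampled local energy equals -0.5 exactly:

```python
import math
import random

def vmc_hydrogen(alpha=0.9, n_steps=20000, step=0.5, seed=1):
    """Toy VMC: Metropolis sampling of psi_T^2 with psi_T = exp(-alpha*r)
    for the hydrogen atom; returns the average of the local energy
    E_L = -alpha**2/2 + (alpha - 1)/r."""
    rng = random.Random(seed)
    pos = [0.5, 0.5, 0.5]
    r = math.sqrt(sum(x * x for x in pos))
    e_sum = 0.0
    for _ in range(n_steps):
        new = [x + step * (rng.random() - 0.5) for x in pos]
        r_new = math.sqrt(sum(x * x for x in new))
        # Metropolis acceptance on pi = psi_T^2: ratio = exp(-2*alpha*(r'-r))
        if rng.random() < math.exp(-2.0 * alpha * (r_new - r)):
            pos, r = new, r_new
        e_sum += -0.5 * alpha ** 2 + (alpha - 1.0) / r
    return e_sum / n_steps
```

For other values of alpha the estimate lies above -0.5 (variational principle), with a statistical error decreasing as the inverse square root of the number of steps.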
The Diffusion Monte Carlo (DMC) Method

The fundamental idea is to introduce a formal mathematical connection between the quantum and stochastic worlds by introducing a fictitious time dynamics as follows

  $-\frac{\partial \Phi(\mathbf{R},t)}{\partial t} = (H - E_T)\,\Phi(\mathbf{R},t)$   (11)

where t plays the role of a time variable, $\Phi(\mathbf{R},t)$ a time-dependent real wavefunction, and $E_T$ some constant reference energy. The solution of this equation is uniquely defined by the choice of the initial wavefunction $\Phi(\mathbf{R},0)$. Using the spectral decomposition of the self-adjoint (hermitian) Hamiltonian operator, the solution of (11) can be written as

  $\Phi(\mathbf{R},t) = \sum_i c_i\, e^{-(E_i - E_T)t}\, \phi_i(\mathbf{R})$   (12)

where the sum is performed over the complete set of the eigensolutions of the Hamiltonian operator

  $H \phi_i = E_i \phi_i$   (13)

and $c_i = \langle \phi_i | \Phi(t{=}0) \rangle$. As seen from (12), the knowledge of the time-dependent solution of the Schrödinger equation allows direct access to information about the time-independent eigensolutions $(E_i, \phi_i)$. As an important example, the exact ground-state wavefunction (corresponding to the smallest eigenvalue $E_0$) can be obtained by considering the large-time limit of the time-dependent wavefunction

  $\Phi(\mathbf{R},t) \underset{t \to \infty}{\sim} c_0\, e^{-(E_0 - E_T)t}\, \phi_0(\mathbf{R})$   (14)
up to an unessential multiplicative factor. In practice, to have an efficient Monte Carlo simulation of the original time-dependent equation, we need to employ some sort of importance sampling, that is, a practical scheme for sampling only the regions of the very high-dimensional configuration space where the quantities to be averaged have a non-vanishing contribution. Here, it is realized by
introducing a trial wavefunction $\Psi_T$ (usually optimized in a preliminary VMC step) and by defining a new time-dependent density as follows

  $\pi(\mathbf{R},t) = \Psi_T(\mathbf{R})\,\Phi(\mathbf{R},t)$   (15)

The equation that π obeys can be derived without difficulty from (11) and (15); we get

  $\frac{\partial \pi(\mathbf{R},t)}{\partial t} = L\,\pi(\mathbf{R},t) - \left[E_L(\mathbf{R}) - E_T\right]\pi(\mathbf{R},t)$   (16)

where L is a forward Fokker-Planck operator defined as (see, e.g., [2])

  $L = \frac{1}{2}\nabla^2 - \nabla \cdot \mathbf{b}$   (17)

and $\mathbf{b}$ the drift vector given by

  $\mathbf{b}(\mathbf{R}) = \frac{\nabla \Psi_T(\mathbf{R})}{\Psi_T(\mathbf{R})}$   (18)

In order to define a step-by-step Monte Carlo algorithm, the fundamental equation (16) is rewritten under the following equivalent integral form describing the evolution of the density during a time interval τ

  $\pi(\mathbf{R}',t+\tau) = \int d\mathbf{R}\; K(\mathbf{R}',\mathbf{R};\tau)\, \pi(\mathbf{R},t)$   (19)

where K is the following integral kernel (or imaginary-time propagator)

  $K(\mathbf{R}',\mathbf{R};\tau) = \langle \mathbf{R}' |\, e^{\tau \left[L - (E_L - E_T)\right]} \,| \mathbf{R} \rangle$   (20)

For an arbitrary value of τ, the kernel is not known. However, for a small enough time step, accurate approximations of K can be obtained and sampled. To see this, let us first split the exponential operator into a product of exponentials by using the Baker-Campbell-Hausdorff formulas [3]

  $e^{\tau \left[L - (E_L - E_T)\right]} = e^{\tau L}\, e^{-\tau (E_L - E_T)} + O(\tau^2)$   (21)

and then introduce a short-time gaussian approximation of the Fokker-Planck kernel [2]

  $K_{FP}(\mathbf{R}',\mathbf{R};\tau) \simeq \frac{1}{(2\pi\tau)^{3N/2}}\, e^{-\left[\mathbf{R}'-\mathbf{R}-\tau\,\mathbf{b}(\mathbf{R})\right]^2 / 2\tau}$   (22)

Finally, a working short-time approximation of the DMC kernel can be written as

  $K_{DMC}(\mathbf{R}',\mathbf{R};\tau) \simeq K_{FP}(\mathbf{R}',\mathbf{R};\tau)\; e^{-\tau \left[\frac{E_L(\mathbf{R}') + E_L(\mathbf{R})}{2} - E_T\right]}$   (23)
By considering small enough τ, the residual error (called the short-time error in the context of QMC) can be made arbitrarily small. In practice, the DMC simulation is performed as follows. A population of walkers (or configurations $\mathbf{R}$) propagated stochastically from generation to generation according to the DMC kernel is introduced. At each
step, the walkers are moved according to the gaussian transition probability, (22). Next, each walker is killed, kept unchanged, or duplicated a certain number of times proportionally to the remaining part of the $K_{DMC}$ kernel, namely, the weight $w = e^{-\tau \left[\frac{E_L(\mathbf{R}') + E_L(\mathbf{R})}{2} - E_T\right]}$. In practice, an unbiased integer estimator M defining the number of copies (M = 0, 1, …) is used, M = E[w + u], where E is the integer part and u is a uniform random number in (0, 1) (unbiased since the average of M equals w). In contrast with the Fokker-Planck part, this branching (or birth-death) process causes fluctuations in the number of walkers. Because of that, some sort of population control step is needed [1]. The stationary distribution resulting from these stochastic rules can be obtained as the time-independent solution of (16). After some simple algebra we get

  $\pi(\mathbf{R}, t \to \infty) = \Psi_T(\mathbf{R})\,\phi_0(\mathbf{R})$   (24)

provided the reference energy $E_T$ is adjusted to the exact value, $E_T = E_0$. From this mixed DMC distribution density, a simple and unbiased estimator of the total energy is obtained

  $E_0 = \frac{\int \pi(\mathbf{R})\, E_L(\mathbf{R})\, d\mathbf{R}}{\int \pi(\mathbf{R})\, d\mathbf{R}} = \langle E_L \rangle_\pi$   (25)
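The ingredients above – Gaussian diffusion, branching with M = E[w + u], and population control through $E_T$ – can be assembled for a toy problem. The sketch below (illustrative Python, not QMC=Chem) runs DMC without importance sampling for the 1D harmonic oscillator, whose exact ground-state energy is 0.5 a.u.; the feedback coefficient 0.1 steering $E_T$ is an arbitrary choice:

```python
import math
import random

def dmc_harmonic(n_target=300, n_steps=2000, tau=0.05, seed=2):
    """Toy DMC for V(x) = x^2/2 (exact E0 = 0.5): Gaussian diffusion of
    variance tau, then branching with weight w = exp(-tau*(V - E_T))
    using the unbiased integer estimator M = int(w + u)."""
    rng = random.Random(seed)
    walkers = [rng.gauss(0.0, 1.0) for _ in range(n_target)]
    e_t = 0.5
    trace = []
    for _ in range(n_steps):
        children = []
        for x in walkers:
            x_new = x + rng.gauss(0.0, math.sqrt(tau))
            w = math.exp(-tau * (0.5 * x_new ** 2 - e_t))
            for _ in range(int(w + rng.random())):  # M = 0, 1, 2, ... copies
                children.append(x_new)
        walkers = children or [rng.gauss(0.0, 1.0)]  # never let the population die
        # crude population control: steer E_T to keep about n_target walkers
        e_t += 0.1 * math.log(n_target / len(walkers))
        trace.append(e_t)
    # average E_T over the second half, after the population has equilibrated
    return sum(trace[n_steps // 2:]) / (n_steps - n_steps // 2)
```

The averaged $E_T$ settles close to 0.5, up to the short-time error in τ and the bias introduced by this simplistic population control.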
For properties other than the energy, the exact distribution density $\phi_0^2$ must be sampled. This can be realized in different ways, for example, by using a forward-walking scheme [4] or a reptation Monte Carlo algorithm [5].
The Fixed-Node Approximation

In the preceding section, the DMC approach has been presented without taking care of the specific mathematical constraints resulting from the Pauli principle, (5b). As it is, this algorithm can be directly employed for quantum systems not subject to such constraints (bosonic systems, quantum oscillators, ensembles of distinguishable particles, etc.). An important remark is that the algorithm converges to the stationary density, (24), associated with the lowest eigenfunction which, in the case of a Hamiltonian of the form $H = -\frac{1}{2}\nabla^2 + V$, is known to have a constant sign (say, positive). This property is the generalization to continuous operators of the Perron-Frobenius theorem valid for matrices with off-diagonal elements of the same sign. For electronic systems, the additional fermionic constraints are to be taken into account, and we must now force the DMC algorithm to converge to the lowest eigenfunction obeying the Pauli principle (the "physical" or fermionic ground state) and not to the "mathematical" (or bosonic) ground state having a constant sign. Unfortunately, up to now it has not been possible to define a computationally tractable (polynomial) algorithm implementing exactly such a property for a general fermionic system (the "sign problem"). However, at the price of introducing a fixed-node approximation, a stable method can be defined. This approach, called fixed-node DMC (FN-DMC), just consists in choosing a trial wavefunction fulfilling the fermionic constraints, (5b). In contrast with bosonic-type simulations, where the trial wavefunction does not vanish at finite distances, the walkers are now no longer free to move within the entire configuration space. This property results directly from the fact that the nodes of the trial wavefunction [defined as the (3N − 1)-dimensional hypersurface where $\Psi_T(\mathbf{R}) = 0$] act as infinitely repulsive barriers for the walkers [divergence of the drift vector, (18)]. Each walker is thus trapped forever within the nodal pocket cut out by the nodes of $\Psi_T$ where it starts from, and the Schrödinger equation is now solved with the additional fixed-node boundary conditions defined as

  $\Psi(\mathbf{R}) = 0 \quad \text{wherever } \Psi_T(\mathbf{R}) = 0$   (26)

When the nodes of $\Psi_T$ coincide with the exact nodes, the algorithm is exact. If not, a fixed-node error is introduced. Fortunately, not all the nodal pockets need to be sampled – which would be an unrealistic task for large systems – due
to the existence of a "tiling" theorem stating that all the nodal pockets of the fermionic ground state are essentially equivalent and related by permutational invariance [6]. For a mathematical presentation of the fixed-node approximation, see Ref. [7]. Finally, remark that defining an exact fermionic DMC scheme avoiding the fixed-node approximation is, in principle, not difficult: for example, by letting the walkers go through the nodes and keeping track of the various changes of sign of the trial wavefunction. However, in practice all the schemes proposed up to now are faced with an exponentially vanishing signal-to-noise ratio related to the uncontrolled fluctuations of the sign of the trial wavefunction. For details, the reader is referred to the work by Ceperley and Alder [8].
The Trial Wavefunction

A standard form for the trial wavefunction is

  $\Psi_T(\mathbf{R}) = e^{J(\mathbf{R})} \sum_K c_K\, D_K^{\uparrow}\, D_K^{\downarrow}$   (27)

where the term $e^{J(\mathbf{R})}$ is usually referred to as the Jastrow factor, describing explicitly the electron-electron interactions at different levels of approximation. A quite general form employed for J is

  $J(\mathbf{R}) = \sum_{i<j} U_{ee}(r_{ij}) + \sum_{i,\alpha} U_{en}(r_{i\alpha}) + \sum_{i<j,\,\alpha} U_{een}(r_{ij}, r_{i\alpha}, r_{j\alpha}) + \ldots$   (28)

where the U's are simple functions (many different expressions have been employed). The second part of the wavefunction is quite standard in chemistry and describes the shell structure of molecules via a linear combination of products of two Slater determinants built from one-electron molecular orbitals. Note that several other forms for the trial wavefunction have been introduced in the literature, but so far they have remained of marginal use. Finally, let us emphasize that, the magnitude of the statistical error and the importance of the fixed-node bias being directly related to the quality of the trial wavefunction (both errors vanish in the limit of an exact wavefunction), it is in general quite profitable to optimize the parameters of the trial wavefunction. Several approaches have been proposed; we just mention here the recently proposed method of Umrigar and collaborators [9].
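As an illustration of Eqs. (27)–(28), here is a minimal two-body term of the common Padé form U(r) = a·r/(1 + b·r) (illustrative Python; the Padé form is only one of the "many different expressions" mentioned above, a = 1/2 being the value that enforces the electron-electron cusp condition for antiparallel spins, and b a variational parameter):

```python
import math

def pade_u(r, a=0.5, b=1.0):
    """Two-body Pade term U(r) = a*r/(1 + b*r): linear with slope a at
    short range (the cusp), saturating to a/b at large separation."""
    return a * r / (1.0 + b * r)

def jastrow_factor(pair_distances, a=0.5, b=1.0):
    """exp(J), with J the sum of two-body terms over all electron pairs."""
    return math.exp(sum(pade_u(r, a, b) for r in pair_distances))
```

Optimizing parameters such as b by minimizing the energy and/or the variance of the local energy is precisely the trial-wavefunction optimization discussed above.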
Applications

In computational chemistry, the vast majority of the VMC and FN-DMC applications have been concerned with the calculation of total energies and differences of total energies: atomization energies, electronic affinities, ionization potentials, reaction barriers, excited-state energies, etc. To get a brief view of what can be achieved with QMC, let us mention the existence of several benchmark studies comparing FN-DMC with the standard DFT and post-HF methods [10–12]. In such studies, FN-DMC appears to be as accurate as the most accurate post-HF methods and advanced DFT approaches. In addition, like DFT – but in sharp contrast with the post-HF methods – the scaling of the computational cost as a function of the system size is favorable, typically O(N³). However, QMC simulations are much more CPU-intensive than DFT ones. To date, the largest systems studied involve about 2,000 active electrons, see, e.g., [13]. Finally, note that in principle all chemical properties can be evaluated using QMC. Unfortunately, reaching the desired accuracy is often difficult in practice. More progress is needed to improve the QMC estimators of such properties.
QMC and High-Performance Computing (HPC)

Let us end by emphasizing one of the most important practical aspects of QMC methods, namely their remarkable adaptation to high-performance computing (HPC) and, particularly, to massive parallel computations. As in most Monte Carlo algorithms, the computational effort is almost exclusively concentrated on pure CPU ("number crunching method").
In addition – and this is the key aspect for massive parallelism – calculations of averages can be decomposed at will: n Monte Carlo steps over a single processor are equivalent to n/p steps over each of p processors, with no communication between the processors (apart from the initial/final data transfers). Very recently, it has been demonstrated that an almost perfect parallel efficiency up to about 100,000 compute cores is achievable in practice [14, 15]. In view of the formidable development of computational platforms – presently up to a few hundreds of thousands of compute cores (petascale platforms), and many more soon (exascale in the near future) – this property could be critical in assuring the success of QMC in the years to come.
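The n/p decomposition is easy to demonstrate: each worker only needs its own random stream and returns a single block average. The sketch below is illustrative Python run sequentially; each `run_block` call could equally be dispatched to a different core, with only the p block averages communicated at the end:

```python
import random

def run_block(seed, n_steps):
    """One independent Monte Carlo stream with its own RNG and no
    communication. Toy integrand: <x^2> for x uniform on [0, 1]
    (exact value 1/3)."""
    rng = random.Random(seed)
    return sum(rng.random() ** 2 for _ in range(n_steps)) / n_steps

def mc_average(n_total, n_workers):
    """Split n_total steps as n_total/p per worker and combine only the
    p block averages (the 'initial/final data transfers' of the text)."""
    n = n_total // n_workers
    return sum(run_block(seed, n) for seed in range(n_workers)) / n_workers
```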
References
1. Foulkes, W.M.C., Mitas, L., Needs, R.J., Rajagopal, G.: Quantum Monte Carlo simulations of solids. Rev. Mod. Phys. 73, 33–83 (2001)
2. Risken, H.: The Fokker-Planck Equation: Methods of Solutions and Applications. Springer Series in Synergetics, 3rd edn. Springer, Berlin (1996)
3. Gilmore, R.: Baker-Campbell-Hausdorff formulas. J. Math. Phys. 15, 2090–2092 (1974)
4. Caffarel, M., Claverie, P.: Development of a pure diffusion quantum Monte Carlo method using a full generalized Feynman-Kac formula. I. Formalism. J. Chem. Phys. 88, 1088–1099 (1988)
5. Baroni, S., Moroni, S.: Reptation quantum Monte Carlo: a method for unbiased ground-state averages and imaginary-time correlations. Phys. Rev. Lett. 82, 4745–4748 (1999)
6. Ceperley, D.M.: Fermion nodes. J. Stat. Phys. 63, 1237–1267 (1991)
7. Cancès, E., Jourdain, B., Lelièvre, T.: Quantum Monte Carlo simulation of fermions. A mathematical analysis of the fixed-node approximation. Math. Models Methods Appl. Sci. 16, 1403–1440 (2006)
8. Ceperley, D.M., Alder, B.J.: Quantum Monte Carlo for molecules: Green's function and nodal release. J. Chem. Phys. 81, 5833–5844 (1984)
9. Umrigar, C.J., Toulouse, J., Filippi, C., Sorella, S., Hennig, R.G.: Alleviation of the fermion-sign problem by optimization of many-body wave functions. Phys. Rev. Lett. 98, 110201 (2007)
10. Manten, S., Lüchow, A.: On the accuracy of the fixed-node diffusion quantum Monte Carlo method. J. Chem. Phys. 115, 5362–5366 (2001)
11. Grossman, J.C.: Benchmark quantum Monte Carlo calculations. J. Chem. Phys. 117, 1434–1440 (2002)
12. Nemec, N., Towler, M.D., Needs, R.J.: Benchmark all-electron ab initio quantum Monte Carlo calculations for small molecules. J. Chem. Phys. 132, 034111 (2010)
13. Sola, E., Brodholt, J.P., Alfè, D.: Equation of state of hexagonal closed packed iron under Earth's core conditions from quantum Monte Carlo calculations. Phys. Rev. B 79, 024107 (2009)
14. Esler, K.P., Kim, J., Ceperley, D.M., Purwanto, W., Walter, E.J., Krakauer, H., Zhang, S.: Quantum Monte Carlo algorithms for electronic structure at the petascale; the Endstation project. J. Phys. Conf. Ser. 125, 012057 (2008)
15. Gillan, M.J., Towler, M.D., Alfè, D.: Petascale computing opens new vistas for quantum Monte Carlo. Psi-k Highlight of the Month (February 2011)
Quantum Monte Carlo Methods in Chemistry
Michel Caffarel, Laboratoire de Chimie et Physique Quantiques, IRSAMC, Université de Toulouse, Toulouse, France

DOI:            10.1007/SpringerReference_333776
URL:            http://www.springerreference.com/index/chapterdbid/333776
Part of:        Encyclopedia of Applied and Computational Mathematics
Editors:        -
PDF created on: July 16, 2012 13:27

© Springer-Verlag Berlin Heidelberg 2012
FULL PAPER
WWW.C-CHEM.ORG
Quantum Monte Carlo for Large Chemical Systems: Implementing Efficient Strategies for Petascale Platforms and Beyond Anthony Scemama,*[a] Michel Caffarel,[a] Emmanuel Oseret,[b] and William Jalby[b] Various strategies to implement efficiently quantum Monte Carlo (QMC) simulations for large chemical systems are presented. These include: (i) the introduction of an efficient algorithm to calculate the computationally expensive Slater matrices. This novel scheme is based on the use of the highly localized character of atomic Gaussian basis functions (not the molecular orbitals as usually done), (ii) the possibility of keeping the memory footprint minimal, (iii) the important enhancement of singlecore performance when efficient optimization tools are used, and (iv) the definition of a universal, dynamic, faulttolerant, and loadbalanced framework adapted to all kinds of computational platforms (massively parallel machines, clusters, or distributed grids).
These strategies have been implemented in the QMC=Chem code developed at Toulouse and illustrated with numerical applications on small peptides of increasing sizes (158, 434, 1056, and 1731 electrons). Using 10–80 k computing cores of the Curie machine (GENCI-TGCC-CEA, France), QMC=Chem has been shown to be capable of running at the petascale level, thus demonstrating that for this machine a large part of the peak performance can be achieved. Implementation of large-scale QMC simulations for future exascale platforms with a comparable level of efficiency is expected to be feasible. © 2013 Wiley Periodicals, Inc.
Introduction
Quantum Monte Carlo (QMC) is a generic name for a large class of stochastic approaches solving the Schrödinger equation by using random walks. In the last 40 years, they have been extensively used in several fields of physics including nuclear physics,[1] condensed-matter physics,[2] spin systems,[3] quantum liquids,[4] infrared spectroscopy,[5,6] and so on. In these domains, QMC methods are usually considered as routine methods and even in most cases as state-of-the-art approaches. In sharp contrast, this is not yet the case for the electronic structure problem of quantum chemistry, where QMC[7,8] is still of confidential use when compared to the two well-established methods of the domain [density functional theory (DFT) and post-Hartree–Fock methods]. Without entering into the details of the forces and weaknesses of each approach, a major limiting aspect of QMC hindering its diffusion is the high computational cost of the simulations for realistic systems. However – and this is the major concern of this work – a unique and fundamental property of QMC methods is their remarkable adaptation to high-performance computing (HPC) and, particularly, to massively parallel computations. In short, the algorithms are simple and repetitive, central memory requirements may be kept limited whatever the system size, and I/O flows are negligible. As in most Monte Carlo algorithms, the computational effort is almost exclusively concentrated on pure CPU ("number crunching method") and the execution time is directly proportional to the number of Monte Carlo steps performed. In addition, and this is a central point for massive parallelism, calculations of averages can be decomposed at will: n Monte Carlo steps over a single processor being equivalent to n/p Monte Carlo steps over p processors with no communication between the processors (apart from the initial/final data transfers). Once the QMC algorithm is suitably implemented, the maximum gain of parallelism (ideal scalability) should be expected. A most important point is that mainstream high-level quantum chemistry methods do not enjoy such a remarkable property. They are essentially based on iterative schemes defined within the framework of linear algebra and involve the manipulation and storage of extremely large matrices. Their adaptation to extreme parallelism is intrinsically problematic. Now, in view of the formidable development of computational platforms, particularly in terms of the number of computing cores (presently up to a few hundreds of thousands and many more to come), the practical bottleneck associated with the high computational cost of QMC is expected to become much less critical. Thus, QMC may become in the coming years a method of practical use for treating chemical problems out of the reach of present-day approaches. Following this line of thought, a number of QMC groups are presently working on implementing strategies allowing their QMC codes to run efficiently on very large-scale parallel
Journal of Computational Chemistry 2013, 34, 938–951
DOI: 10.1002/jcc.23216
[a] A. Scemama, M. Caffarel, Laboratoire de Chimie et Physique Quantiques, CNRS-IRSAMC, Université de Toulouse, France [b] E. Oseret, W. Jalby, Exascale Computing Research Laboratory, GENCI-CEA-INTEL-UVSQ, Université de Versailles Saint-Quentin, France Email: [email protected] Contract/grant sponsor: ANR (to AS and MC); Contract/grant number: ANR 2011 BS08 004 01. © 2013 Wiley Periodicals, Inc.
WWW.CHEMISTRYVIEWS.COM
FULL PAPER
WWW.CCHEM.ORG
computers.[9–11] Essentially, most strategies rely on massive parallelism and on some efficient treatment ("linear-scaling"-type algorithms) for dealing with the matrix computations and manipulations that represent the most CPU-expensive part of the algorithm. Here, we present several strategies implemented in the QMC=Chem code developed in our group at the University of Toulouse.[12] A number of simulations have been carried out on the Curie machine at the French GENCI-TGCC-CEA computing center, with almost ideal parallel efficiency in the range of 10,000–80,000 cores, reaching the petascale level. The contents of this article are as follows. In the first section, a brief account of the QMC method used is presented. Only those aspects essential to the understanding of the computational aspects discussed in this article are given. In the second section, the problem of computing efficiently the Slater matrices at the heart of the QMC algorithm (computational hot spot) is addressed. A novel scheme taking advantage of the highly localized character of the atomic Gaussian basis functions [not the molecular orbitals (MOs) as usually done] is proposed. A crucial point is that the approach is valid for an arbitrary molecular shape (e.g., compact molecules); there is no need to consider extended or quasi-one-dimensional molecular systems as in linear-scaling approaches. The third section discusses the overall performance of the code and illustrates how advantageous it can be to optimize the single-core performance for the specific processor at hand. The fourth section is devoted to the way our massively parallel simulations are deployed on a general computational platform and, particularly, to how fault-tolerance is implemented, a crucial property for any large-scale simulation. Finally, a summary of the various strategies proposed in this article is presented in the last section.
The QMC Method

In this article, we shall consider a variant of the fixed-node diffusion Monte Carlo (FN-DMC) approach, the standard QMC method used in computational chemistry. Here, we shall insist only on the aspects needed for understanding the rest of this work. For a complete presentation of the FN-DMC method, the reader is referred, for example, to Refs. [2], [7], or [8] and references therein.

Fixed-node diffusion Monte Carlo (FN-DMC)

Diffusion Monte Carlo. In a diffusion Monte Carlo scheme, a finite population of "configurations" or "walkers" moving in the 3N-dimensional space (N, number of electrons) is introduced. A walker is described by a 3N-dimensional vector R = (r_1, r_2, ..., r_N) giving the positions of the N electrons. At each Monte Carlo step, each walker of the population is diffused and drifted according to

    R' = R + τ b(R) + √τ η    (1)

where τ is a small time step, η is a Gaussian vector (3N independent normally distributed components simulating a free Brownian diffusion), and b(R) is the drift vector given by

    b(R) = ∇Ψ_T(R) / Ψ_T(R)    (2)

where Ψ_T, the trial wavefunction, is a known computable approximation of the exact wavefunction. At the end of this drift/diffusion step, each walker is killed, kept unchanged, or duplicated a certain number of times proportionally to the branching weight w given by

    w = e^{−(τ/2)[(E_L(R') − E_T) + (E_L(R) − E_T)]}    (3)

where E_T is some reference energy and E_L the local energy defined as

    E_L(R) = HΨ_T(R) / Ψ_T(R)    (4)

The population is propagated and, after some equilibration time, it enters a stationary regime where averages are evaluated. As an important example, the exact energy may be obtained as the average of the local energy.

The fixed-node approximation. Apart from the statistical and the short-time (finite time-step) errors, which can be made arbitrarily small, the only systematic error left in a DMC simulation is the so-called fixed-node (FN) error. This error results from the fact that the nodes of the trial wavefunction [defined as the (3N−1)-dimensional hypersurface where Ψ_T(R) = 0] act as infinitely repulsive barriers for the walkers [divergence of the drift vector, eq. (2)]. Each walker is thus trapped forever within the nodal pocket delimited by the nodes of Ψ_T where it starts from. When the nodes of Ψ_T coincide with the exact nodes, the algorithm is exact. If not, a variational FN error is introduced. However, with the standard trial wavefunctions used, this error is in general small,* a few percent of the correlation energy for total energies.
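To make the propagation step concrete, here is a minimal single-walker sketch of eqs. (1)–(3) for a toy problem of our choosing (not from the paper): one particle in 3D with H = −½∇² + ½|R|² and the Gaussian trial wavefunction Ψ_T(R) = e^{−α|R|²}, for which b(R) = −2αR and E_L(R) = 3α + (½ − 2α²)|R|².

```python
import numpy as np

# Toy illustration (not from the paper): 1 particle in 3D,
# H = -1/2 nabla^2 + 1/2 |R|^2, trial wavefunction psi_T = exp(-alpha |R|^2).
alpha = 0.5   # alpha = 1/2 makes psi_T the exact ground state
tau = 0.01    # small time step
E_T = 1.5     # reference energy

def drift(R):
    # eq. (2): b(R) = grad(psi_T)/psi_T = -2*alpha*R for this psi_T
    return -2.0 * alpha * R

def local_energy(R):
    # E_L = (H psi_T)/psi_T = 3*alpha + (1/2 - 2*alpha^2) |R|^2
    return 3.0 * alpha + (0.5 - 2.0 * alpha**2) * np.dot(R, R)

def dmc_step(R, rng):
    # eq. (1): drifted Gaussian move
    R_new = R + tau * drift(R) + np.sqrt(tau) * rng.standard_normal(3)
    # eq. (3): branching weight of the move
    w = np.exp(-0.5 * tau * ((local_energy(R_new) - E_T)
                             + (local_energy(R) - E_T)))
    return R_new, w

rng = np.random.default_rng(0)
R = rng.standard_normal(3)
R, w = dmc_step(R, rng)
```

With α = ½, Ψ_T is the exact ground state: the local energy is constant (E_L = 3/2) and the branching weight is identically 1, which is a convenient sanity check of an implementation.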
Parallelizing FN-DMC

Each Monte Carlo step is carried out independently for each walker of the population. The algorithm can thus be easily parallelized over an arbitrary number of processors by distributing the walkers among them, but doing so implies synchronizations of the CPUs, since the branching step requires that all the walkers have first finished their drifted-diffusion step. To avoid this, we have chosen to let each CPU core manage its own population of walkers without any communication between the populations. On each computing unit a population of walkers is propagated and the various averages of interest are evaluated. At the end of the simulation, the
*A word of caution is necessary here. Although the FN error on total energies is indeed usually very small compared with typical errors of standard computational chemistry methods, this error can still be large enough to have a non-negligible impact on the small energy differences of interest in chemistry (binding energies, energy barriers, electronic affinities, etc.). Accordingly, having at our disposal nodal hypersurfaces of sufficient quality for a general molecular system remains an important issue for QMC approaches.
averages obtained on each processor are collected and summed up to give the final answers. Regarding parallelism, the situation is thus ideal since, apart from the negligible initial/final data transfers, there are no communications among processors. The only practical problem left with FN-DMC is that the branching process causes fluctuations in the population size and thus may lead to load-balancing problems among processors. More precisely, nothing prevents the population size from decreasing or increasing indefinitely during the Monte Carlo iterations. To escape from this, a common solution consists in forcing the number of walkers not to deviate too much from some target value for the population size by introducing a population-control step. It is usually realized by monitoring in time the value of the reference energy E_T via a feedback mechanism; see, for example, Ref. [13]. The price to pay is the introduction of some transient load imbalances and interprocessor communications/synchronizations to redistribute walkers among computing cores, inevitably degrading the parallel speedup. This solution has been adopted by several groups and some tricks have been proposed to keep this problem under control.[9–11,14] Here, we propose to avoid this problem from the outset by using a variant of FN-DMC working with a constant number of walkers. Several proposals can be found in the literature, for example, Refs. [15,16]. Here, we shall use the method described in Ref. [16]. In this approach, the branching step of standard DMC is replaced by a so-called reconfiguration step. Defining the normalized branching weights as

    p_k = w_k / Σ_{i=1}^{M} w_i    (5)
the population of walkers is "reconfigured" by drawing at each step M walkers among the M current walkers according to the probabilities p_k. At infinite population, the normalization factor Σ_{i=1}^{M} w_i is a constant and this step reduces to the standard branching step, where walkers are deleted or duplicated proportionally to the weight w. At finite M, the normalization factor fluctuates and a finite-population bias is introduced. A simple way to remove this error and to recover the exact averages consists in attaching to the averages a global weight given by the product of the normalization factors of all preceding generations, thus compensating exactly for the same product introduced into the dynamics by the successive reconfiguration steps. The price to pay is some increase of the statistical fluctuations due to the presence of an additional fluctuating weight. However, this increase rapidly becomes very moderate as M is increased. In practice, thanks to this algorithm free of finite-population bias, rather small walker populations can be used on each core (typically, we use 10–100 walkers per core). For all details, the reader is referred to Ref. [16].
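The reconfiguration step of eq. (5) can be sketched as follows (a minimal illustration with hypothetical array shapes, not code from QMC=Chem): M walkers are redrawn among the current M according to the probabilities p_k, and each generation contributes a factor Σ_i w_i / M to a global weight whose running product restores bias-free averages.

```python
import numpy as np

def reconfigure(walkers, weights, rng):
    """One reconfiguration step, eq. (5): draw M walkers among the current M
    with probabilities p_k = w_k / sum_i w_i.  Returns the new population and
    the generation's normalization factor sum_i w_i / M, whose running product
    is the global weight restoring bias-free averages."""
    M = len(walkers)
    p = weights / weights.sum()
    idx = rng.choice(M, size=M, p=p)   # resampling keeps the population size fixed
    return walkers[idx], weights.sum() / M

# Toy population: 100 walkers of 3 coordinates each (hypothetical sizes)
rng = np.random.default_rng(1)
walkers = rng.standard_normal((100, 3))
weights = np.exp(0.01 * rng.standard_normal(100))  # weights close to 1, as for a small time step
new_walkers, norm = reconfigure(walkers, weights, rng)
```

Because the draw is done with replacement according to p_k, the population size never fluctuates, which is precisely what removes the load-balancing problem discussed above.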
Critical CPU part

At each Monte Carlo step, the CPU effort is almost completely dominated by the evaluation of the wavefunction Ψ_T and its first and second derivatives (computational hot spot). More precisely, for each walker the values of the trial wavefunction Ψ_T, its first derivatives with respect to all 3N coordinates [drift vector, eq. (2)], and its Laplacian ∇²Ψ_T [kinetic part of the local energy, eq. (4)] are to be calculated. It is essential that such calculations be as efficient as possible since in realistic applications their number may be very large (typically of the order of 10^9 to 10^12). A common form for the trial wavefunction is

    Ψ_T(R) = e^{J(R)} Σ_{K=(K↑,K↓)} c_K Det_{K↑}(r_1, ..., r_{N↑}) Det_{K↓}(r_{N↑+1}, ..., r_N)    (6)

where the electron coordinates of the N↑ (respectively, N↓) electrons of spin ↑ (respectively, ↓) have been distinguished, N = N↑ + N↓. In this formula, e^{J(R)} is the Jastrow factor describing explicitly the electron–electron interactions at different levels of approximation. A quite general form may be written as

    J(R) = Σ_{i,a} U^{(en)}(r_{ia}) + Σ_{i,j} U^{(ee)}(r_{ij}) + Σ_{a,i,j} U^{(een)}(r_{ij}, r_{ia}, r_{ja}) + ...    (7)

where r_{ij} = |r_i − r_j| is the interelectronic distance and r_{ia} = |r_i − Q_a| is the distance between electron i and nucleus a located at Q_a. Here, the U's are simple functions and various expressions have been used in the literature. The Jastrow factor being essentially local, short-ranged expressions can be used and the calculation of this term is usually a small contribution to the total computational cost. As a consequence, we shall not discuss further the computational aspect of this term here. The second part of the wavefunction describes the shell structure in terms of single-electron MOs and is written as a linear combination of products of two Slater determinants, one for the ↑ electrons and the other for the ↓ electrons. Each Slater matrix is built from a set of MOs φ_i(r) usually obtained from a preliminary DFT or self-consistent field (SCF) calculation. The N_orb molecular orbitals (MOs) are expressed as a sum over a finite set of N_basis basis functions [atomic orbitals (AOs)]

    φ_i(r) = Σ_{j=1}^{N_basis} a_{ij} χ_j(r)    (8)

where the basis functions χ_j(r) are usually expressed as a product of a polynomial and a linear combination of Gaussian functions. In the present article, the following standard form is used

    χ(r) = (x − Q_x)^{n_x} (y − Q_y)^{n_y} (z − Q_z)^{n_z} g(r)    (9)

with

    g(r) = Σ_k c_k e^{−γ_k (r − Q)²}    (10)

Here, Q = (Q_x, Q_y, Q_z) is the vector position of the nucleus-center of the basis function, n = (n_x, n_y, n_z) a triplet of positive
integers, g(r) is the spherical Gaussian component of the AO, and γ_k its exponents. The determinants corresponding to spin-↑ electrons are expressed as

    Det_{K↑}(r_1, ..., r_{N↑}) = det | φ_{i_1}(r_1)      ...  φ_{i_1}(r_{N↑})      |
                                     | ...               ...  ...                  |
                                     | φ_{i_{N↑}}(r_1)   ...  φ_{i_{N↑}}(r_{N↑})  |    (11)
where K↑ is a compact notation denoting the set of indices {i_1, ..., i_{N↑}} specifying the subset of the MOs used for this particular Slater matrix. A similar expression is written for spin-↓ electrons. In contrast to the calculation of the Jastrow factor, the evaluation of the determinantal part of the wavefunction and its derivatives is critical. To perform such calculations, we use a standard approach[7] consisting in calculating the matrices of the first and second (diagonal) derivatives of each MO φ_i with respect to the three space variables l = x, y, z evaluated at each electron position r_j, namely,

    D^{(1)}_{l,ij} ≡ ∂φ_i(r_j)/∂x_{lj}    (12)

    D^{(2)}_{l,ij} ≡ ∂²φ_i(r_j)/∂x²_{lj}    (13)
and then computing the inverse D^{−1} of the Slater matrix defined as D_{ij} = φ_i(r_j). The drift components and the Laplacian corresponding to the determinantal part of the trial wavefunction are thus evaluated as simple vector products

    (1/Det(R)) ∂Det(R)/∂x_{li} = Σ_{j=1}^{N} D^{(1)}_{l,ij} (D^{−1})_{ji}    (14)

    (1/Det(R)) ∂²Det(R)/∂x²_{li} = Σ_{j=1}^{N} D^{(2)}_{l,ij} (D^{−1})_{ji}    (15)
From a numerical point of view, the computational time T needed to evaluate such quantities as a function of the number of electrons N scales as O(N³):

    T = aN³ + bN³    (16)

The first N³ term results from the fact that the N² matrix elements of the Slater matrices are to be computed, each element being expressed in terms of the N_basis ∼ N basis functions needed to reproduce an arbitrary delocalized MO. The second N³ term is associated with the generic cubic scaling of any linear algebra method for inverting a general matrix.
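The vector-product formulas (14) and (15) can be checked numerically on a small random "Slater matrix". The sketch below is ours (not code from QMC=Chem) and verifies the drift formula, eq. (14): since only column j0 of D depends on the coordinates of electron j0, the logarithmic derivative of the determinant reduces to a dot product with one row of D^{−1}.

```python
import numpy as np

rng = np.random.default_rng(2)
N = 6
D = rng.standard_normal((N, N))   # stand-in Slater matrix, D[i, j] = phi_i(r_j)
D1 = rng.standard_normal(N)       # stand-in for d(phi_i)/dx at electron j0, for all MOs i
j0 = 3                            # electron whose x coordinate is differentiated

# eq. (14): only column j0 of D depends on r_{j0}, hence
# (1/det D) d(det D)/dx = sum_i D1[i] * (D^-1)[j0, i]
Dinv = np.linalg.inv(D)
analytic = D1 @ Dinv[j0, :]

# Finite-difference check: perturb column j0 of D along D1
eps = 1e-6
Dp = D.copy(); Dp[:, j0] += eps * D1
Dm = D.copy(); Dm[:, j0] -= eps * D1
numeric = (np.linalg.det(Dp) - np.linalg.det(Dm)) / (2 * eps * np.linalg.det(D))
```

The same identity with the second-derivative matrix D^{(2)} gives the Laplacian contribution of eq. (15); in production codes the inverse is of course updated rather than recomputed from scratch.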
Exploiting the Highly Localized Character of Atomic Basis Functions

As seen in the previous section, one of the two computational hot spots of QMC is the calculation of the derivatives of the determinantal part of the trial wavefunction for each electronic configuration (r_1, ..., r_N) at each Monte Carlo step. To be
more precise, the N_orb MOs used in the determinantal expansion (6) are to be computed (here, their values will be denoted as C1) together with their first derivatives with respect to x, y, and z (denoted C2, C3, C4) and their Laplacians (denoted C5). Calculations are made in single precision using an efficient matrix product routine we describe now. The matrix products involve the matrix of the MO coefficients a_ij, eq. (8) (here denoted as A), the matrix of the atomic Gaussian basis functions evaluated at all electronic positions, χ_j(r_i) (denoted B1), their first derivatives (denoted B2, B3, B4), and their Laplacians (denoted B5). The five matrix products are written under the convenient form

    C_i = A B_i,    i = 1, ..., 5    (17)
Note that matrix A remains constant during the simulation, whereas the matrices B_i and C_i depend on the electronic configuration. The matrix sizes are as follows: N_orb × N for the C_i's, N_orb × N_basis for A, and N_basis × N for the B_i's. In practical applications, N_orb is of the order of N, whereas N_basis is greater than N by a factor of 2 or 3 for standard calculations, and much more when using high-quality larger basis sets. The expensive part is essentially dominated by the N_basis multiplications required for each matrix element. The total computational effort is thus of order N_orb × N × N_basis, that is, O(N³). The standard approach proposed in the literature for reducing this N³ price is to resort to so-called linear-scaling or O(N) techniques.[17–22] The basic idea consists in introducing spatially localized MOs instead of the standard delocalized (canonical) ones obtained from the diagonalization of reference Hamiltonians (usually, Hartree–Fock or Kohn–Sham). Since localized orbitals take their value in a finite region of space—usually in the vicinity of a fragment of the molecule—the number of basis functions N_basis needed to represent them with sufficient accuracy becomes essentially independent of the system size (not scaling with N as in the case of canonical ones). In addition, each electron contributes only to a small subset of the localized orbitals (those nonvanishing in the region where the electron is located). As a consequence, the number of nonvanishing matrix elements of the C_i matrices no longer scales as N_orb × N ∼ N² but linearly with N. Furthermore, each matrix element, whose computation was proportional to the number of basis functions used, N_basis ∼ N, is now calculated in a finite time independent of the system size. Putting together these two results, we are led to a linear dependence of the computation of the C_i matrices on the number of electrons. Here, we choose to follow a different path.
Instead of localizing the canonical MOs, we propose to take advantage of the localized character of the underlying atomic Gaussian basis functions. The advantages are essentially threefold:
1. The atomic basis functions are naturally localized independently of the shape of the molecule. This is the most important point, since the localization procedures are known to be effective for chemical systems made of well-separated subunits (e.g., linear systems) but much less so for the general compact molecular systems that are ubiquitous in chemistry.
Figure 1. Molecular systems used as benchmarks.
2. The degree of localization of the standard atomic Gaussian functions is much larger than that obtained for MOs after localization (see results below).
3. By using the product form, eq. (17), the localized nature of the atomic Gaussian functions can be exploited very efficiently (see next section).

In practice, when the value of the spherical Gaussian part g(r) of an AO function χ(r) is smaller than a given threshold ε = 10^-8, the value of the AO, its gradients, and its Laplacian are considered null. This property is used to consider the matrices B1, ..., B5 as sparse. However, in contrast with linear-scaling approaches, the MO matrix A is not considered here as sparse. We shall come back to this point later. To accelerate the calculations, an atomic radius is computed as the distance beyond which all the Gaussian components g(r) of the AOs χ(r) centered on the nucleus are less than ε. If an electron is farther than the atomic radius, all the AO values, gradients, and Laplacians centered on the nucleus are set to zero.

The practical implementation of the matrix products is as follows. For each electron, the list of indices (array "indices" in what follows) where g(r) > 0 is calculated. Then, the practical algorithm can be written as

C1 = 0.
C2 = 0.
C3 = 0.
C4 = 0.
C5 = 0.
do i = 1, Number of electrons
  do k = 1, Number of nonzero AOs for electron i
    do j = 1, Number of molecular orbitals
      C1(j, i) += A(j, indices(k, i)) * B1(k, i)
      C2(j, i) += A(j, indices(k, i)) * B2(k, i)
      C3(j, i) += A(j, indices(k, i)) * B3(k, i)
      C4(j, i) += A(j, indices(k, i)) * B4(k, i)
      C5(j, i) += A(j, indices(k, i)) * B5(k, i)
    end do
  end do
end do

(where x += y denotes x = x + y). This implementation takes into account the sparsity of the B matrices while keeping the efficiency due to a possible vectorization of the inner loop. The load/store ratio is 6/5 (6 load-from-memory instructions, 5 store-to-memory instructions) in the inner loop: the elements of Bn are constant in the inner loop (kept in registers), and the same element of A is used in each line of the inner loop (loaded once per loop cycle). As store operations are more expensive than load operations, increasing the load/store ratio improves performance, as will be shown in the next section. Using this algorithm, the scaling of the matrix products is expected to drop from O(N³) to roughly O(N²) (in a regime where N is large enough; see the discussion in the next section).

Let us now illustrate such a property in the applications to follow. The different systems used here as benchmarks are represented in Figure 1. The trial wavefunctions used for describing each system are standard Hartree–Fock wavefunctions (no Jastrow factor) with MOs expressed using various Gaussian basis sets. System 1 is a copper complex with four ligands having 158 electrons and described with a cc-pVDZ basis set. System 2 is a polypeptide taken from Ref. [23] (434 electrons, 6-31G* basis set). System 3 (not shown in Figure 1) is identical
Table 1. System sizes, percentage of nonzero molecular orbital coefficients, and average percentage of nonzero atomic orbital values.

                                                        Smallest   β-strand   β-strand TZ   1ZE7    1AMB
                                                        system
Number of electrons, N                                  158        434        434           1056    1731
Number of basis functions, N_basis                      404        963        2934          2370    3892
% of nonzero[a] canonical MO coefficients a_ij          99.4%      76.0%      81.9%         72.0%   66.1%
  (A_ij ≠ 0)
% of nonzero[a] localized MO coefficients a_ij          81.3%      48.4%      73.4%         49.4%   37.1%
  (A_ij ≠ 0)
Average % of nonzero[b] basis functions χ_i(r_j)        40.3%      19.1%      9.0%          6.5%    4.5%
  (B1_ij ≠ 0)
Average number of nonzero elements per column of B1     163        184        266           155     175
Maximum number of nonzero elements per column of B1     251        298        394           246     305

[a] Zero MO coefficients are those below 10^-5. [b] Zero AO matrix elements are those for which the radial component of the basis function has a value below 10^-8 for given electron positions.
to System 2 but using a larger basis set, namely, the cc-pVTZ basis set. System 4 is the 1ZE7 molecule from the Protein Data Bank (1056 electrons, 6-31G*), and System 5 is the 1AMB molecule from the Protein Data Bank (1731 electrons, 6-31G*). Table 1 shows the level of sparsity of the matrices A (A_ij ≡ a_ij) and B1 (B1_ij ≡ χ_i(r_j)) for the five systems (the matrices Bn with n > 1 behave as B1 with respect to sparsity). As seen, the number of basis functions used is proportional to the number of electrons, with a factor ranging from about 2.2 to 6.8. Regarding the matrix A of MO coefficients, the results are given both for standard canonical (delocalized) MOs and for localized orbitals. To get the latter, different localization schemes have been applied.[24–26] However, they essentially lead to similar results. Here, the results presented are those obtained by using the Cholesky decomposition of the density matrix expressed in the AO basis set.[26] As seen, the level of sparsity of the matrix A is low. Although it increases here with the system size, it remains modest for the largest size (there is still about one third of nonzero elements). Of course, such a result strongly depends on the type of molecular system considered (compact or not) and on the diffuse character of the atomic basis set. Here, we have considered typical systems of biochemistry. Next, the level of sparsity of the B matrices is illustrated. The percentage of nonzero values of χ_i(r_j) has been obtained as an average over a variational Monte Carlo (VMC) run. In sharp contrast with the MOs, the AOs are much more localized, thus leading to a high level of sparsity. For the largest system, only 3.9% of the basis function values are nonnegligible.
In the last line of the table, the maximum number of nonzero elements obtained over all the columns of the matrix during the entire Monte Carlo simulation is given. A first remark is that this number is roughly constant for all system sizes. A second remark is that the maximum number of nonzero values is only slightly greater than the average, thus showing that the B matrices can be considered sparse during the whole simulation, not only on average. As an important consequence, the loop over the number of nonzero AOs for each electron in the practical algorithm presented above (loop over the index k) is expected to be roughly constant as a function of the size at each Monte Carlo step. This latter remark implies for this part an expected behavior of order O(N²) for large N. Let us now have a closer look at the actual performance of the code.
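The screening and sparse-product ideas above can be sketched in NumPy. This is a minimal toy illustration (random s-type Gaussian basis and hypothetical sizes of our choosing; array names follow the paper's A, B1, C1): AO values below the threshold ε = 10^-8 are set to exactly zero, and C1 = A·B1 is accumulated only from the nonzero entries of each electron's column, mirroring the triple loop with the indices array.

```python
import numpy as np

rng = np.random.default_rng(3)
n_orb, n_basis, n_elec = 8, 20, 6   # hypothetical toy sizes
EPS = 1e-8                          # screening threshold on g(r), as in the paper

# Toy basis: one s-type Gaussian per AO (the paper uses polynomial x contracted-Gaussian AOs)
centers = rng.uniform(-4.0, 4.0, (n_basis, 3))
gammas = rng.uniform(0.5, 2.0, n_basis)
A = rng.standard_normal((n_orb, n_basis))        # dense MO-coefficient matrix
r_elec = rng.uniform(-4.0, 4.0, (n_elec, 3))     # electron positions

# B1[j, i] = chi_j(r_i): AO values, screened to exact zeros below EPS
B1 = np.zeros((n_basis, n_elec))
for i in range(n_elec):
    d2 = np.sum((r_elec[i] - centers) ** 2, axis=1)
    g = np.exp(-gammas * d2)
    g[g < EPS] = 0.0
    B1[:, i] = g

# Sparse accumulation mirroring the paper's triple loop with the 'indices' array:
# only the nonzero AOs of each electron contribute; the loop over MOs is vectorized.
C1 = np.zeros((n_orb, n_elec))
for i in range(n_elec):
    for k in np.nonzero(B1[:, i])[0]:
        C1[:, i] += A[:, k] * B1[k, i]
```

Because the screened entries are exactly zero, the sparse accumulation reproduces the dense product while touching only the retained AOs; this is what turns the O(N³) product into roughly O(N²) when the number of nonzero AOs per electron is independent of N.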
Overall Performance of QMC=Chem

When discussing performance, several aspects must be considered. A first one, which is traditionally discussed, is the formal scaling of the code as a function of the system size N (N, number of electrons). As already noted, due to the innermost calculations, products, and inversion of matrices, such a scaling is expected to be cubic, O(N³). However, there is a second important aspect, generally not discussed, which is related to the way the expensive innermost floating-point operations are implemented and to how far and how efficiently the potential performance of the processor at hand is exploited. In what follows, we shall refer to this aspect as "single-core optimization." It is important to emphasize that this aspect is by no means minor, nor independent of the previous "mathematical" one. To make this point explicit, let us first recall that the computational time T results essentially from two independent parts, the first one resulting from the computation of the matrix elements, T1 ≈ aN³, and the second one from the inversion of the Slater matrix, T2 ≈ bN³. Now, let us imagine that we have been capable of devising a highly efficient linear-scaling algorithm for the first contribution such that T1 ≈ eN ≪ T2 within the whole range of system sizes N considered. We would naturally conclude that the overall computational cost T ≈ T2 is cubic. In the opposite case where a very inefficient linear-scaling algorithm is used for the first part, T ≈ T1 ≫ T2, we would conclude to a linear-scaling-type behavior. Of course, mathematically speaking such a way of reasoning is not correct, since scaling laws are only meaningful in the asymptotic regime where N goes to infinity. However, in practice only a finite range of sizes is considered (here, between 2 and about 2000 active electrons) and it is important to be very cautious with the notion of scaling laws.
A more correct point of view consists in looking at the global performance of the code in terms of total CPU time for a given range of system sizes, a given compiler, and a given type of CPU core. Finally, a last aspect concerns the memory footprint of the code, whose minimization turns out to be very advantageous. Indeed, the current trend in supercomputer design is to increase the number of cores more rapidly than the available total memory. As the amount of memory per core will continue to decrease, it is very likely that programs will need to have a low memory footprint to take advantage of exascale
Table 2. Single-core performance (GFlops/s) of the matrix products (single precision), inversion (double precision), and overall performance of QMC=Chem (mixed single/double precision).

                       Core2                                          Sandy Bridge
                  Products       Inversion      Overall        Products        Inversion       Overall
Linpack (DP)                     7.9 (84.9%)                                   24.3 (92.0%)
Peak              18.6           9.3                           52.8            26.4
Smallest system   9.8 (52.7%)    2.6 (28.0%)    3.3            26.6 (50.3%)    8.8 (33.3%)     6.3
β-strand          9.7 (52.2%)    4.3 (46.2%)    3.7            33.1 (62.7%)    13.7 (51.2%)    13.0
β-strand TZ       9.9 (53.2%)    4.3 (46.2%)    4.5            33.6 (63.6%)    13.7 (51.2%)    14.0
1ZE7              9.3 (50.0%)    5.2 (55.9%)    4.6            30.6 (57.9%)    15.2 (57.6%)    17.9
1AMB              9.2 (49.5%)    5.6 (60.2%)    5.0            28.2 (53.4%)    16.2 (61.4%)    17.8

The percentage of the peak performance is given in parentheses. Core2: Intel Xeon 5140, Core2 2.33 GHz, dual core, 4 MiB shared L2 cache. Sandy Bridge: Intel Xeon E3-1240, Sandy Bridge 3.30 GHz, quad core, 256 KiB L2 cache/core, 8 MiB shared L3 cache (3.4 GHz with turbo).
computers. Another point is that when less memory is used, less electrical power is needed to perform the calculation: data movement from the memory modules to the cores needs more electrical power than performing floating-point operations. Although at the present time power consumption is not yet a concern for software developers, it is a key aspect in the design of the exascale machines to come.

In this section, the results will be systematically presented using two different generations of Intel Xeon processors. The first processor, referred to as Core2, is an Intel Xeon 5140 (Core2, 2.33 GHz, dual core, 4 MiB shared L2 cache). The second one, referred to as Sandy Bridge, is an Intel Xeon E3-1240 at 3.30 GHz (quad core, 256 KiB L2 cache/core, 8 MiB shared L3 cache, 3.4 GHz with turbo). Note also that, the parallel scaling of QMC being close to ideal (see next section), single-core optimization is very interesting: the gain in execution time obtained on the single-core executable is directly transferred to the whole parallel simulation.

Improving the innermost expensive floating-point operations

For the Core2 architecture, the practical algorithm presented above may be further improved by first using the unroll-and-jam technique,[27] which consists in unrolling the outer loop and merging multiple outer-loop iterations in the inner loop:

do i = 1, Number of electrons
  do k = 1, Number of nonzero AOs for electron i, 2
    do j = 1, Number of molecular orbitals
      C1(j, i) += A(j, indices(k, i)) * B1(k, i) + A(j, indices(k+1, i)) * B1(k+1, i)
      C2(j, i) += A(j, indices(k, i)) * B2(k, i) + A(j, indices(k+1, i)) * B2(k+1, i)
      ...
    end do
  end do
end do

To avoid register spilling, the inner loop is split in two loops: one computing C1, C2, C3 and a second computing C4, C5. The load/store ratio is improved from 6/5 to 5/3 and 4/2.
For the Sandy Bridge architecture, the outer loop is unrolled four times instead of two, and the innermost loop is split in three loops: one computing C1, C2, a second computing C3, C4, and a third computing C5. The load/store ratio is improved from 6/5 to 6/2 and 5/1. Then, all arrays were 256-bit aligned using compiler directives, and the first dimensions of all arrays were set to a multiple of eight elements (if necessary, padded with zeros at the end of each column) to force a 256-bit alignment of every column of the matrices. These modifications allowed the compiler to use only vector instructions to perform the matrix products, both with the Streaming SIMD Extensions (SSE) and the Advanced Vector Extensions (AVX) instruction sets. The x86_64 version of the MAQAO framework[28] indicates that, as the compiler unrolled the third loop (C5) twice, these three loops perform 16 floating-point operations per cycle, which is the peak performance on this architecture. Finally, to improve the cache-hit probability, blocking was used on the first dimension of Bn (loop over k). In each block, the electrons (columns of B) are sorted by ascending first element of the indices array in the block. This increases the probability that columns of A will be in the cache for the computation of the values associated with the next electron. The results obtained using the Intel Fortran Compiler XE 2011 are presented in Table 2 for both the Core2 and the Sandy Bridge architectures. The single-core double-precision Linpack benchmark is also given for comparison. The results show that the full performance of the matrix products is already reached for the smallest system. However, as opposed to dense matrix product routines, we could not approach the peak performance of the processor any further, since the number of memory accesses scales as the number of floating-point operations (both O(N²)): the limiting factor is inevitably the data access.
Nevertheless, the DECAN tool[29] revealed that data access only adds a 30% penalty to the pure arithmetic time, indicating an excellent use of the hierarchical memory and of the prefetchers.

Single-core performance

Computational cost as a function of the system size. In Table 3, the memory required together with the CPU time obtained for
FULL PAPER
WWW.CCHEM.ORG
Table 3. Single-core memory consumption and elapsed time for one VMC step.

                            Core2                      Sandy Bridge
System            RAM (MiB)  QMC step (s)  Inv.  Prod.  QMC step (s)  Inv.  Prod.
Smallest system       9.8      0.0062      15%   25%      0.0026      12%   24%
β-strand               31      0.0391      31%   23%      0.0119      26%   22%
β-strand TZ            65      0.0524      21%   35%      0.0187      17%   32%
1ZE7                  133      0.2723      47%   21%      0.0860      42%   21%
1AMB                  313      0.9703      58%   18%      0.3042      52%   20%
Values in % represent the percentage of the total CPU time. Core2: Intel Xeon 5140, Core2 2.33 GHz, Dual core, 4 MiB shared L2 cache. Sandy Bridge: Intel Xeon E3–1240, Sandy Bridge 3.30 GHz, Quad core, 256 KiB L2 cache/core, 8 MiB shared L3 cache (3.4 GHz with turbo).
one VMC step for the five systems is presented using both processors. The two most expensive computational parts (matrix products and inversion) are distinguished. A first remark is that the trends for both processors are very similar, so we do not need to distinguish them at this point. A second remark is that the memory footprint of QMC=Chem is particularly low. For the biggest size considered (1731 electrons), the amount of RAM needed is only 313 MiB. Finally, another important remark is that for small numbers of electrons the multiplicative part is dominant, while this is no longer the case at larger sizes. Here, the change of regime is observed somewhere between 400 and 1000 electrons, but its precise location depends strongly on the number of basis functions used. For example, for systems 3 and 4, corresponding to the same molecule with a different number of basis functions, the multiplicative part is still dominant for the larger basis set (β-strand with cc-pVTZ) while it is no longer so for the smaller basis set (β-strand with 6-31G*). In Figure 2, the total computational time on the Sandy Bridge core is plotted as a function of the number of electrons. A standard fit of the curve with a polynomial form N^c leads to a value of c of about 2.5. However, as discussed above, such a power is not really meaningful. From the data of Table 3, it is easy to extract the pure contribution of the inversion, and a factor very close to 3 is obtained, illustrating that for this linear-algebra part we are in the asymptotic regime. For the multiplicative part, the pure N² behavior is not yet recovered and we are in an intermediate regime. Putting these two situations together leads to some intermediate scaling of around 2.5.

Sparsity. In our practical algorithm, for the matrix products we have chosen to consider the B matrices as sparse, as opposed to the A matrix, which is considered dense.
The reason is that considering the matrix A as sparse would not allow us to write a stride-one inner loop. In single precision, SSE instructions executed on Intel processors can perform up to eight floating-point operations per CPU cycle (one four-element vector ADD instruction and one four-element vector MUL instruction in parallel). Using the latest AVX instruction set available on the Sandy Bridge architecture, the width of the SIMD vector registers has been doubled, and the CPU can now perform up to 16 floating-point operations per cycle. A necessary condition for enabling vectorization is a stride-one access to the data.
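The sparse-B/dense-A product described here can be sketched in pure Python/NumPy (an illustration of the algorithm only, not the vectorized Fortran kernel):

```python
import numpy as np

def sparse_b_product(A, B):
    """C = A @ B exploiting the sparsity of B only: A stays dense, so each
    update runs over a whole column of A (stride-one in the column-major
    Fortran layout), which is what allows the compiler to vectorize it."""
    C = np.zeros((A.shape[0], B.shape[1]))
    for j in range(B.shape[1]):              # loop over electrons
        for k in np.flatnonzero(B[:, j]):    # nonzero AOs for this electron
            C[:, j] += A[:, k] * B[k, j]     # stride-one AXPY-like update
    return C

rng = np.random.default_rng(0)
A = rng.random((5, 8))
B = rng.random((8, 3))
B[B < 0.7] = 0.0                             # make B sparse
assert np.allclose(sparse_b_product(A, B), A @ B)
```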
This implies that using a sparse representation of A would disable vectorization and reduce the maximum number of floating-point operations per cycle by a factor of four using SSE (respectively, eight using AVX). If matrix A has more than 25% (respectively, 12.5%) nonzero elements, using a sparse representation is clearly not the best choice. This last result is a nice illustration of the idea that the efficiency of a formal mathematical algorithm depends on the core architecture.

Inversion step. Now, let us consider the inversion step, which is the dominant CPU part for large enough systems (here, about a thousand electrons and more). In Table 2, the performance in GFlops/s of the inversion step is presented for both processors. For comparison, the theoretical single-core peak and single-core Linpack performances are given. For each processor, the third column gives the overall performance of the code, while the second column is specific to the inversion part. As can be seen, the performance of both parts increases with the number of electrons. For the largest systems, the performance represents more than 50% of the peak performance of each processor. For the largest system, the whole code reaches about 54% of the peak performance on the Core2 and about 61% on the Sandy Bridge. The performance is still
Figure 2. Single-core scaling with system size.
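The power-law fit mentioned above amounts to a least-squares slope in log-log space; a minimal sketch (with synthetic timings generated at c = 2.5, not the paper's data):

```python
import numpy as np

def fit_exponent(n_electrons, step_times):
    """Least-squares slope of log t vs log N, i.e. c in t ≈ a·N**c."""
    c, _ = np.polyfit(np.log(n_electrons), np.log(step_times), 1)
    return c

# Synthetic check: times generated with an exact exponent of 2.5
N = np.array([158.0, 400.0, 800.0, 1731.0])
t = 1e-8 * N**2.5
print(round(fit_exponent(N, t), 2))  # → 2.5
```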
Table 4. Number of million CPU cycles needed for the computation of the values, gradients, and Laplacians of the molecular orbitals using the Einspline package and using our implementation, for the Core2 and the Sandy Bridge microarchitectures. The ratio QMC=Chem/Einspline is also given.

                           Core2                       Sandy Bridge
System            QMC=Chem  Einspline  Ratio    QMC=Chem  Einspline  Ratio
Smallest system       16.7       15.0   1.11        9.2       10.6    0.87
β-strand TZ          177.3      113.0   1.57       81.7       80.1    1.02
1ZE7                 783.5      669.1   1.17      352.0      473.9    0.74
1AMB                2603.0     1797.8   1.45     1183.9     1273.5    0.93
better for the inversion part: 60.2% for the Core2 and 61.4% for the Sandy Bridge.

Determinant calculation compared to spline interpolation. Most authors use three-dimensional spline representations of the MOs to compute in constant time the values, first derivatives, and Laplacians of one electron in one MO, independently of the size of the atomic basis set. This approach seems efficient at first sight, but its major drawback is that the memory required on a single processor can rapidly become prohibitive, since each MO has to be precomputed on a three-dimensional grid. To overcome this large-memory problem, these authors use shared-memory approaches on the computing nodes, which implies coupling between the different CPU cores. In this paragraph, we compare the wall time needed for the spline interpolation or the computation of the values, first derivatives, and Laplacians of the wave function at all electron positions. Version 0.9.2 of the Einspline package[30] was used as a reference to compute the interpolated values, gradients, and Laplacians of 128 MOs represented on 23 × 21 × 29 single-precision arrays. The "multiple uniform splines" set of routines was used. To evaluate the value, gradient, and Laplacian of one MO at one electron coordinate, an average of 1200 CPU cycles was measured using LIKWID[31] on the Core2 processor versus 850 CPU cycles on the Sandy Bridge processor. Even though the interpolation uses a very small amount of data and few floating-point operations, it is bound by the memory latency. Indeed, the needed data are very unlikely to be in the CPU cache, and this explains why the number of cycles per matrix element is quite large. As our code uses a very small amount of memory, and as the computationally intensive routines are very well vectorized by the compiler, the computation of the matrix elements is bound by the floating-point throughput of the processor.
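As an order-of-magnitude check, the per-element cycle counts above can be scaled to a full set of matrices. A sketch (the closed-shell split Nα = Nβ = 79 for the 158-electron system is our assumption, and the function name is ours):

```python
# One value/gradient/Laplacian evaluation is needed per (MO, electron) pair,
# giving N_alpha**2 + N_beta**2 matrix elements per electron configuration.
def total_cycles(cycles_per_element, n_alpha, n_beta):
    return cycles_per_element * (n_alpha**2 + n_beta**2)

# 158-electron system, Core2: ~1200 cycles per interpolated element,
# which lands close to the 15 million cycles reported for Einspline in Table 4.
print(total_cycles(1200, 79, 79) / 1e6)
```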
The number of cycles needed to build the C1 … C5 matrices is the number of cycles needed for one matrix element scaled by the number of matrix elements, Nα² + Nβ². Table 4 shows the number of CPU cycles needed to build the full C1 … C5 matrices for a new set of electron positions using spline interpolation or using computation. The computation includes the evaluation of the values, gradients, and Laplacians of the AOs (matrices B1 … B5) followed by the matrix products. Using a rather small basis set (6-31G*), the computation of the matrices in the 158-electron system is only 10% slower than the interpolation on the Core2 architecture. Using a larger basis set (cc-pVTZ), the computation is only 57% slower. As the frequency is higher in our Sandy Bridge processor than in our Core2 processor, we would have expected the number of cycles of one memory latency to increase, and therefore we would have
expected the Einspline package to be less efficient on that specific processor. One can remark that the memory latencies have been dramatically improved from the Core2 to the Sandy Bridge architecture, and the number of cycles for the interpolation decreases. The full computation of the matrix elements benefits from the improvement in the memory accesses, but also from the enlargement of the vector registers from 128 to 256 bits. This higher degree of vectorization considerably reduces the number of cycles needed to perform the calculation, such that in the worst case (the largest basis set) the full computation of the matrix elements takes as much time as the interpolation. In all other cases, the computation is faster than the spline interpolation. Finally, let us mention that, as the memory controller is directly attached to the CPU, on multi-socket computing nodes the memory latencies are higher when accessing a memory module attached to another CPU (non-uniform memory access, NUMA, architecture).
Parallelism: Implementing a Universal, Dynamic, and Fault-Tolerant Scheme

Our objective was to design a program that could take maximum advantage of heterogeneous clusters, grid environments, and the petaflops platforms available now, as well as those to come soon (exascale). To achieve the best possible parallel speedup on any hardware, all the parallel tasks have to be completely decoupled. Feldman et al. have shown that a naive implementation of parallelism does not scale well on commodity hardware.[32] Such bad scalings are also expected on very large-scale simulations. Therefore, we chose an implementation where each CPU core performs a QMC run with its own population of walkers, independently of all the other CPU cores. The run is divided into blocks over which the averages of the quantities of interest are computed. The only mandatory communications are the one-to-all communication of the input data and the all-to-one communication of the results, each result being the Monte Carlo average computed by a single-core executable. If a single-core executable is able to start as soon as the input data are available and to stop at any time, sending an average over all the computed Monte Carlo steps, the best possible parallel speedup on the machine can always be obtained. This aspect is detailed in this section.

Fault-tolerance

Fault-tolerance is a critical aspect, since the probability of failure increases with the number of hardware components: using N identical computing nodes for a single run multiplies by
Figure 3. Overview of the QMC=Chem architecture.
N the probability of failure of the run. If one computing node is expected to fail once a year, a run using 365 computing nodes is not expected to last more than a day. As our goal is to use both massive resources and the commodity clusters found in laboratories, hardware failure is at the center of our software design. The traditional choice for the implementation of parallelism is the Message Passing Interface (MPI).[33] Efficient libraries are available on every parallel machine, and it is probably the best choice in most situations. However, all the complex features of MPI are not needed for our QMC program, and it does not really fit our needs: in the usual MPI implementations, the whole run is killed when one parallel task is known not to be able to reach the MPI_Finalize statement. This situation occurs when a parallel task is killed, often due to a system failure (I/O error, frozen computing node, hardware failure, etc.). For deterministic calculations, where the result of every parallel task is required, this mechanism prevents unexpected deadlocks by immediately stopping a calculation that will never end. In our implementation, as the result of the calculation of a block is a Gaussian-distributed random variable, removing the result of a block from the simulation is not a problem, since doing so does not introduce any bias in the final result. Therefore, if one computing node fails, the rest of the simulation survives. We wrote a simple Python TCP client/server application to handle parallelism. To increase the effective bandwidth, all network transfers are compressed using the Zlib library,[34] and the results are transferred asynchronously in large packets containing a collection of small messages. Similarly, the storage of the results is executed using a non-blocking mechanism. The computationally intensive parts were written using the IRPF90 code generator,[35] to produce efficient Fortran code that is also easy to maintain.
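The batching-plus-compression idea can be sketched as follows (our illustration, not the actual QMC=Chem wire format):

```python
import json
import zlib

# Many small result messages are batched into one packet and zlib-compressed
# before the TCP transfer, which increases the effective bandwidth.
def pack(messages):
    return zlib.compress(json.dumps(messages).encode())

def unpack(packet):
    return json.loads(zlib.decompress(packet).decode())

batch = [{"block": i, "e_loc": -109.0 + 0.01 * i} for i in range(100)]
packet = pack(batch)
assert unpack(packet) == batch
print(len(packet) < len(json.dumps(batch).encode()))  # → True
```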
The architecture of the whole program is displayed in Figure 3.

Program interface

Concerning the interaction of the user with the program, our choice was not to use the usual "input file and output file" structure. Instead, we chose to use a database containing all the input data and control parameters of the simulation, and
also the results computed by the different runs. A few simple scripts allow the user to interact with the database. This choice has several advantages:

• The input and output data are tightly linked together: it is always possible to find which input corresponds to a given output.
• If an output file is needed, it can be generated on demand using different levels of verbosity.
• Graphical and web interfaces can be trivially connected to the program.
• Simple scripts can be written by the users to manipulate the computed data in a way suiting their needs.

Instead of storing the running average as the output of a run, we store all the independent block averages in the database, and the running averages are post-processed on demand by database queries. There are multiple benefits to this choice:

• Checkpoint/restart is always available.
• It is possible to compute correlations, combine different random variables, and so on, even after the QMC run is finished.
• Combining results computed on different clusters simply consists in merging the two databases, which automatically enables the use of the program on computing grids.[36]
• Multiple independent jobs running on the same cluster can read/write in the same database to communicate via the file system. This makes it possible to gather more and more resources as they become available on a cluster, or to run a massive number of tasks in best-effort mode.†

Error checking

We define the critical data of a simulation as the input data that uniquely characterize a given simulation. For instance, the molecular coordinates, the MOs, and the Jastrow factor parameters are critical data, since they are fixed parameters of the wave function during a QMC run. In contrast, the number of walkers of a simulation is not critical data for a VMC run, since
† When cluster resources are unused, a QMC job starts. When another user requests the resources, the QMC job is killed.
Figure 4. Connections of the forwarders with the data server.
the results of two VMC simulations with different numbers of walkers can be combined together. A 32-bit cyclic redundancy check key (CRC32) is associated with the critical data to characterize a simulation. This key guarantees that the results obtained in one simulation will never be mixed with the results coming from another simulation and corrupt the database. It is also used to check that the input data have been correctly transferred to every computing node.

Program execution

When the program starts its execution, the manager process runs on the master node and spawns two other processes: a data server and a main worker process. At any time, new clients can connect to the data server to dynamically add computational resources to a running calculation, and running clients can be terminated without stopping the whole calculation. The manager periodically queries the database and computes the running averages using all the blocks stored in the database. It controls the running/stopping state of the workers by checking whether the stopping condition is reached (based, e.g., on the wall-clock time, the error bar of the average energy, a Unix signal, etc.). When running on supercomputers, the main worker process spawns a single instance of a forwarder on each computing node given by the batch scheduler, using an MPI launcher. As soon as the forwarders are started, the MPI launcher terminates, and each forwarder connects to the data server to retrieve the needed input data. The forwarder then starts multiple workers on the node with different initial walker positions. Each worker is an instance of the single-core Fortran executable, connected to the forwarder by Unix pipes. Its behavior is the following:

    while (.True.) {
      compute_a_block_of_data();
      send_the_results_to_the_forwarder();
    }

The Unix signals SIGTERM and SIGUSR2 are trapped to trigger the send_the_results_to_the_forwarder procedure, followed by the termination of the process.
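This trap-and-flush behavior might be sketched in Python (the real worker is a Fortran executable; the names and the placeholder sampler are ours):

```python
import random
import signal

# A trapped signal makes the worker send what it has computed so far and
# terminate, so no completed Monte Carlo step is ever lost.
stop_requested = False

def request_stop(signum, frame):
    global stop_requested
    stop_requested = True

signal.signal(signal.SIGTERM, request_stop)  # SIGUSR2 is trapped identically

def compute_block(n_steps=300):
    # placeholder for the Monte Carlo sampling of one block
    random.seed(0)
    return sum(random.gauss(-109.0, 0.5) for _ in range(n_steps)) / n_steps

def worker(max_blocks=5):
    block_averages = []
    while len(block_averages) < max_blocks and not stop_requested:
        block_averages.append(compute_block())  # sent to the forwarder
    return block_averages  # flushed on normal exit or on SIGTERM

print(len(worker()))  # → 5
```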
Using this mechanism, any single-core executable can be stopped immediately without losing a single Monte Carlo step. This aspect is essential for obtaining the best possible speedup on massively parallel machines. Indeed, using the matrix product presented in the previous section
makes the CPU time of a block non-constant. Without this mechanism, the run would finish when the last CPU finishes, and the parallel efficiency would be reduced when using a very large number of CPU cores. While the workers are computing the next block, the forwarder sends the current results to the data server using a path going through other forwarders. The forwarders are organized in a binary tree, as displayed in Figure 4: every node of the tree can send data to all of its ancestors, in order to deal with possible failures of computing nodes. This tree organization reduces the number of connections to the data server, and also enlarges the size of the messages by combining the results of many forwarders into a single message. At the end of each block, the last walker positions are sent from the worker to the forwarder. The forwarder keeps a fixed-size list of N_kept walkers that preserves the distribution of local energies: when a forwarder receives a set of N walkers, it appends the new walkers to its N_kept list and sorts the resulting N_kept + N list by increasing local energy. A random number g is drawn to keep the list entries at indices ⌊g + i(N_kept + N)/N_kept⌋, i = {1, …, N_kept}. After a random timeout, if the forwarder is idle, it sends its list of walkers to its parent in the binary tree, which repeats the list-merging process. Finally, the data server receives a list of walkers, merges it with its own list, and writes it to disk when idle. This mechanism ensures that the walkers saved to disk represent the whole run homogeneously, and it avoids sending all the walkers to the data server. These walkers will be used as new starting points for the next QMC run. With such a design, the program is robust to system failures. Any computing node can fail with a minimal impact on the simulation:

• If a worker process fails, only the block being computed by this worker is lost. It does not affect the forwarder to which it is linked.
• If a forwarder fails, then only one computing node is lost, thanks to the redundancy introduced in the binary tree of forwarders.
• The program execution survives short network disruptions (a fixed timeout parameter): the data arrive at the data server when the network becomes operational again.
• The disks can crash on the computing nodes: the temporary directory used on the computing nodes is a RAM-disk (/dev/shm).
• The shared file system can fail, as the single-core static executable, the Python scripts, and the input files are broadcast to the RAM-disks of the compute nodes with the MPI launcher when the run starts.
• Redundancy can be introduced on the data server by running multiple jobs using the same database. Upon a failure of a data server, only the forwarders connected to it are lost.
• In the case of a general power failure, all the calculations can be restarted without losing what has already been stored in the database.

Finally, we have left open the possibility of using different executables connected to the same forwarder. This will allow a combined use of pure CPU executables with hybrid CPU/GPU and CPU/MIC executables, in order to use all the available hardware efficiently. The extension to hybrid architectures will be the object of a future work.

Parallel speedup

The benchmarks presented in this section were performed on the Curie machine (GENCI-TGCC-CEA, France). Each computing node is a dual-socket Intel Xeon E5-2680 (2 × 8 cores, 20 MiB shared L3 cache, 2.7 GHz) with 64 GiB of RAM. The benchmark is a DMC calculation of the β-strand system with the cc-pVTZ basis set (Table 1) using 100 walkers per core performing 300 steps in each block. Note that these blocks are very short compared to realistic simulations, where the typical number of steps would be larger than 1000 to avoid correlation between the block averages.

Intra node. The CPU consumption of the forwarder is negligible (typically 1% of the CPU time spent in the single-core executables). The speedup with respect to the number of sockets is ideal. Indeed, the single-core binaries do not communicate with each other, and as the memory consumption per core is very low, each socket never uses memory modules attached to another socket. When multiple cores on the same socket are used, we observed a slowdown for each core due to the sharing of the L3 cache and memory modules.
Running 16 instances of the single-core binaries simultaneously on our benchmark machine yields an increase of 10.7% of the wall-clock time compared to running only one instance. For a 16-core run, we obtained a 14.4× speedup (the Turbo feature of the processors was deactivated for this benchmark).

Inter node. In this section, the wall-clock time is measured from the very beginning to the very end of the program execution using the standard GNU time tool. Hence, the wall-clock time includes the initialization and finalization steps. The initialization step includes:

• Checking the consistency of the input files
• Creating a gzipped tar file containing the input files (wave function parameters, simulation parameters, a pool of initial walkers), the Python scripts, and the static single-core executable needed for the program execution on the slave nodes
• MPI initialization
• Broadcasting the gzipped tar file via MPI to all the slave nodes
Figure 5. Parallel speedup of QMC=Chem with respect to 16-core compute nodes (the reference is one 16-core node).
• Extracting the tar file to the RAM-disks of the slave nodes
• Starting the forwarders
• Starting the single-core instances.
Note that, as no synchronization is needed between the nodes, the computation starts as soon as possible on each node. The finalization step occurs as follows. When the data server receives a termination signal, it sends a termination signal to all the forwarders that are leaves in the tree of forwarders. When a forwarder receives such a signal, it sends a SIGTERM signal to all the single-core binary instances of the computing node, which terminate after sending to the forwarder the averages computed over the truncated block. Then, the forwarder sends these data to its parent in the binary tree together with a termination signal, and sends a message to the data server to inform it that it has terminated. This termination step walks recursively through the tree. When all forwarders are done, the data server exits. Note that if a failure happened on a node during the run, the data server never receives the message corresponding to the termination of the corresponding forwarder. Therefore, after receiving the termination signal coming from the forwarder tree, if the data server is still running after a given timeout, it exits. We prepared a 10-min run for this section to compute the parallel speedup curve as a function of the number of 16-core nodes given in Figure 5. The data corresponding to this curve are given in Table 5. The reference for the speedup is the one-node run. The speedup for N nodes is computed as

    speedup(N) = [ t_CPU(N) / t_Wall(N) ] / [ t_CPU(1) / t_Wall(1) ]        (18)
The initialization time was 9 s for the single-node run and 22 s for the 1000-node run. The finalization time was 13 s for the single-node run and 100 s for the 1000-node run. Apart from the initialization and finalization steps (which obviously do not depend on the total execution time), the
Table 5. Data relative to the scaling curve (Fig. 5).

                 10-min run (measured)        60-min run (estimated)      180-min (est.)
Nodes         CPU (s)    Wall (s)  Speedup    CPU (s)    Wall (s)  Speedup   Speedup
1               9,627       625       1.0        57,147    3625      1.00      1.00
1 (meas.)                                        57,332    3629      1.00
10             95,721       627       9.9       570,921    3627      9.98      9.99
25            239,628       629      24.7     1,427,628    3629     24.95     24.98
50            477,295       631      49.1     2,853,295    3631     49.85     49.95
100           952,388       636      97.2     5,704,388    3636     99.52     99.84
100 (meas.)                                   5,708,422    3638     99.32
200         1,869,182       637     190.5    11,373,182    3637    198.36    199.45
400         3,725,538       648     373.3    22,733,538    3648    395.30    398.42
500         4,479,367       641     453.7    28,239,367    3641    491.98    497.31
1000        8,233,981       713     749.7    55,753,981    3713    952.50    983.86

CPU time is the cumulated CPU time spent only in the Fortran executables; Wall time is the measured wall-clock time, including the (serial) initialization and finalization steps. The 10-min runs were measured, and the longer runs are estimated from the 10-min run data. Two checks were measured for the 60-min runs with 1 and 100 nodes (rows marked "meas.").
parallel speedup is ideal. This allowed us to estimate the speedups we would have obtained for a 1-h run and for a 3-h run. For instance, to estimate the 1-h run, we added 50 min to the wall-clock time and 50 min × 16 × (number of nodes) × 0.99 to the CPU time. The 99% factor accounts for the CPU consumption of the forwarder for communications. Our simple model was checked by performing a 1-h run on one node and a 1-h run on 100 nodes. An excellent agreement with the prediction was found: a 99.5 speedup was predicted for 100 nodes and a 99.3 speedup was measured. Finally, a production run was made using 76,800 cores of Curie (4800 nodes) on the β-strand molecule with a cc-pVTZ basis set via 12 runs of 400 nodes, and a sustained performance of 960 TFlops/s was measured. All the details and scientific results of this application will be presented elsewhere (Caffarel and Scemama, Unpublished).
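The extrapolation model can be sketched as follows (constants from the text: 16 cores per node, the 99% useful-CPU factor, 50 extra minutes on top of the measured 10-min run; function names are ours):

```python
def estimate(cpu_10min, wall_10min, nodes, extra_min=50, cores=16, eff=0.99):
    """Extrapolate a measured 10-min run to a longer run."""
    cpu = cpu_10min + extra_min * 60 * cores * nodes * eff
    wall = wall_10min + extra_min * 60
    return cpu, wall

def speedup(cpu_n, wall_n, cpu_1, wall_1):
    # Eq. (18): ratio of CPU efficiencies relative to the one-node run
    return (cpu_n / wall_n) / (cpu_1 / wall_1)

# Reproduces the 60-min estimates of Table 5 from the measured 10-min data:
cpu1, wall1 = estimate(9627, 625, nodes=1)        # → 57147.0, 3625
cpu100, wall100 = estimate(952388, 636, nodes=100)
print(round(speedup(cpu100, wall100, cpu1, wall1), 1))  # → 99.5
```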
Summary

Let us summarize the main results of this work. First, to enhance the computational efficiency of the expensive innermost floating-point operations (calculation and multiplication of matrices), we propose to take advantage of the highly localized character of the atomic Gaussian basis functions, in contrast with the standard approaches using localized MOs. The advantages of relying on atomic localization have been illustrated on a series of molecules of increasing size (number of electrons ranging from 158 to 1731). In this article, it is emphasized that the notion of scaling of the computational cost as a function of the system size has to be considered with caution. Here, although the proposed algorithm is formally quadratic, it displays a small enough prefactor to become very efficient in the range of numbers of electrons considered. Furthermore, our implementation of the linear-algebra part has allowed us to highlight a fundamental issue rarely discussed, namely the importance of taking into account the close links between the algorithmic structure and the CPU core architecture. Using efficient techniques and optimization tools for enhancing single-core performance, this point has been illustrated in various situations. Note that this aspect is particularly important: as the parallel speedup is very good, the gain in execution time obtained for the single-core executable will also be effective in the total parallel simulation. In our implementation, we have chosen to minimize the memory footprint. This choice is justified first by the fact that the amount of memory per CPU core nowadays tends to decrease, and second by the fact that small memory footprints generally allow a more efficient usage of the caches. In this spirit, we propose not to use the 3D-spline representation of the MOs, as is usually done. We have shown that this can be realized without increasing the CPU cost. For our largest system, with 1731 electrons, only 313 MiB of memory per core is required. As a consequence, the key limiting factor of our code is only the available CPU time: neither the memory and disk space requirements nor the network performance are limiting. Let us re-emphasize that this feature is well aligned with the current trends in computer architecture for large HPC systems. Finally, let us conclude with the fact that there is no fundamental reason why the implementation of such a QMC simulation environment, which has been validated at the petaflops level, could not be extended to the exascale.
Acknowledgments

This work was made possible thanks to the generous computational support of CALMIP (Université de Toulouse) under the allocation 20110510, GENCI under the allocation GEN1738, CCRT (CEA), and PRACE under the allocation RA0824. The authors would also like to thank Bull, GENCI, and CEA for their help in this project.

Keywords: quantum Monte Carlo · petascale · parallel speedup · single-core optimization · large systems

How to cite this article: A. Scemama, M. Caffarel, E. Oseret, W. Jalby, J. Comput. Chem. 2013, 34, 938–951. DOI: 10.1002/jcc.23216
[1] S. Gandolfi, F. Pederiva, S. Fantoni, K. Schmidt, Phys. Rev. Lett. 2007, 98, 102503, available at: http://link.aps.org/doi/10.1103/PhysRevLett.98.102503
[2] W. Foulkes, L. Mitas, R. Needs, G. Rajagopal, Rev. Mod. Phys. 2001, 73, 33; available at: http://link.aps.org/doi/10.1103/RevModPhys.73.33
[3] M. Suzuki, Quantum Monte Carlo Methods in Condensed Matter Physics; World Scientific: Singapore, 1994.
[4] D. Ceperley, Rev. Mod. Phys. 1995, 67, 279.
[5] D. Coker, R. Watts, J. Phys. Chem. 1987, 91, 2513.
[6] M. Caffarel, P. Claverie, C. Mijoule, J. Andzelm, D. Salahub, J. Chem. Phys. 1989, 90, 990.
[7] B. Hammond, W. Lester, Jr., P. Reynolds, Monte Carlo Methods in Ab Initio Quantum Chemistry; Vol. 1 of World Scientific Lecture and Course Notes in Chemistry; World Scientific: Singapore, 1994; ISBN 9789810203221.
[8] M. Caffarel, In Encyclopedia of Applied and Computational Mathematics; B. Engquist, Ed.; Springer, 2012; available at: http://qmcchem.ups-tlse.fr/files/caffarel/qmc.pdf. Accessed December 21, 2012.
[9] K. P. Esler, J. Kim, D. M. Ceperley, W. Purwanto, E. J. Walter, H. Krakauer, S. Zhang, P. R. C. Kent, R. G. Hennig, C. Umrigar, M. Bajdich, J. Kolorenč, L. Mitas, A. Srinivasan, J. Phys.: Conf. Series 2008, 125, 1.
[10] J. Kim, K. Esler, J. McMinis, D. M. Ceperley, In Proceedings of the 2010 Scientific Discovery through Advanced Computing (SciDAC) Conference, Tennessee, 11–15 July, 2010; Oak Ridge National Laboratory.
[11] M. Gillan, M. Towler, D. Alfè, In Psi-k Highlight of the Month, February, 2011.
[12] Quantum Monte Carlo for Chemistry@Toulouse, available at: http://qmcchem.ups-tlse.fr. Accessed December 21, 2012.
[13] C. Umrigar, M. Nightingale, K. Runge, J. Chem. Phys. 1993, 99, 2865.
[14] J. Krogel, D. Ceperley, In Advances in Quantum Monte Carlo; S. Tanaka, S. Rothstein, W. Lester, Jr., Eds.; Vol. 1094 of ACS Symposium Series, 2012; pp. 13–26.
[15] M. C. Buonaura, S. Sorella, Phys. Rev. B 1998, 57, 11446.
[16] R. Assaraf, M. Caffarel, A. Khelif, Phys. Rev. E 2000, 61, 4566.
[17] A. Williamson, R. Hood, J. Grossman, Phys. Rev. Lett. 2001, 87, 246406.
[18] D. Alfè, M. J. Gillan, J. Phys.: Condens. Matter 2004, 16, 305.
[19] A. Aspuru-Guzik, R. Salomón-Ferrer, B. Austin, W. Lester, Jr., J. Comput. Chem. 2005, 26, 708.
[20] F. A. Reboredo, A. J. Williamson, Phys. Rev. B 2005, 71, 121105(R).
[21] J. Kussmann, H. Riede, C. Ochsenfeld, Phys. Rev. B 2007, 75, 165107.
[22] J. Kussmann, C. Ochsenfeld, J. Chem. Phys. 2008, 128, 134104.
[23] S. Nagarajan, J. Rajadas, E. P. Malar, J. Struct. Biol. 2010, 170, 439, ISSN 10478477, available at: http://www.sciencedirect.com/science/article/ pii/S104784771000064X [24] S. Boys, Rev. Mod. Phys. 1960, 32, 296. [25] J. Pipek, P. Mezey, J. Chem. Phys. 1989, 90, 4916. [26] F. Aquilante, T. B. Pedersen, A. Sanchez de Mers, H. Koch, J. Chem, Phys. 2006, 125, 174101, available at: http://link.aip.org/link/?JCP/125/ 174101/1 [27] N. Zingirian, M. Maresca, In HighPerformance Computing and Networking; P. Sloot, M. Bubak, A. Hoekstra, B. Hertzberger, Eds.; Springer: Berlin/Heidelberg, 1999; Vol. 1593 of Lecture Notes in Computer Science, pp. 633–642, ISBN 9783540658214, 10.1007/BFb0100624. [28] L. Djoudi, D. Barthou, P. Carribault, C. Lemuet, J. T. Acquaviva, W. Jalby, In Workshop on EPIC Architectures and Compiler Technology, San Jose, CA, 2005. [29] S. Koliai, S. Zuckerman, E. Oseret, M. Ivascot, T. Moseley, D. Quang, W. Jalby, In LCPC, 2009, pp. 111–125. [30] http://einspline.sourceforge.net. Accessed December 21, 2012. [31] J. Treibig, G. Hager, G. Wellein, In 39th International Conference on Parallel Processing Workshops (ICPPW), San Diego, CA, 2010; pp. 207– 216, ISSN 1530–2016. [32] M. T. Feldman, C. Julian, D. R. Cummings, R. Kent IV, R. P. Muller, W. A. Goddard III, J. Comput. Chem. 2008, 29, 8. [33] W. Gropp, E. Lusk, N. Doss, A. Skjellum, Parallel Comput. 1996, 22, 789. [34] http://zlib.net [35] A. Scemama, ArXiv eprints [cs.SE], 0909.5012v1, 2009, arXiv:0909.5012v1, available at: http://arxiv.org/abs/0909.5012. Accessed December 21, 2012. [36] A. Monari, A. Scemama, M. Caffarel, In Remote Instrumentation for eScience and Related Aspects; F. Davoli, M. Lawenda, N. Meyer, R. Pugliese, J. Wglarz, S. Zappatore, Eds.; Springer: New York, 2012; pp. 195–207, ISBN 9781461405085, available at: http://dx.doi.org/ 10.1007/9781461405085_13. Accessed December 21, 2012.
Received: 28 September 2012 Revised: 27 November 2012 Accepted: 1 December 2012 Published online on 3 January 2013
Journal of Computational Chemistry 2013, 34, 938–951
IRPF90: a programming environment for high performance computing

Anthony Scemama
Laboratoire de Chimie et Physique Quantiques, CNRS-UMR 5626, IRSAMC,
Université Paul Sabatier, 118 route de Narbonne, 31062 Toulouse Cedex, France

(Dated: October 22, 2009)
Abstract

IRPF90 is a Fortran programming environment which helps the development of large Fortran codes. In Fortran programs, the programmer has to focus on the order of the instructions: before using a variable, the programmer has to be sure that it has already been computed in all possible situations. For large codes, this is a common source of error. In IRPF90, most of the ordering of the instructions is handled by the preprocessor, and an automatic mechanism guarantees that every entity is built before being used. This mechanism relies on the {needs/needed by} relations between the entities, which are built automatically. Codes written with IRPF90 often execute faster than Fortran programs, are faster to write and are easier to maintain.
I. INTRODUCTION
The most popular programming languages in high performance computing (HPC) are those which produce fast executables (Fortran and C, for instance). Large programs written in these languages are difficult to maintain, and these languages are in constant evolution to facilitate the development of large codes. For example, the C++ language[1] was proposed as an improvement of the C language by introducing classes and other features of object-oriented programming. In this paper, we propose a Fortran preprocessor with a very limited number of keywords, which facilitates the development of large programs and the reusability of the code without affecting the efficiency.

In the imperative programming paradigm, a computation is an ordered list of commands that change the state of the program. At the lowest level, the machine code is imperative: the commands are the machine code instructions, and the state of the program is represented by the content of the memory. At a higher level, the Fortran language is an imperative language: each statement of a Fortran program modifies the state of the memory. In the functional programming paradigm, a computation is the evaluation of a function. This function, to be evaluated, may need to evaluate other functions. The state of the program is not known by the programmer, and the memory management is handled by the compiler.

Imperative languages are easy to understand by machines, while functional languages are easy to understand by human beings. Hence, code written in an imperative language can be made extremely efficient, and this is the main reason why Fortran and C are so popular in the field of HPC. However, codes written in imperative languages usually become excessively complicated to maintain and to debug. In a large code, it is often very difficult for the programmer to have a clear image of the state of the program at a given position in the code, especially when side effects in a procedure modify memory locations which are used in other procedures.

In this paper, we present a tool called Implicit Reference to Parameters with Fortran 90 (IRPF90). It is a Fortran preprocessor which facilitates the development of large simulation codes, by allowing the programmer to focus on what is being computed, instead of how it is computed. This last sentence often describes the difference between the functional and the imperative paradigms[2]. From a practical point of view, IRPF90 is a program written in the Python[3] language. It produces Fortran source files from IRPF90 source files. IRPF90 source files are Fortran source files with a limited number of additional statements. To explain how to use the IRPF90 tool, we will write a simple molecular dynamics program as a tutorial.
II. TUTORIAL: A MOLECULAR DYNAMICS PROGRAM

A. Imperative and functional implementation of the potential

We first choose to implement the Lennard-Jones potential[4] to compute the interaction of pairs of atoms:
V(r) = 4ε [ (σ/r)^12 − (σ/r)^6 ]    (1)
program potential_with_imperative_style
  implicit none
  double precision :: sigma_lj, epsilon_lj
  double precision :: interatomic_distance
  double precision :: sigma_over_r
  double precision :: V_lj
  print *, 'Sigma?'
  read(*,*) sigma_lj
  print *, 'Epsilon?'
  read(*,*) epsilon_lj
  print *, 'Interatomic Distance?'
  read(*,*) interatomic_distance
  sigma_over_r = sigma_lj/interatomic_distance
  V_lj = 4.d0 * epsilon_lj * ( sigma_over_r**12 &
       - sigma_over_r**6 )
  print *, 'Lennard-Jones potential:'
  print *, V_lj
end program
FIG. 1: Imperative implementation of the Lennard-Jones potential.
where r is the atom-atom distance, ε is the depth of the potential well and σ is the value of r for which the potential energy is zero. ε and σ are the parameters of the force field. Using an imperative style, one would obtain the program given in figure 1. One can clearly see the sequence of statements in this program: first read the data, then compute the value of the potential.

This program can be rewritten using a functional style, as shown in figure 2. In the functional form of the program, the sequence of operations does not appear as clearly as in the imperative example. Moreover, the order of execution of the commands now depends on the choice of the compiler: the functions sigma_over_r and epsilon_lj are both called in the expression that computes V_lj, and the order of execution may differ from one compiler to the other. The program was written in such a way that the functions have no arguments. The reason for this choice is that the references to the entities which are needed to calculate a function appear inside the function, and not outside of it. Therefore, the code is simpler to understand for a programmer who has never read this particular code, and it can easily be represented as a production tree (figure 3, above). This tree exhibits the {needs/needed by} relation between the entities of interest: the entity V_lj needs the entities sigma_over_r and epsilon_lj to be produced, and sigma_over_r needs sigma_lj and interatomic_distance.

In the imperative version of the code (figure 1), the production tree has to be known by the programmer so that he can place the instructions in the proper order. For simple programs this is not a problem, but for large codes the production tree can be so large that the programmer is likely to make wrong assumptions about the dependencies between the entities. This complexifies the structure of the code by the introduction of many different methods to compute the same quantity, and the performance of the code can be reduced due to the computation of entities which are not needed. In the functional version (figure 2), the production tree does not need to be known by the programmer: it exists implicitly through the function calls, and the evaluation of the main function is realized by exploring the tree with a depth-first algorithm.

A large advantage of the functional style is that there can only be one way to calculate the value of an entity:
program potential_with_functional_style
  double precision :: V_lj
  print *, V_lj()
end program

double precision function V_lj()
  double precision :: sigma_lj
  double precision :: epsilon_lj
  double precision :: interatomic_distance
  double precision :: sigma_over_r
  V_lj = 4.d0 * epsilon_lj() * &
       ( sigma_over_r()**12 - sigma_over_r()**6 )
end function

double precision function epsilon_lj()
  print *, 'Epsilon?'
  read(*,*) epsilon_lj
end function

double precision function sigma_lj()
  print *, 'Sigma?'
  read(*,*) sigma_lj
end function

double precision function sigma_over_r()
  double precision :: sigma_lj
  double precision :: interatomic_distance
  sigma_over_r = sigma_lj()/interatomic_distance()
end function

double precision function interatomic_distance()
  print *, 'Interatomic Distance?'
  read(*,*) interatomic_distance
end function
FIG. 2: Functional implementation of the Lennard-Jones potential.
[Figure: two production trees]

Above:
V_lj
├─ epsilon_lj
├─ sigma_over_r
│  ├─ sigma_lj
│  └─ interatomic_distance
└─ sigma_over_r
   ├─ sigma_lj
   └─ interatomic_distance

Below:
V_lj
├─ epsilon_lj
└─ sigma_over_r
   ├─ sigma_lj
   └─ interatomic_distance

FIG. 3: The production tree of V_lj. Above, the tree produced by the program of figure 2. Below, the tree obtained if only one call to sigma_over_r is made.
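The depth-first exploration mentioned above is easy to sketch in Python (the language IRPF90 itself is written in). The dictionary below hand-encodes the {needs/needed by} relations of figure 3 and is purely illustrative, not generated by any tool:

```python
# Depth-first evaluation of the production tree of figure 3: every entity
# is built only after all the entities it needs. Names are illustrative.
needs = {
    "V_lj": ["epsilon_lj", "sigma_over_r"],
    "sigma_over_r": ["sigma_lj", "interatomic_distance"],
    "epsilon_lj": [],
    "sigma_lj": [],
    "interatomic_distance": [],
}

build_order = []

def provide(entity, built):
    """Build 'entity' once, after recursively building everything it needs."""
    if entity in built:
        return
    for needed in needs[entity]:
        provide(needed, built)
    built.add(entity)
    build_order.append(entity)

provide("V_lj", set())
```

Exploring the tree this way guarantees that leaves (sigma_lj, interatomic_distance, epsilon_lj) are built before the entities that need them, and that V_lj comes last.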
double precision function sigma_over_r()
  double precision :: sigma_lj
  double precision :: interatomic_distance
  double precision, save :: last_result
  integer, save :: first_time_here = 0
  if (first_time_here == 0) then
    last_result = sigma_lj()/interatomic_distance()
    first_time_here = 1
  endif
  sigma_over_r = last_result
end function

FIG. 4: Memoized sigma_over_r function.
calling the corresponding function. Therefore, the readability of the code is improved for a programmer who is not familiar with the program. Moreover, as soon as an entity is needed, it is calculated and valid. Writing programs in this way considerably reduces the risk of using uninitialized variables, or variables that are supposed to have a given value but which have been modified by a side effect.

With the functional example, every time a quantity is needed it is computed, even if it has already been built before. If the functions are pure (with no side effects), one can implement memoization[5, 6] to reduce the computational cost: the last value of the function is saved, and if the function is called again with the same arguments, the last result is returned instead of being computed again. In the present example we chose to write functions with no arguments, so memoization is trivial to implement (figure 4). If we consider that the leaves of the production tree are constant, memoization can be applied to all the functions. The production tree of V_lj can now be simplified, as shown in figure 3, below.
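The same memoization idea can be sketched in Python with a small decorator for zero-argument functions. The function names mirror figure 4, but the numeric values are invented for the example and are not taken from the paper:

```python
# Memoization of zero-argument functions: each function is computed at most
# once, exactly like the memoized Fortran function of figure 4.
_cache = {}
calls = []   # records which functions were actually computed

def memoize(f):
    def wrapper():
        if f.__name__ not in _cache:     # plays the role of first_time_here
            _cache[f.__name__] = f()     # plays the role of last_result
        return _cache[f.__name__]
    return wrapper

@memoize
def sigma_lj():
    calls.append("sigma_lj")
    return 3.4            # invented value, instead of reading user input

@memoize
def interatomic_distance():
    calls.append("interatomic_distance")
    return 3.8            # invented value

@memoize
def sigma_over_r():
    calls.append("sigma_over_r")
    return sigma_lj() / interatomic_distance()

ratio = sigma_over_r()
ratio = sigma_over_r()    # second call returns the cached value
```

After the first evaluation, a second call to sigma_over_r() triggers no computation at all, which is exactly the simplification of the production tree shown in figure 3, below.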
B. Presentation of the IRPF90 statements
IRPF90 is a Fortran preprocessor: it generates Fortran code from source files which contain keywords specific to the IRPF90 program. The keywords understood by the IRPF90 preprocessor are briefly presented here; they will be exemplified in the next subsections with the molecular dynamics example.

BEGIN_PROVIDER ... END_PROVIDER: delimits the definition of a provider (sections II C and II D).
BEGIN_DOC ... END_DOC: delimits the documentation of the current provider (section II C).
BEGIN_SHELL ... END_SHELL: delimits an embedded script (section II E).
ASSERT: expresses an assertion (section II C).
TOUCH: expresses the modification of the value of an entity by a side effect (section II F).
FREE: invalidates an entity and frees the associated memory (section II D).
IRP_READ / IRP_WRITE: reads/writes the content of the production tree from/to disk (section II G).
program lennard_jones_dynamics
  print *, V_lj
end program

BEGIN_PROVIDER [ double precision, V_lj ]
  implicit none
  BEGIN_DOC
! Lennard-Jones potential energy.
  END_DOC
  double precision :: sigma_over_r
  sigma_over_r = sigma_lj / interatomic_distance
  V_lj = 4.d0 * epsilon_lj * ( sigma_over_r**12 &
       - sigma_over_r**6 )
END_PROVIDER

BEGIN_PROVIDER [ double precision, epsilon_lj ]
&BEGIN_PROVIDER [ double precision, sigma_lj ]
  BEGIN_DOC
! Parameters of the Lennard-Jones potential
  END_DOC
  print *, 'Epsilon?'
  read(*,*) epsilon_lj
  ASSERT (epsilon_lj > 0.)
  print *, 'Sigma?'
  read(*,*) sigma_lj
  ASSERT (sigma_lj > 0.)
END_PROVIDER

BEGIN_PROVIDER [ double precision, interatomic_distance ]
  BEGIN_DOC
! Distance between the atoms
  END_DOC
  print *, 'Inter-atomic distance?'
  read(*,*) interatomic_distance
  ASSERT (interatomic_distance >= 0.)
END_PROVIDER
FIG. 5: IRPF90 implementation of the Lennard-Jones potential.
IRP_IF ... IRP_ELSE ... IRP_ENDIF: delimits blocks for conditional compilation (section II G).
PROVIDE: explicit call to the provider of an entity (section II G).

C. Implementation of the potential using IRPF90
In the IRPF90 environment, the entities of interest are the results of memoized functions with no arguments. This representation of the data allows its organization in a production tree, which is built and handled by the IRPF90 preprocessor. The previous program may be written again using the IRPF90 environment, as shown in figure 5.

The program of figure 5 is very similar to the functional program of figure 2. The difference is that the entities of interest are no longer functions, but variables. The variable corresponding to an entity is provided by calling a providing procedure (or provider), defined between the keywords BEGIN_PROVIDER ... END_PROVIDER. In the IRPF90 environment, a provider can provide several entities (as shown with the parameters of the potential), although it is preferable to have providers that provide only one entity.

When an entity has been built, it is tagged as built. Hence, the next call to the provider will return the last computed value, and will not build the value again. This explains why, in the IRPF90 environment, the parameters of the force field are asked only once of the user.

The ASSERT keyword was introduced to allow the user to place assertions[9] in the code. An assertion specifies certain general properties of a value. It is expressed as a logical expression which is supposed to be always true; if it is not, the program is wrong. Assertions in the code provide run-time checks which can dramatically reduce the time spent finding bugs: if an assertion is not verified, the program stops with a message telling the user which assertion caused the program to fail.

The BEGIN_DOC ... END_DOC blocks contain the documentation of the provided entities. The descriptions are encapsulated inside these blocks in order to facilitate the generation of technical documentation. For each entity a man page is created, which contains the {needs/needed by} dependencies of the entity and the description given in the BEGIN_DOC ... END_DOC block. This documentation can be accessed by using the irpman command followed by the name of the entity.

The IRPF90 environment was created to simplify the work of the scientific programmer. A lot of time is spent creating Makefiles, which describe the dependencies between the source files for the compilation. As the IRPF90 tool knows the production tree, it can build the Makefiles of programs automatically, without any interaction with the user. When the user starts a project, he runs the command irpf90 --init in an empty directory. A standard Makefile is created, with the gfortran compiler[10] as the default. Then, the user starts to write IRPF90 files, which contain providers, subroutines, functions and main programs, in files characterized by the .irp.f suffix. Running make calls irpf90, and a correct Makefile is automatically produced and used to compile the code.
D. Providing arrays
Now that the basics of IRPF90 are known to the reader, we can show how simple it is to write a molecular dynamics program. As we will compute the interaction of several atoms, we will change the previous program such that we produce an array of potential energies per atom. We first need to introduce the quantity Natoms, which contains the number of atoms. Figure 6 shows the code which defines the geometrical parameters of the system. Figure 7 shows the providers corresponding to the potential energy V per atom i, where it is chosen equal to the Lennard-Jones potential energy:

V_i = V_i^LJ = Σ_{j≠i}^{Natoms} 4ε [ (σ/r_ij)^12 − (σ/r_ij)^6 ]    (2)

Figure 8 shows the providers corresponding to the kinetic energy T per atom i:

T_i = (1/2) m_i v_i^2    (3)

where m_i is the mass and v_i is the velocity vector of atom i. The velocity vector is chosen to be initialized to zero.

The dimensions of arrays are given in the definition of the provider. If an entity defines the dimension of an array, the provider of the dimensioning entity will be called before allocating the array. This guarantees that the array will always be allocated with the proper
BEGIN_PROVIDER [ integer, Natoms ]
  BEGIN_DOC
! Number of atoms
  END_DOC
  print *, 'Number of atoms?'
  read(*,*) Natoms
  ASSERT (Natoms > 0)
END_PROVIDER

BEGIN_PROVIDER [ double precision, coord, (3,Natoms) ]
&BEGIN_PROVIDER [ double precision, mass , (Natoms) ]
  implicit none
  BEGIN_DOC
! Atomic data, input in atomic units.
  END_DOC
  integer :: i,j
  print *, 'For each atom: x, y, z, mass?'
  do i=1,Natoms
    read(*,*) (coord(j,i), j=1,3), mass(i)
    ASSERT (mass(i) > 0.)
  enddo
END_PROVIDER

BEGIN_PROVIDER [ double precision, distance, (Natoms,Natoms) ]
  implicit none
  BEGIN_DOC
! distance : Distance matrix of the atoms
  END_DOC
  integer :: i,j,k
  do i=1,Natoms
    do j=1,Natoms
      distance(j,i) = 0.
      do k=1,3
        distance(j,i) = distance(j,i) + &
          (coord(k,i)-coord(k,j))**2
      enddo
      distance(j,i) = sqrt(distance(j,i))
    enddo
  enddo
END_PROVIDER
FIG. 6: Code defining the geometrical parameters of the system.

size. In IRPF90, the memory allocation of an array entity is not written by the user, but by the preprocessor. Memory can be explicitly freed using the keyword FREE. For example, deallocating the array velocity would be done using FREE velocity. If the memory of an entity is freed, the entity is tagged as not built, and it will be allocated and built again the next time it is needed.
E. Embedding scripts
The IRPF90 environment allows the programmer to write scripts inside his code. The scripting language that will interpret the script is given in brackets. The result of the shell script will be inserted in the file, and then will be interpreted by the Fortran preprocessor. Such scripts can be used to write templates, or to write in the code some information that has to be retrieved at compilation. For example, the date when the code was compiled can
BEGIN_PROVIDER [ double precision, V, (Natoms) ]
  BEGIN_DOC
! Potential energy.
  END_DOC
  integer :: i
  do i=1,Natoms
    V(i) = V_lj(i)
  enddo
END_PROVIDER

BEGIN_PROVIDER [ double precision, V_lj, (Natoms) ]
  implicit none
  BEGIN_DOC
! Lennard-Jones potential energy.
  END_DOC
  integer :: i,j
  double precision :: sigma_over_r
  do i=1,Natoms
    V_lj(i) = 0.
    do j=1,Natoms
      if ( i /= j ) then
        ASSERT (distance(j,i) > 0.)
        sigma_over_r = sigma_lj / distance(j,i)
        V_lj(i) = V_lj(i) + sigma_over_r**12 &
          - sigma_over_r**6
      endif
    enddo
    V_lj(i) = 4.d0 * epsilon_lj * V_lj(i)
  enddo
END_PROVIDER

BEGIN_PROVIDER [ double precision, epsilon_lj ]
&BEGIN_PROVIDER [ double precision, sigma_lj ]
  BEGIN_DOC
! Parameters of the Lennard-Jones potential
  END_DOC
  print *, 'Epsilon?'
  read(*,*) epsilon_lj
  ASSERT (epsilon_lj > 0.)
  print *, 'Sigma?'
  read(*,*) sigma_lj
  ASSERT (sigma_lj > 0.)
END_PROVIDER
FIG. 7: Definition of the potential.

be inserted in the source code using the example given in figure 9.

In our molecular dynamics program, the total kinetic energy E_kin is the sum over all the elements of the kinetic energy vector T:

E_kin = Σ_{i=1}^{Natoms} T_i    (4)

Similarly, the potential energy E_pot is the sum of all the potential energies per atom:

E_pot = Σ_{i=1}^{Natoms} V_i    (5)

The code to build E_kin and E_pot is very similar: only the names of the variables change, and it is convenient to write the code using a unique template for both quantities, as shown in
BEGIN_PROVIDER [ double precision, T, (Natoms) ]
  BEGIN_DOC
! Kinetic energy per atom
  END_DOC
  integer :: i
  do i=1,Natoms
    T(i) = 0.5d0 * mass(i) * velocity2(i)
  enddo
END_PROVIDER

BEGIN_PROVIDER [ double precision, velocity2, (Natoms) ]
  BEGIN_DOC
! Square of the norm of the velocity per atom
  END_DOC
  integer :: i, k
  do i=1,Natoms
    velocity2(i) = 0.d0
    do k=1,3
      velocity2(i) = velocity2(i) + velocity(k,i)**2
    enddo
  enddo
END_PROVIDER

BEGIN_PROVIDER [ double precision, velocity, (3,Natoms) ]
  BEGIN_DOC
! Velocity vector per atom
  END_DOC
  integer :: i, k
  do i=1,Natoms
    do k=1,3
      velocity(k,i) = 0.d0
    enddo
  enddo
END_PROVIDER
FIG. 8: Definition of the kinetic energy.
program print_the_date
BEGIN_SHELL [ /bin/sh ]
echo print *, \'Compiled by $USER on `date`\'
END_SHELL
end program
FIG. 9: Embedded shell script which gets the date of compilation.

figure 10. In this way, adding a new property which is the sum over all the atomic properties can be done in only one line of code: adding the triplet (Property, Documentation, Atomic property) to the entities list of the script.
F. Changing the value of an entity by a controlled side effect

Many computer simulation programs contain iterative processes. In an iterative process, the same function has to be calculated at each step, but with different arguments. In our IRPF90 environment, at every iteration the production tree is the same, but the values of some entities change. To keep the program correct, if the value of one entity is changed it has to be tagged as built with its new value, and all the entities which depend on this
BEGIN_SHELL [ /usr/bin/python ]
template = """
BEGIN_PROVIDER [ double precision, %(entity)s ]
 BEGIN_DOC
! %(doc)s
 END_DOC
 integer :: i
 %(entity)s = 0.
 do i=1,Natoms
   %(entity)s = %(entity)s + %(e_array)s(i)
 enddo
END_PROVIDER
"""
entities = [ ("E_pot", "Potential Energy", "V"),
             ("E_kin", "Kinetic Energy", "T") ]
for e in entities:
  dictionary = { "entity": e[0], "doc": e[1], "e_array": e[2] }
  print template%dictionary
END_SHELL
FIG. 10: Providers of the Lennard-Jones potential energy and the kinetic energy using a template.
entity (directly or indirectly) need to be tagged as not built. These last entities will need to be recomputed during the new iteration. This mechanism is achieved automatically by the IRPF90 preprocessor using the keyword TOUCH. The side effect modifying the value of the entity is controlled, and the program will stay consistent with the change everywhere in the rest of the code.

In our program, we are now able to compute the kinetic and potential energies of the system. The next step is to implement the dynamics. We choose to use the velocity Verlet algorithm[11]:
r_{n+1} = r_n + v_n Δt + a_n Δt^2/2    (6)

v_{n+1} = v_n + (1/2)(a_n + a_{n+1}) Δt    (7)

where r_n and v_n are respectively the position and velocity vectors at step n, Δt is the time step, and the acceleration vector a is defined as

a = Σ_{i=1}^{Natoms} −(1/m_i) ∇_i E_pot    (8)

The velocity Verlet algorithm is written in a subroutine verlet, and the gradient of the potential energy ∇E_pot can be computed by finite difference (figure 11).
Computing a component i of the numerical gradient of E_pot can be decomposed in six steps:

1. Change the component i of the coordinate: r_i → (r_i + δ)
2. Compute the value of E_pot
3. Change the coordinate: (r_i + δ) → (r_i − δ)
BEGIN_PROVIDER [ double precision, dstep ]
  BEGIN_DOC
! Finite difference step
  END_DOC
  dstep = 1.d-4
END_PROVIDER

BEGIN_PROVIDER [ double precision, V_grad_numeric, (3,Natoms) ]
  implicit none
  BEGIN_DOC
! Numerical gradient of the potential
  END_DOC
  integer :: i, k
  do i=1,Natoms
    do k=1,3
      coord(k,i) = coord(k,i) + dstep
      TOUCH coord
      V_grad_numeric(k,i) = E_pot
      coord(k,i) = coord(k,i) - 2.d0*dstep
      TOUCH coord
      V_grad_numeric(k,i) = &
        ( V_grad_numeric(k,i)-E_pot )/(2.d0*dstep)
      coord(k,i) = coord(k,i) + dstep
    enddo
  enddo
  TOUCH coord
END_PROVIDER

BEGIN_PROVIDER [ double precision, V_grad, (3,Natoms) ]
  BEGIN_DOC
! Gradient of the potential
  END_DOC
  integer :: i,k
  do i=1,Natoms
    do k=1,3
      V_grad(k,i) = V_grad_numeric(k,i)
    enddo
  enddo
END_PROVIDER

FIG. 11: Provider of the gradient of the potential.
4. Compute the value of E_pot
5. Compute the component of the gradient using the two last values of E_pot
6. Reset the coordinate: (r_i − δ) → r_i
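In plain Python, the six steps above correspond to a standard central-difference gradient. This sketch uses an invented quadratic energy function, with delta playing the role of dstep:

```python
# Central finite-difference gradient following the six steps in the text.
def numerical_gradient(energy, coords, delta=1.e-4):
    grad = []
    for i in range(len(coords)):
        coords[i] += delta                 # 1. r_i -> r_i + delta
        e_plus = energy(coords)            # 2. compute E_pot
        coords[i] -= 2.0 * delta           # 3. (r_i + delta) -> (r_i - delta)
        e_minus = energy(coords)           # 4. compute E_pot again
        grad.append((e_plus - e_minus) / (2.0 * delta))   # 5. the component
        coords[i] += delta                 # 6. reset (r_i - delta) -> r_i
    return grad

# Invented test energy: the gradient of x**2 + y**2 at (1, 2) is (2, 4).
g = numerical_gradient(lambda c: c[0]**2 + c[1]**2, [1.0, 2.0])
```

For a quadratic function the central difference is exact up to rounding, which makes this a convenient sanity check of the six-step recipe.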
The provider of V_grad_numeric follows these steps. In the inner loop, the array coord is changed (the component is shifted by +dstep). Touching it with TOUCH coord automatically invalidates E_pot, since E_pot depends indirectly on coord. As the value of E_pot is needed in the next statement and is no longer valid, it is recomputed between the TOUCH and its use. The value of E_pot which is assigned to V_grad_numeric(k,i) is therefore the value of the potential energy consistent with the current set of atomic coordinates. Then, the coordinates are changed again (shifted by −2·dstep), and the program is informed of this change by a second TOUCH coord. When the value of E_pot is used again in the finite-difference expression, it is consistent with the last change of coordinates. The coordinates are then restored (shifted by +dstep), but no touch statement follows immediately. The reason for this choice is efficiency, since two cases are possible for the next instruction: if we are at the last iteration of the loops, we exit them and the final TOUCH coord is executed; otherwise, the next instruction is the first coordinate shift of the following iteration. Touching coord between the restoration and the next shift is not necessary, since no other entity is used in between.

The important point is that the programmer doesn't have to know how E_pot depends on coord. He only has to apply a simple rule: when the value of an entity A is modified, it has to be touched before any other entity B is used. If B depends on A, it will be recomputed; otherwise it will not, and the code will always be correct. Using this method to compute a numerical gradient allows a programmer who is not familiar with the code to compute the gradient of any entity A with respect to any other quantity B, without even knowing if A depends on B. If A does not depend on B, the gradient will automatically be zero. In programs dealing with optimization problems, this is a real advantage: a short script can be written to build automatically all the possible numerical derivatives involving all the entities of the program, as given in figure 12.

The velocity Verlet algorithm can be implemented (figure 13) as follows:

1. Compute the new value of the coordinates
2. Compute the component of the velocities which depends on the old set of coordinates
3. Touch the coordinates and the velocities
4. Increment the velocities by their component which depends on the new set of coordinates
5. Touch the velocities
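The invalidation rule enforced by TOUCH can be sketched in Python. The reversed dependency graph below is a hand-written illustration of the molecular dynamics entities, not something generated by IRPF90:

```python
# Sketch of the TOUCH mechanism: touching an entity tags it as built and
# marks all its direct and indirect dependents as "not built".
needed_by = {               # reverse of the needs relation; illustrative only
    "coord": ["distance"],
    "distance": ["V_lj"],
    "V_lj": ["V"],
    "V": ["E_pot"],
    "E_pot": [],
    "velocity": ["velocity2"],
    "velocity2": ["T"],
    "T": ["E_kin"],
    "E_kin": [],
}
valid = {entity: True for entity in needed_by}

def touch(entity):
    """Tag 'entity' as built with its new value; invalidate its dependents."""
    valid[entity] = True
    stack = list(needed_by[entity])
    while stack:
        e = stack.pop()
        if valid[e]:
            valid[e] = False          # will be rebuilt on next use
            stack.extend(needed_by[e])

touch("coord")
```

After touch("coord"), the whole potential-energy subtree is invalid and will be rebuilt on next use, while the kinetic-energy subtree, which does not depend on coord, remains valid.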
G. Other features
As IRPF90 is designed for HPC, conditional compilation is an essential requirement. Indeed, it is often used for activating and deactivating blocks of code defining the behavior of the program in a parallel environment. This is achieved by the IRP_IF ... IRP_ELSE ... IRP_ENDIF constructs. In figure 14, the checkpointing block is activated by running irpf90 -DCHECKPOINT. If the -D option is not present, the other block is activated.

The current state of the production tree can be written to disk using the IRP_WRITE statement, as in figure 14. For each entity in the subtrees of E_pot and E_kin, a file is created with the name of the entity, which contains the value of the entity. The subtree can be loaded again later using the IRP_READ statement. This functionality is particularly useful for quickly adding a checkpointing feature to an existing program.

The PROVIDE keyword was added to imperatively assign a {needs/needed by} relation between two entities. This keyword can be used to associate the value of an entity with an iteration number in an iterative process, or to help the preprocessor produce more efficient code.

A last convenient feature was added: the declarations of the local variables do not need to be located before the first executable statement anymore. The local variables can now be declared anywhere inside the providers, subroutines and functions; the IRPF90 preprocessor will move them to the beginning of the subroutines or functions for the programmer. This allows the user to declare the variables where the reader needs to know what they correspond to.
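The effect of IRP_WRITE / IRP_READ can be sketched in Python: one file per entity, named after the entity and holding its value. The json format and the entity values below are illustrative choices, not what IRPF90 actually writes:

```python
# Sketch of IRP_WRITE / IRP_READ: dump each entity of a subtree to a file
# named after it, then load the subtree back. Values are invented.
import json, os, tempfile

def irp_write(entities, directory):
    for name, value in entities.items():
        with open(os.path.join(directory, name), "w") as f:
            json.dump(value, f)

def irp_read(names, directory):
    restored = {}
    for name in names:
        with open(os.path.join(directory, name)) as f:
            restored[name] = json.load(f)
    return restored

checkpoint_dir = tempfile.mkdtemp()
irp_write({"E_pot": -13.6, "E_kin": 3.2}, checkpoint_dir)
state = irp_read(["E_pot", "E_kin"], checkpoint_dir)
```

Restoring the files tags the corresponding entities as built, so a restarted run can resume without recomputing the whole subtree.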
BEGIN_SHELL [ /usr/bin/python ]
# Read the names of the entities and their dimensions
dims = {}
import os
for filename in os.listdir('.'):
  if filename.endswith('.irp.f'):
    file = open(filename,'r')
    for line in file:
      if "%" not in line:
        if line.strip().lower().startswith('begin_provider'):
          buffer = line.split(',',2)
          name = buffer[1].split(']')[0].strip()
          if len(buffer) == 2:
            dims[name] = []
          else:
            dims[name] = buffer[2]
            for c in "()] \n":
              dims[name] = dims[name].replace(c,"")
            dims[name] = dims[name].split(",")
    file.close()

# The template to use for the code generation
template = """
BEGIN_PROVIDER [ double precision, grad_%(var1)s_%(var2)s%(dims2)s ]
 BEGIN_DOC
! Gradient of %(var1)s with respect to %(var2)s
 END_DOC
 integer :: %(all_i)s
 double precision :: two_dstep
 two_dstep = dstep + dstep
%(do)s
 %(var2)s%(indice)s = %(var2)s%(indice)s + dstep
 TOUCH %(var2)s
 grad_%(var1)s_%(var2)s%(indice)s = %(var1)s
 %(var2)s%(indice)s = %(var2)s%(indice)s - two_dstep
 TOUCH %(var2)s
 grad_%(var1)s_%(var2)s%(indice)s = &
   (grad_%(var1)s_%(var2)s%(indice)s - %(var1)s)/two_dstep
 %(var2)s%(indice)s = %(var2)s%(indice)s + dstep
%(enddo)s
 TOUCH %(var2)s
END_PROVIDER
"""

# Generate all possibilities of d(v1)/d(v2), with v1 scalar
for v1 in dims.keys():
  if dims[v1] == []:
    for v2 in dims.keys():
      if v2 != v1:
        do = ""
        enddo = ""
        if dims[v2] == []:
          dims2 = ""
          all_i = "i"
          indice = ""
        else:
          dims2 = ', ('+','.join(dims[v2])+')'
          all_i = ','.join([ "i"+str(k) for k in range(len(dims[v2])) ])
          indice = "("
          for k,d in enumerate(dims[v2]):
            i = "i"+str(k)
            do = " do "+i+" = 1,"+d+"\n"+do
            enddo += " enddo\n"
            indice += i+","
          indice = indice[:-1]+")"
        dictionary = { "var1" : v1, "var2" : v2, "dims2" : dims2,
          "all_i" : all_i, "do" : do, "indice": indice, "enddo" : enddo }
        print template%dictionary
END_SHELL
FIG. 12: Automatic generation of all possible gradients of scalar entities with respect to all other entities.
BEGIN_PROVIDER [ integer, Nsteps ]
 BEGIN_DOC
! Number of steps for the dynamics
 END_DOC
 print *, 'Nsteps?'
 read(*,*) Nsteps
 ASSERT (Nsteps > 0)
END_PROVIDER

 BEGIN_PROVIDER [ double precision, tstep ]
&BEGIN_PROVIDER [ double precision, tstep2 ]
 BEGIN_DOC
! Time step for the dynamics
 END_DOC
 print *, 'Time step?'
 read(*,*) tstep
 ASSERT (tstep > 0.)
 tstep2 = tstep*tstep
END_PROVIDER

BEGIN_PROVIDER [ double precision, acceleration, (3,Natoms) ]
 implicit none
 BEGIN_DOC
! Acceleration = -grad(V)/m
 END_DOC
 integer :: i, k
 do i=1,Natoms
   do k=1,3
     acceleration(k,i) = -V_grad(k,i)/mass(i)
   enddo
 enddo
END_PROVIDER

subroutine verlet
 implicit none
 integer :: is, i, k
 do is=1,Nsteps
   do i=1,Natoms
     do k=1,3
       coord(k,i) = coord(k,i) + tstep*velocity(k,i) + &
           0.5*tstep2*acceleration(k,i)
       velocity(k,i) = velocity(k,i) + 0.5*tstep* &
           acceleration(k,i)
     enddo
   enddo
   TOUCH coord velocity
   do i=1,Natoms
     do k=1,3
       velocity(k,i) = velocity(k,i) + 0.5*tstep* &
           acceleration(k,i)
     enddo
   enddo
   TOUCH velocity
   call print_data(is)
 enddo
end subroutine

FIG. 13: The velocity Verlet algorithm.
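The figure above follows the standard drift / half-kick structure of velocity Verlet[11]: the coordinates are advanced, the velocities receive half of the kick, the acceleration is recomputed (triggered by TOUCH coord), and the velocities receive the second half. As a cross-check, here is the same scheme in plain Python for a one-dimensional harmonic oscillator (the potential and parameters are illustrative, not taken from the paper):

```python
# Velocity Verlet for one particle in a harmonic potential V(x) = 0.5*k*x^2,
# mirroring the drift / half-kick / recompute / half-kick structure of the
# verlet subroutine of Fig. 13 (mass m = 1, force constant k = 1 assumed).
def accel(x, k=1.0, m=1.0):
    return -k * x / m                            # acceleration = -grad(V)/m

def verlet(x, v, tstep, nsteps):
    a = accel(x)
    for _ in range(nsteps):
        x = x + tstep * v + 0.5 * tstep**2 * a   # drift with current velocity
        v = v + 0.5 * tstep * a                  # first half-kick
        a = accel(x)                             # TOUCH coord: acceleration is rebuilt
        v = v + 0.5 * tstep * a                  # second half-kick
    return x, v

x, v = verlet(1.0, 0.0, tstep=0.01, nsteps=1000)
energy = 0.5 * v * v + 0.5 * x * x               # conserved up to O(tstep^2)
```

Starting from x = 1, v = 0 the total energy is 0.5, and the integrator keeps it constant to within the expected O(tstep^2) drift.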
program dynamics
 call verlet
IRP_IF CHECKPOINT
 print *, 'Checkpoint'
 IRP_WRITE E_pot
 IRP_WRITE E_kin
IRP_ELSE
 print *, 'No checkpoint'
IRP_ENDIF
end
FIG. 14: The main program.
III. EFFICIENCY OF THE GENERATED CODE
In the laboratory, we are currently rewriting a quantum Monte Carlo (QMC) program, named QMC=Chem, with the IRPF90 tool. The same computation was performed with the old code (plain Fortran) and the new code (IRPF90). Both codes were compiled with the Intel Fortran compiler version 11.1 using the same options.
A benchmark was run on an Intel Xeon 5140 processor. The IRPF90 code is faster than the old code by a factor of 1.60: the CPU time of the IRPF90 executable is 62% of the CPU time of the old code. This time reduction is mainly due to the avoidance of recomputing quantities that have already been computed, which reduces the total number of processor instructions. The average number of instructions per processor cycle is 1.47 for the old code and 1.81 for the IRPF90 code. This shows that even if the unnecessary computations were removed from the old code, the code produced by IRPF90 would still be more efficient. The reason is that IRPF90 guides the programmer toward writing efficient code: the providers are small subroutines that manipulate a very limited number of memory locations. This coding style improves the temporal locality of the code[12] and thus minimizes the number of cache misses. The conclusion of this real-size application is that the overhead due to the management of the production tree is negligible compared to the efficiency gained by avoiding the repeated computation of the same quantity, and by helping the Fortran compiler to produce optimized code.
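The performance gain comes from each quantity being built at most once per validity period: a provider caches its value and only rebuilds it after an invalidation. A toy Python sketch of this mechanism (a simplified stand-in for illustration, not the code IRPF90 actually generates):

```python
# A toy "provider": the value is computed on first access, cached, and
# recomputed only after invalidation (the analogue of TOUCH in IRPF90).
class Provider:
    def __init__(self, build):
        self.build = build     # function that builds the value
        self.valid = False
        self.value = None
        self.calls = 0         # count real computations, for illustration

    def get(self):
        if not self.valid:     # build only when the cached value is stale
            self.value = self.build()
            self.valid = True
            self.calls += 1
        return self.value

    def touch(self):           # invalidate the cached value, like TOUCH
        self.valid = False

energy = Provider(lambda: 2.0 + 3.0)
a = energy.get()               # computed here
b = energy.get()               # served from the cache, no recomputation
energy.touch()
c = energy.get()               # recomputed after invalidation
```

After this sequence the build function has run exactly twice, however many times the value was accessed.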
IV. SUMMARY
The IRPF90 environment is proposed for writing programs with reduced complexity. This technique for writing programs, called Implicit Reference to Parameters (IRP),[7] conforms to the recommendations of the Open Structure Interfaceable Programming Environment (OSIPE)[8]:

• Open: Unambiguous identification of, and access to, any entity anywhere in the program
• Interfaceable: Easy addition of any new feature to an existing code
• Structured: The additions will have no effect on the program logic
The programming paradigm uses some ideas of functional programming and thus clarifies the correspondence between the mathematical formulas and the code. Therefore, scientists do not need to be experts in programming to write clear, reusable and efficient code, as shown with the simple molecular dynamics code presented in this paper. The consequences of the locality of the code are multiple:
• the code is efficient, since the temporal locality is increased,
• the overlap between pieces of code written simultaneously by multiple developers is reduced,
• regression testing[13] can be achieved by writing, for each entity, a program which tests that the entity is built correctly.
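A per-entity regression test can be as small as a program that builds one entity from scratch and compares it against a stored reference. A hypothetical Python sketch (the entity name, inputs and reference value are invented for illustration):

```python
# Regression test for one entity: rebuild it and compare against a stored
# reference value (entity "e_kin" and its reference are hypothetical).
def provide_e_kin(velocity, mass):
    # kinetic energy entity: sum over particles of 0.5*m*v^2
    return sum(0.5 * m * v * v for m, v in zip(mass, velocity))

def test_e_kin():
    reference = 7.0            # value recorded when the entity was validated
    value = provide_e_kin(velocity=[1.0, 2.0], mass=[2.0, 3.0])
    assert abs(value - reference) < 1e-12, "entity e_kin is not built correctly"

test_e_kin()                   # raises AssertionError on any regression
```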
Finally, let us mention that the IRPF90 preprocessor generates Fortran 90 which is fully compatible with standard subroutines and functions. Therefore the produced Fortran code can be compiled on any architecture, and the usual HPC libraries (BLAS[14], LAPACK[15], MPI[16], ...) can be used. The IRPF90 program can be downloaded from http://irpf90.sourceforge.net

Acknowledgments
The author would like to acknowledge F. Colonna (CNRS, Paris) for teaching him the IRP method, and for long discussions on this subject. The author would also like to thank P. Reinhardt (Université Pierre et Marie Curie, Paris) for testing and enjoying the IRPF90 tool, and F. Spiegelman (Université Paul Sabatier, Toulouse) for discussions about the molecular dynamics code.
[1] Stroustrup B., The C++ Programming Language, Ed: Addison-Wesley Pub Co; 3rd edition (2000).
[2] Hudak P., ACM Comput. Surv. 21(3), 359 (1989).
[3] http://www.python.org/
[4] Lennard-Jones J. E., Proceedings of the Physical Society 43, 461 (1931).
[5] Michie D., Nature 218, 19 (1968).
[6] Hughes R.J.M., Lazy memo functions, in: G. Goos and J. Hartmanis, eds., Proc. Conf. on Functional Programming and Computer Architecture, Nancy, France, September 1985, Springer Lecture Note Series, Vol. 201 (Springer, Berlin, 1985).
[7] http://galileo.lct.jussieu.fr/~frames/mediawiki/index.php/IRP_Programming_Presentation
[8] Colonna F., Jolly L.-H., Poirier R. A., Ángyán J. G., and Jansen G., Comp. Phys. Comm. 81(3), 293 (1994).
[9] Hoare C.A.R., Commun. ACM 12(10), 576 (1969).
[10] http://gcc.gnu.org/fortran/
[11] Swope W. C., Andersen H. C., Berens P. H., and Wilson K. R., J. Chem. Phys. 76, 637 (1982).
[12] Denning P. J., Commun. ACM 48(7), 19 (2005).
[13] Agrawal H., Horgan J. R., Krauser E. W., London S., Incremental regression testing, in: Proceedings of the IEEE Conference on Software Maintenance, 348 (1993).
[14] Blackford L. S., Demmel J., Dongarra J., Duff I., Hammarling S., Henry G., Heroux M., Kaufman L., Lumsdaine A., Petitet A., Pozo R., Remington K., Whaley R. C., ACM Trans. Math. Soft. 28(2), 135 (2002).
[15] Anderson E., Bai Z., Bischof C., Blackford S., Demmel J., Dongarra J., Du Croz J., Greenbaum A., Hammarling S., McKenney A. and Sorensen D., LAPACK Users' Guide, Ed: Society for Industrial and Applied Mathematics, Philadelphia, PA (1999).
[16] Gropp W., Lusk E., Doss N. and Skjellum A., Parallel Computing 22(6), 789 (1996).