On Tuesday 19 May 2009, Peter Kjellstrom wrote:
> On Tuesday 19 May 2009, Roman Martonak wrote:
> > On Tue, May 19, 2009 at 3:29 PM, Peter Kjellstrom <c...@nsc.liu.se> wrote:
> > > On Tuesday 19 May 2009, Roman Martonak wrote:
> > > ...
> > >> openmpi-1.3.2                      time per one MD step is 3.66 s
> > >>    ELAPSED TIME :    0 HOURS  1 MINUTES 25.90 SECONDS
> > >>  = ALL TO ALL COMM           102033. BYTES               4221. =
> > >>  = ALL TO ALL COMM             7.802 MB/S          55.200 SEC  =
> > ...
> > With TASKGROUP=2 the summary looks as follows
> > ...
> >  = ALL TO ALL COMM           231821. BYTES               4221. =
> >  = ALL TO ALL COMM            82.716 MB/S          11.830 SEC  =
>
> Wow, according to this it takes 1/5th the time to do the same number (4221)
> of alltoalls if the size is (roughly) doubled... (ten times better
> performance with the larger transfer size)
>
> Something is not quite right, could you possibly try to run just the
> alltoalls like I suggested in my previous e-mail?
I was curious, so I ran some tests. First, it seems that the size reported by
CPMD is the total size of the data buffer, not the message size. Running
alltoalls with 231821/64 and 102033/64 bytes per message (roughly 3623 B and
1595 B) gives this on a similar setup:

 bw for  4221 x 1595 B :  36.5 Mbytes/s   time was: 23.3 s
 bw for  4221 x 3623 B : 125.4 Mbytes/s   time was: 15.4 s
 bw for  4221 x 1595 B :  36.4 Mbytes/s   time was: 23.3 s
 bw for  4221 x 3623 B : 125.6 Mbytes/s   time was: 15.3 s

So it does seem that OpenMPI has some problems with small alltoalls. Something
is obviously broken when you can get the data across faster by sending more...

As a reference I ran the same program and node-set with a commercial MPI
(I had neither MVAPICH nor IntelMPI on this system):

 bw for  4221 x 1595 B :  71.4 Mbytes/s   time was: 11.9 s
 bw for  4221 x 3623 B : 125.8 Mbytes/s   time was: 15.3 s
 bw for  4221 x 1595 B :  71.1 Mbytes/s   time was: 11.9 s
 bw for  4221 x 3623 B : 125.5 Mbytes/s   time was: 15.3 s

To see where OpenMPI falls over I ran with increasing packet sizes:

 bw for    10 x 2900 B :  59.8 Mbytes/s   time was: 61.2 ms
 bw for    10 x 2925 B :  59.2 Mbytes/s   time was: 62.2 ms
 bw for    10 x 2950 B :  59.4 Mbytes/s   time was: 62.6 ms
 bw for    10 x 2975 B :  58.5 Mbytes/s   time was: 64.1 ms
 bw for    10 x 3000 B : 113.5 Mbytes/s   time was: 33.3 ms
 bw for    10 x 3100 B : 116.1 Mbytes/s   time was: 33.6 ms

The problem seems to affect packets with 1000 bytes < size < 3000 bytes, with a
hard edge at 3000 bytes. Your CPMD run was communicating at more or less the
worst-case packet size. These are the figures for my "reference" MPI:

 bw for    10 x 2900 B : 110.3 Mbytes/s   time was: 33.1 ms
 bw for    10 x 2925 B : 110.4 Mbytes/s   time was: 33.4 ms
 bw for    10 x 2950 B : 111.5 Mbytes/s   time was: 33.3 ms
 bw for    10 x 2975 B : 112.4 Mbytes/s   time was: 33.4 ms
 bw for    10 x 3000 B : 118.2 Mbytes/s   time was: 32.0 ms
 bw for    10 x 3100 B : 114.1 Mbytes/s   time was: 34.2 ms

Setup details:
 hw: dual-socket quad-core Harpertowns with ConnectX IB and a 1:1 two-level tree
 sw: CentOS-5.3 x86_64 with OpenMPI-1.3b2 (did not have time to try 1.3.2) on
     OFED from CentOS (1.3.2-ish, I think)

/Peter
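
P.S. In case anyone wants to reproduce this, the test does roughly the
following (a simplified sketch, not the exact program I used; the default
message size and repetition count are the ones from the output above, and the
bandwidth formula here, bytes sent plus received per rank excluding self, is
only my guess at a comparable definition):

/* alltoall_bw.c -- simplified MPI_Alltoall timing sketch.
 * Build: mpicc -std=c99 -O2 alltoall_bw.c -o alltoall_bw
 * Run:   mpirun -np 64 ./alltoall_bw 1595 4221
 */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int rank, nprocs;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    int msg  = (argc > 1) ? atoi(argv[1]) : 1595;  /* bytes sent to each rank  */
    int reps = (argc > 2) ? atoi(argv[2]) : 4221;  /* number of alltoall calls */

    char *sendbuf = malloc((size_t)msg * nprocs);
    char *recvbuf = malloc((size_t)msg * nprocs);
    memset(sendbuf, 1, (size_t)msg * nprocs);

    /* one warm-up call, then synchronize before timing */
    MPI_Alltoall(sendbuf, msg, MPI_BYTE, recvbuf, msg, MPI_BYTE, MPI_COMM_WORLD);
    MPI_Barrier(MPI_COMM_WORLD);

    double t0 = MPI_Wtime();
    for (int i = 0; i < reps; i++)
        MPI_Alltoall(sendbuf, msg, MPI_BYTE,
                     recvbuf, msg, MPI_BYTE, MPI_COMM_WORLD);
    double t = MPI_Wtime() - t0;

    if (rank == 0) {
        /* per-rank bandwidth: bytes sent + received to/from the other ranks;
         * this definition may not match the figures above exactly */
        double mbytes = 2.0 * msg * (nprocs - 1) * reps / 1e6;
        printf("bw for %5d x %4d B : %6.1f Mbytes/s   time was: %.1f s\n",
               reps, msg, mbytes / t, t);
    }

    free(sendbuf);
    free(recvbuf);
    MPI_Finalize();
    return 0;
}

Running it once with 1595 and once with 3623 bytes per message on 64 ranks
should show whether your MPI has the same small-message dip.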