On Tuesday 19 May 2009, Peter Kjellstrom wrote:
> On Tuesday 19 May 2009, Roman Martonak wrote:
> > On Tue, May 19, 2009 at 3:29 PM, Peter Kjellstrom <c...@nsc.liu.se> wrote:
> > > On Tuesday 19 May 2009, Roman Martonak wrote:
> > > ...
> > >
> > >> openmpi-1.3.2                           time per one MD step is 3.66 s
> > >>    ELAPSED TIME :    0 HOURS  1 MINUTES 25.90 SECONDS
> > >>  = ALL TO ALL COMM           102033. BYTES               4221.  =
> > >>  = ALL TO ALL COMM             7.802  MB/S          55.200 SEC  =
>
> ...
>
> > With TASKGROUP=2 the summary looks as follows
>
> ...
>
> >  = ALL TO ALL COMM           231821. BYTES               4221.  =
> >  = ALL TO ALL COMM            82.716  MB/S          11.830 SEC  =
>
> Wow, according to this it takes 1/5th the time to do the same number (4221)
> of alltoalls if the size is (roughly) doubled... (ten times better
> performance with the larger transfer size)
>
> Something is not quite right, could you possibly try to run just the
> alltoalls like I suggested in my previous e-mail?

I was curious, so I ran some tests. First, it seems that the size reported by 
CPMD is the total size of the data buffer, not the per-rank message size. 
Running alltoalls with per-rank messages of 231821/64 and 102033/64 bytes 
(i.e. roughly 3623 B and 1595 B over 64 ranks) gives this on a similar setup 
(a sketch of the test loop follows after the numbers):

bw for   4221    x 1595 B :  36.5 Mbytes/s       time was:  23.3 s
bw for   4221    x 3623 B : 125.4 Mbytes/s       time was:  15.4 s
bw for   4221    x 1595 B :  36.4 Mbytes/s       time was:  23.3 s
bw for   4221    x 3623 B : 125.6 Mbytes/s       time was:  15.3 s
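
For reference, this is roughly what my test does. It is a sketch written for
this e-mail, not the exact benchmark source; it assumes 64 ranks (matching
the /64 above), and the exact MB/s accounting behind my numbers may differ
slightly (e.g. by counting both sent and received data):

/* alltoall_bw.c -- time repeated MPI_Alltoall calls
 * build: mpicc -O2 alltoall_bw.c -o alltoall_bw
 * run:   mpirun -np 64 ./alltoall_bw 1595 4221
 */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* time 'nreps' alltoalls with 'msgsize' bytes per rank pair and
 * print one line like the ones above from rank 0 */
static void time_alltoall(int msgsize, int nreps)
{
    int rank, nranks, i;
    char *sbuf, *rbuf;
    double t0, t, mbytes;

    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nranks);

    sbuf = malloc((size_t)msgsize * nranks);
    rbuf = malloc((size_t)msgsize * nranks);
    memset(sbuf, 1, (size_t)msgsize * nranks);

    /* one untimed call so connection setup is not included */
    MPI_Alltoall(sbuf, msgsize, MPI_BYTE, rbuf, msgsize, MPI_BYTE,
                 MPI_COMM_WORLD);

    MPI_Barrier(MPI_COMM_WORLD);
    t0 = MPI_Wtime();
    for (i = 0; i < nreps; i++)
        MPI_Alltoall(sbuf, msgsize, MPI_BYTE, rbuf, msgsize, MPI_BYTE,
                     MPI_COMM_WORLD);
    MPI_Barrier(MPI_COMM_WORLD);
    t = MPI_Wtime() - t0;

    /* data sent by one rank per call, times nreps; the figures above
     * may be accounted differently (e.g. send + receive) */
    mbytes = (double)nreps * nranks * msgsize / 1e6;
    if (rank == 0)
        printf("bw for %d x %d B : %6.1f Mbytes/s   time was: %6.1f s\n",
               nreps, msgsize, mbytes / t, t);

    free(sbuf);
    free(rbuf);
}

int main(int argc, char **argv)
{
    int msgsize, nreps;

    MPI_Init(&argc, &argv);
    msgsize = (argc > 1) ? atoi(argv[1]) : 1595;  /* bytes per rank pair */
    nreps   = (argc > 2) ? atoi(argv[2]) : 4221;  /* number of alltoalls */
    time_alltoall(msgsize, nreps);
    MPI_Finalize();
    return 0;
}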

So it does seem that OpenMPI has some problems with small alltoalls. Something 
is clearly broken when you can get the data across faster by sending more of 
it...

As a reference, I ran the same program on the same node set with a commercial 
MPI (I had neither MVAPICH nor IntelMPI on this system):

bw for   4221    x 1595 B :  71.4 Mbytes/s       time was:  11.9 s
bw for   4221    x 3623 B : 125.8 Mbytes/s       time was:  15.3 s
bw for   4221    x 1595 B :  71.1 Mbytes/s       time was:  11.9 s
bw for   4221    x 3623 B : 125.5 Mbytes/s       time was:  15.3 s

To see where OpenMPI falls over, I ran with an increasing packet size (same 
timing loop; an alternative main() for the sweep follows after the numbers):

bw for   10      x 2900 B :  59.8 Mbytes/s       time was:  61.2 ms
bw for   10      x 2925 B :  59.2 Mbytes/s       time was:  62.2 ms
bw for   10      x 2950 B :  59.4 Mbytes/s       time was:  62.6 ms
bw for   10      x 2975 B :  58.5 Mbytes/s       time was:  64.1 ms
bw for   10      x 3000 B : 113.5 Mbytes/s       time was:  33.3 ms
bw for   10      x 3100 B : 116.1 Mbytes/s       time was:  33.6 ms
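
The sweep is the same timing loop as before; in terms of the sketch above,
main() just steps through the sizes with nreps = 10 instead of taking them
from the command line (time_alltoall() here is the function from that sketch,
not anything from CPMD or OpenMPI):

/* alternative main() for the sketch above: sweep the packet size
 * around the edge instead of reading it from the command line */
int main(int argc, char **argv)
{
    int sizes[] = { 2900, 2925, 2950, 2975, 3000, 3100 };
    int i;

    MPI_Init(&argc, &argv);
    for (i = 0; i < (int)(sizeof(sizes) / sizeof(sizes[0])); i++)
        time_alltoall(sizes[i], 10);   /* 10 repetitions per size */
    MPI_Finalize();
    return 0;
}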

The problem seems to hit packets with 1000 bytes < size < 3000 bytes, with a 
hard edge at 3000 bytes. Your CPMD run was communicating at more or less the 
worst-case packet size.

These are the figures for my "reference" MPI:

bw for   10      x 2900 B : 110.3 Mbytes/s       time was:  33.1 ms
bw for   10      x 2925 B : 110.4 Mbytes/s       time was:  33.4 ms
bw for   10      x 2950 B : 111.5 Mbytes/s       time was:  33.3 ms
bw for   10      x 2975 B : 112.4 Mbytes/s       time was:  33.4 ms
bw for   10      x 3000 B : 118.2 Mbytes/s       time was:  32.0 ms
bw for   10      x 3100 B : 114.1 Mbytes/s       time was:  34.2 ms

Setup details:
hw: dual-socket quad-core Harpertowns with ConnectX IB and a 1:1 two-level tree
sw: CentOS-5.3 x86_64 with OpenMPI-1.3b2 (I did not have time to try 1.3.2) on 
the OFED that ships with CentOS (1.3.2-ish, I think).

/Peter
