The default algorithm thresholds in MVAPICH are different from those in Open MPI. Using the tuned collectives component in Open MPI, you can configure the Open MPI Alltoall thresholds to match the MVAPICH defaults. The following MCA parameters tell Open MPI to use custom rules defined in a plain-text configuration file:
"--mca coll_tuned_use_dynamic_rules 1 --mca coll_tuned_dynamic_rules_filename <path to rules file>"

Here is an example dynamic_rules_filename that should make the Open MPI Alltoall tuning similar to MVAPICH:
1           # number of collectives described in this file
3           # ID = 3, the Alltoall collective (IDs are listed in coll_tuned.h)
1           # number of comm sizes
64          # comm size 64
2           # number of message sizes
0 3 0 0     # from message size 0: algorithm 3 (bruck), topo 0, segmentation 0
8192 2 0 0  # from message size 8192: algorithm 2 (pairwise), topo 0, segmentation 0
# end of first collective
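
To use these rules, pass both parameters on the mpirun command line, for example (the rules file name, process count and application name below are just placeholders):

  mpirun --mca coll_tuned_use_dynamic_rules 1 \
         --mca coll_tuned_dynamic_rules_filename ./alltoall_rules.conf \
         -np 64 ./your_app

"ompi_info --param coll tuned" will list the tuned-collective parameters, including the available Alltoall algorithm IDs.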


Thanks,
Pasha

Peter Kjellstrom wrote:
On Tuesday 19 May 2009, Peter Kjellstrom wrote:
On Tuesday 19 May 2009, Roman Martonak wrote:
On Tue, May 19, 2009 at 3:29 PM, Peter Kjellstrom <c...@nsc.liu.se> wrote:
On Tuesday 19 May 2009, Roman Martonak wrote:
...

openmpi-1.3.2                           time per one MD step is 3.66 s
   ELAPSED TIME :    0 HOURS  1 MINUTES 25.90 SECONDS
 = ALL TO ALL COMM           102033. BYTES               4221.  =
 = ALL TO ALL COMM             7.802  MB/S          55.200 SEC  =
...

With TASKGROUP=2 the summary looks as follows
...

 = ALL TO ALL COMM           231821. BYTES               4221.  =
 = ALL TO ALL COMM            82.716  MB/S          11.830 SEC  =
Wow, according to this it takes about 1/5th the time (11.830 s vs. 55.200 s) to do the same number (4221) of alltoalls when the size is roughly doubled... that is about ten times better effective bandwidth (82.716 vs. 7.802 MB/s) with the larger transfer size.

Something is not quite right, could you possibly try to run just the
alltoalls like I suggested in my previous e-mail?

I was curious, so I ran some tests. First, it seems that the size reported by CPMD is the total size of the data buffer, not the per-message size. Running alltoalls with 231821/64 and 102033/64 bytes gives this (on a similar setup):

bw for   4221    x 1595 B :  36.5 Mbytes/s       time was:  23.3 s
bw for   4221    x 3623 B : 125.4 Mbytes/s       time was:  15.4 s
bw for   4221    x 1595 B :  36.4 Mbytes/s       time was:  23.3 s
bw for   4221    x 3623 B : 125.6 Mbytes/s       time was:  15.3 s

So it does seem that OpenMPI has a problem with small alltoalls. Something is obviously broken when the data gets across faster by sending more of it...
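
For reference, a minimal sketch of this kind of alltoall timing loop (not the actual test program behind the numbers above; the file name and the bandwidth accounting, which counts bytes both sent to and received from the other ranks, are assumptions):

/* bench_a2a.c - minimal MPI_Alltoall timing sketch (assumed name; not the
 * actual test program used for the figures above).
 * Usage: mpirun -np 64 ./bench_a2a <bytes per peer> <iterations> */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    int rank, nprocs, i;
    int msg_size = (argc > 1) ? atoi(argv[1]) : 1595; /* bytes sent to each peer */
    int iters    = (argc > 2) ? atoi(argv[2]) : 4221; /* number of alltoall calls */

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    char *sendbuf = calloc((size_t)msg_size * nprocs, 1);
    char *recvbuf = calloc((size_t)msg_size * nprocs, 1);

    MPI_Barrier(MPI_COMM_WORLD);
    double t0 = MPI_Wtime();
    for (i = 0; i < iters; i++)
        MPI_Alltoall(sendbuf, msg_size, MPI_BYTE,
                     recvbuf, msg_size, MPI_BYTE, MPI_COMM_WORLD);
    MPI_Barrier(MPI_COMM_WORLD);
    double t = MPI_Wtime() - t0;

    /* Per-rank bandwidth, counting bytes both sent to and received from the
     * other nprocs-1 ranks (this accounting is an assumption). */
    if (rank == 0)
        printf("bw for %d x %d B : %.1f Mbytes/s   time was: %.1f s\n",
               iters, msg_size,
               2.0 * iters * msg_size * (nprocs - 1) / t / 1e6, t);

    free(sendbuf);
    free(recvbuf);
    MPI_Finalize();
    return 0;
}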

As a reference, I ran with a commercial MPI using the same program and node set (I did not have MVAPICH or Intel MPI on this system):

bw for   4221    x 1595 B :  71.4 Mbytes/s       time was:  11.9 s
bw for   4221    x 3623 B : 125.8 Mbytes/s       time was:  15.3 s
bw for   4221    x 1595 B :  71.1 Mbytes/s       time was:  11.9 s
bw for   4221    x 3623 B : 125.5 Mbytes/s       time was:  15.3 s

To see where OpenMPI falls over, I ran with increasing packet sizes:

bw for   10      x 2900 B :  59.8 Mbytes/s       time was:  61.2 ms
bw for   10      x 2925 B :  59.2 Mbytes/s       time was:  62.2 ms
bw for   10      x 2950 B :  59.4 Mbytes/s       time was:  62.6 ms
bw for   10      x 2975 B :  58.5 Mbytes/s       time was:  64.1 ms
bw for   10      x 3000 B : 113.5 Mbytes/s       time was:  33.3 ms
bw for   10      x 3100 B : 116.1 Mbytes/s       time was:  33.6 ms

The problem seems to affect packets with 1000 bytes < size < 3000 bytes, with a hard edge at 3000 bytes. Your CPMD run was communicating at more or less the worst-case packet size.

These are the figures for my "reference" MPI:

bw for   10      x 2900 B : 110.3 Mbytes/s       time was:  33.1 ms
bw for   10      x 2925 B : 110.4 Mbytes/s       time was:  33.4 ms
bw for   10      x 2950 B : 111.5 Mbytes/s       time was:  33.3 ms
bw for   10      x 2975 B : 112.4 Mbytes/s       time was:  33.4 ms
bw for   10      x 3000 B : 118.2 Mbytes/s       time was:  32.0 ms
bw for   10      x 3100 B : 114.1 Mbytes/s       time was:  34.2 ms

Setup-details:
hw: dual-socket quad-core Harpertowns with ConnectX IB and a 1:1 two-level tree
sw: CentOS-5.3 x86_64 with OpenMPI-1.3b2 (did not have time to try 1.3.2) on OFED from CentOS (1.3.2-ish, I think).

/Peter