The correct MCA parameters are the following:
-mca coll_tuned_use_dynamic_rules 1
-mca coll_tuned_dynamic_rules_filename ./dyn_rules
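For example, appended to mpirun (the binary name and process count here
are just illustrative):

  mpirun -np 48 --mca coll_tuned_use_dynamic_rules 1 \
      --mca coll_tuned_dynamic_rules_filename ./dyn_rules ./your_app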
You can also run the following command:
ompi_info -mca coll_tuned_use_dynamic_rules 1 -param coll tuned
This will give some insight into all the various algorithms that make up
the tuned collectives.
If I am understanding what is happening correctly, the default
MPI_Alltoall decision makes use of three algorithms. (You can look in
coll_tuned_decision_fixed.c)
If message size < 200 and communicator size > 12
    bruck
else if message size < 3000
    basic linear
else
    pairwise
end
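In C terms the fixed decision is roughly this (paraphrased, not copied;
see ompi_coll_tuned_alltoall_intra_dec_fixed() in
coll_tuned_decision_fixed.c for the real thing):

  /* sketch of the default alltoall decision */
  if ((block_size < 200) && (communicator_size > 12)) {
      /* bruck */
  } else if (block_size < 3000) {
      /* basic linear */
  } else {
      /* pairwise */
  }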
With the file Pavel has provided, the decision changes to the following
(maybe someone can confirm):
If message size < 8192
    bruck
else
    pairwise
end
Rolf
On 05/20/09 07:48, Roman Martonak wrote:
Many thanks for the highly helpful analysis. Indeed, what Peter says
seems to be precisely the case here. I tried to run the 32 waters test
on 48 cores now, with the original cutoff of 100 Ry and with a slightly
increased one of 110 Ry. Normally, with a larger cutoff one step should
obviously take more time. Increasing the cutoff, however, also
increases the size of the data buffer, and here it appears to just cross
the packet-size threshold for the different behaviour (the test was run
with openmpi-1.3.2).
--------------------------------------------------------------------------------------------------------------------------------------------------------
cutoff 100Ry
time per 1 step is 2.869 s
= ALL TO ALL COMM 151583. BYTES 2211. =
= ALL TO ALL COMM 16.741 MB/S 20.020 SEC =
--------------------------------------------------------------------------------------------------------------------------------------------------------
cutoff 110 Ry
time per 1 step is 1.879 s
= ALL TO ALL COMM 167057. BYTES 2211. =
= ALL TO ALL COMM 43.920 MB/S 8.410 SEC =
--------------------------------------------------------------------------------------------------------------------------------------------------------
so it actually runs much faster, and ALL TO ALL COMM is 2.6 times
faster. In my case the threshold seems to be somewhere between
151583/48 = 3 157 and 167057/48 = 3 480 bytes.
I saved the text that Pavel suggested
1 # num of collectives
3 # ID = 3 Alltoall collective (ID in coll_tuned.h)
1 # number of comm sizes
64 # comm size
2 # number of msg sizes
0 3 0 0 # from message size 0: bruck (alg 3), topo 0, no segmentation
8192 2 0 0 # from 8k: pairwise (alg 2), no topo or segmentation
# end of first collective
to the file dyn_rules and tried to run with the options
"--mca use_dynamic_rules 1 --mca dynamic_rules_filename ./dyn_rules"
appended to mpirun, but it does not make any change. Is this the correct
syntax to enable the rules? And will the above sample file shift the
threshold to lower values (to what value)?
Best regards
Roman
On Wed, May 20, 2009 at 10:39 AM, Peter Kjellstrom <c...@nsc.liu.se> wrote:
On Tuesday 19 May 2009, Peter Kjellstrom wrote:
On Tuesday 19 May 2009, Roman Martonak wrote:
On Tue, May 19, 2009 at 3:29 PM, Peter Kjellstrom <c...@nsc.liu.se> wrote:
On Tuesday 19 May 2009, Roman Martonak wrote:
...
openmpi-1.3.2 time per one MD step is 3.66 s
ELAPSED TIME : 0 HOURS 1 MINUTES 25.90 SECONDS
= ALL TO ALL COMM 102033. BYTES 4221. =
= ALL TO ALL COMM 7.802 MB/S 55.200 SEC =
...
With TASKGROUP=2 the summary looks as follows
...
= ALL TO ALL COMM 231821. BYTES 4221. =
= ALL TO ALL COMM 82.716 MB/S 11.830 SEC =
Wow, according to this it takes 1/5th the time to do the same number (4221)
of alltoalls if the size is (roughly) doubled... (ten times better
performance with the larger transfer size)
Something is not quite right; could you possibly try to run just the
alltoalls, as I suggested in my previous e-mail?
I was curious so I ran some tests. First, it seems that the size reported
by CPMD is the total size of the data buffer, not the per-rank message
size. Running alltoalls with 231821/64 and 102033/64 bytes gives this (on
a similar setup):
bw for 4221 x 1595 B : 36.5 Mbytes/s time was: 23.3 s
bw for 4221 x 3623 B : 125.4 Mbytes/s time was: 15.4 s
bw for 4221 x 1595 B : 36.4 Mbytes/s time was: 23.3 s
bw for 4221 x 3623 B : 125.6 Mbytes/s time was: 15.3 s
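(The test was essentially a timed loop of MPI_Alltoall calls; a minimal
sketch of such a probe follows, not the exact program I used, with the
argument handling and output format just illustrative:)

  #include <mpi.h>
  #include <stdio.h>
  #include <stdlib.h>

  int main(int argc, char **argv)
  {
      int rank, nprocs, i;
      MPI_Init(&argc, &argv);
      MPI_Comm_rank(MPI_COMM_WORLD, &rank);
      MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

      int msgsize = atoi(argv[1]);   /* bytes sent to each rank, e.g. 1595 */
      int iters   = atoi(argv[2]);   /* number of alltoalls, e.g. 4221 */
      char *sbuf  = malloc((size_t)msgsize * nprocs);
      char *rbuf  = malloc((size_t)msgsize * nprocs);

      MPI_Barrier(MPI_COMM_WORLD);
      double t0 = MPI_Wtime();
      for (i = 0; i < iters; i++)
          MPI_Alltoall(sbuf, msgsize, MPI_BYTE,
                       rbuf, msgsize, MPI_BYTE, MPI_COMM_WORLD);
      double t = MPI_Wtime() - t0;

      /* bandwidth counts bytes both sent and received by one rank */
      if (rank == 0)
          printf("bw for %d x %d B : %.1f Mbytes/s time was: %.1f s\n",
                 iters, msgsize,
                 2.0 * msgsize * nprocs * iters / 1e6 / t, t);

      free(sbuf);
      free(rbuf);
      MPI_Finalize();
      return 0;
  }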
So it does seem that OpenMPI has some problems with small alltoalls. It is
obviously broken when you can get things across faster by sending more...
As a reference I ran with a commercial MPI using the same program and node-set
(I had neither MVAPICH nor IntelMPI on this system):
bw for 4221 x 1595 B : 71.4 Mbytes/s time was: 11.9 s
bw for 4221 x 3623 B : 125.8 Mbytes/s time was: 15.3 s
bw for 4221 x 1595 B : 71.1 Mbytes/s time was: 11.9 s
bw for 4221 x 3623 B : 125.5 Mbytes/s time was: 15.3 s
To see when OpenMPI falls over I ran with an increasing packet size:
bw for 10 x 2900 B : 59.8 Mbytes/s time was: 61.2 ms
bw for 10 x 2925 B : 59.2 Mbytes/s time was: 62.2 ms
bw for 10 x 2950 B : 59.4 Mbytes/s time was: 62.6 ms
bw for 10 x 2975 B : 58.5 Mbytes/s time was: 64.1 ms
bw for 10 x 3000 B : 113.5 Mbytes/s time was: 33.3 ms
bw for 10 x 3100 B : 116.1 Mbytes/s time was: 33.6 ms
The problem seems to be for packets with 1000 bytes < size < 3000 bytes,
with a hard edge at 3000 bytes. Your CPMD was communicating at more or
less the worst-case packet size.
These are the figures for my "reference" MPI:
bw for 10 x 2900 B : 110.3 Mbytes/s time was: 33.1 ms
bw for 10 x 2925 B : 110.4 Mbytes/s time was: 33.4 ms
bw for 10 x 2950 B : 111.5 Mbytes/s time was: 33.3 ms
bw for 10 x 2975 B : 112.4 Mbytes/s time was: 33.4 ms
bw for 10 x 3000 B : 118.2 Mbytes/s time was: 32.0 ms
bw for 10 x 3100 B : 114.1 Mbytes/s time was: 34.2 ms
Setup details:
hw: dual-socket quad-core Harpertowns with ConnectX IB and a 1:1 two-level tree
sw: CentOS-5.3 x86_64 with OpenMPI-1.3b2 (did not have time to try 1.3.2) on
OFED from CentOS (1.3.2-ish, I think).
/Peter
_______________________________________________
users mailing list
us...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/users
--
=========================
rolf.vandeva...@sun.com
781-442-3043
=========================