Many thanks for the very helpful analysis. Indeed, what Peter describes
seems to be precisely what is happening here. I have now run the 32
waters test on 48 cores, once with the original cutoff of 100 Ry and
once with a slightly increased cutoff of 110 Ry. With the larger cutoff
one step should normally take longer. Increasing the cutoff, however,
also increases the size of the data buffer, and that appears to push
the messages just across the packet-size threshold for the different
behaviour (the test was run with openmpi-1.3.2).

--------------------------------------------------------------------------------------------------------------------------------------------------------
cutoff 100Ry

time per 1 step is 2.869 s

 = ALL TO ALL COMM           151583. BYTES               2211.  =
 = ALL TO ALL COMM            16.741  MB/S          20.020 SEC  =

--------------------------------------------------------------------------------------------------------------------------------------------------------
cutoff 110 Ry

time per 1 step is 1.879 s

 = ALL TO ALL COMM           167057. BYTES               2211.  =
 = ALL TO ALL COMM            43.920  MB/S           8.410 SEC  =

--------------------------------------------------------------------------------------------------------------------------------------------------------
So with the larger cutoff it actually runs much faster, and the ALL TO
ALL COMM rate is about 2.6 times higher. In my case the threshold
therefore seems to lie somewhere between 151583/48 ~ 3158 and
167057/48 ~ 3480 bytes per rank.

I saved the text that Pavel suggested

1 # num of collectives
3 # ID = 3 Alltoall collective (ID in coll_tuned.h)
1 # number of com sizes
64 # comm size 8
2 # number of msg sizes
0 3 0 0 # for message size 0, bruck 1, topo 0, 0 segmentation
8192 2 0 0 # 8k+, pairwise 2, no topo or segmentation
# end of first collective

to the file dyn_rules and tried to run with the options
"--mca use_dynamic_rules 1 --mca dynamic_rules_filename ./dyn_rules"
appended to mpirun, but it makes no difference. Is this the correct
syntax to enable the rules? And will the above sample file shift the
threshold to lower values (and if so, to what value)?
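
Just to check that I am reading the rules file correctly, the decision
logic I think it encodes for alltoall is sketched below (this is only
my understanding of the format, not the actual coll_tuned code; the
algorithm numbers and names are taken from the comments in the file):

/* my reading of the sample dyn_rules file (single comm-size entry):
 * from message size 0 use algorithm 3, from 8192 bytes use algorithm 2 */
#include <stdio.h>

static int alltoall_alg_from_rules(int msg_bytes)
{
    if (msg_bytes >= 8192)
        return 2;   /* "8k+, pairwise" */
    return 3;       /* "for message size 0": bruck */
}

int main(void)
{
    /* the two per-rank message sizes from the runs above */
    const int sizes[] = { 3158, 3480 };
    for (int i = 0; i < 2; i++)
        printf("%d bytes -> algorithm %d\n",
               sizes[i], alltoall_alg_from_rules(sizes[i]));
    return 0;
}

If that reading is right, both of my message sizes would fall under the
first rule, hence my question about where the crossover would end up.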

Best regards

Roman

On Wed, May 20, 2009 at 10:39 AM, Peter Kjellstrom <c...@nsc.liu.se> wrote:
> On Tuesday 19 May 2009, Peter Kjellstrom wrote:
>> On Tuesday 19 May 2009, Roman Martonak wrote:
>> > On Tue, May 19, 2009 at 3:29 PM, Peter Kjellstrom <c...@nsc.liu.se> wrote:
>> > > On Tuesday 19 May 2009, Roman Martonak wrote:
>> > > ...
>> > >
>> > >> openmpi-1.3.2                           time per one MD step is 3.66 s
>> > >>    ELAPSED TIME :    0 HOURS  1 MINUTES 25.90 SECONDS
>> > >>  = ALL TO ALL COMM           102033. BYTES               4221.  =
>> > >>  = ALL TO ALL COMM             7.802  MB/S          55.200 SEC  =
>>
>> ...
>>
>> > With TASKGROUP=2 the summary looks as follows
>>
>> ...
>>
>> >  = ALL TO ALL COMM           231821. BYTES               4221.  =
>> >  = ALL TO ALL COMM            82.716  MB/S          11.830 SEC  =
>>
>> Wow, according to this it takes 1/5th the time to do the same number (4221)
>> of alltoalls if the size is (roughly) doubled... (ten times better
>> performance with the larger transfer size)
>>
>> Something is not quite right, could you possibly try to run just the
>> alltoalls like I suggested in my previous e-mail?
>
> I was curious so I ran some tests. First, it seems that the size reported by
> CPMD is the total size of the data buffer not the message size. Running
> alltoalls with 231821/64 and 102033/64 gives this (on a similar setup):
>
> bw for   4221    x 1595 B :  36.5 Mbytes/s       time was:  23.3 s
> bw for   4221    x 3623 B : 125.4 Mbytes/s       time was:  15.4 s
> bw for   4221    x 1595 B :  36.4 Mbytes/s       time was:  23.3 s
> bw for   4221    x 3623 B : 125.6 Mbytes/s       time was:  15.3 s
>
> So it does seem that OpenMPI has some problems with small alltoalls. It is
> obviously broken when you can get things across faster by sending more...
>
> As a reference I ran with a commercial MPI using the same program and node-set
> (I did not have MVAPICH nor IntelMPI on this system):
>
> bw for   4221    x 1595 B :  71.4 Mbytes/s       time was:  11.9 s
> bw for   4221    x 3623 B : 125.8 Mbytes/s       time was:  15.3 s
> bw for   4221    x 1595 B :  71.1 Mbytes/s       time was:  11.9 s
> bw for   4221    x 3623 B : 125.5 Mbytes/s       time was:  15.3 s
>
> To see when OpenMPI falls over I ran with an increasing packet size:
>
> bw for   10      x 2900 B :  59.8 Mbytes/s       time was:  61.2 ms
> bw for   10      x 2925 B :  59.2 Mbytes/s       time was:  62.2 ms
> bw for   10      x 2950 B :  59.4 Mbytes/s       time was:  62.6 ms
> bw for   10      x 2975 B :  58.5 Mbytes/s       time was:  64.1 ms
> bw for   10      x 3000 B : 113.5 Mbytes/s       time was:  33.3 ms
> bw for   10      x 3100 B : 116.1 Mbytes/s       time was:  33.6 ms
>
> The problem seems to be for packets with 1000Bytes < size < 3000Bytes with a
> hard edge at 3000Bytes. Your CPMD was communicating at more or less the worst
> case packet size.
>
> These are the figures for my "reference" MPI:
>
> bw for   10      x 2900 B : 110.3 Mbytes/s       time was:  33.1 ms
> bw for   10      x 2925 B : 110.4 Mbytes/s       time was:  33.4 ms
> bw for   10      x 2950 B : 111.5 Mbytes/s       time was:  33.3 ms
> bw for   10      x 2975 B : 112.4 Mbytes/s       time was:  33.4 ms
> bw for   10      x 3000 B : 118.2 Mbytes/s       time was:  32.0 ms
> bw for   10      x 3100 B : 114.1 Mbytes/s       time was:  34.2 ms
>
> Setup-details:
> hw: dual socket quad core harpertowns with ConnectX IB and 1:1 2-level tree
> sw: CentOS-5.3 x86_64 with OpenMPI-1.3b2 (did not have time to try 1.3.2) on
> OFED from CentOS (1.3.2-ish I think).
>
> /Peter