I tried the settings suggested by Peter, and they do indeed improve
things further. Running on 64 cores with the following line in dyn_rules:

8192 2 0 0 # 8k+, pairwise 2, no topo or segmentation

I get the following:

bw for   100     x 10 B :   1.9 Mbytes/s         time was:  65.4 ms
bw for   100     x 20 B :   3.7 Mbytes/s         time was:  67.9 ms
bw for   100     x 50 B :   9.5 Mbytes/s         time was:  66.5 ms
bw for   100     x 100 B :  20.1 Mbytes/s        time was:  62.8 ms
bw for   100     x 128 B :   1.8 Mbytes/s        time was: 873.2 ms
bw for   100     x 200 B :   2.9 Mbytes/s        time was: 859.6 ms
bw for   100     x 256 B :   3.7 Mbytes/s        time was: 871.0 ms
bw for   100     x 500 B :   7.4 Mbytes/s        time was: 848.6 ms
bw for   100     x 1000 B :  15.5 Mbytes/s       time was: 813.8 ms
bw for   100     x 1500 B :  24.2 Mbytes/s       time was: 780.8 ms
bw for   100     x 2000 B :  34.3 Mbytes/s       time was: 734.6 ms
bw for   100     x 3000 B :  62.2 Mbytes/s       time was: 607.6 ms
bw for   100     x 4096 B : 100.2 Mbytes/s       time was: 515.0 ms
bw for   100     x 5000 B : 116.1 Mbytes/s       time was: 542.5 ms
bw for   100     x 10000 B : 166.9 Mbytes/s      time was: 755.0 ms
bw for   100     x 20000 B :  64.7 Mbytes/s      time was:   3.9 s
bw for   100     x 50000 B : 125.5 Mbytes/s      time was:   5.0 s

So it does indeed seem to switch at 8192/64 = 128 bytes, and the
performance there is poor (though still much better for CPMD than with
the default openmpi-1.3.2 settings).
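
For reference, that threshold line sits inside a coll_tuned dynamic-rules
file, which is enabled with the coll_tuned_use_dynamic_rules and
coll_tuned_dynamic_rules_filename MCA parameters. A complete file for
alltoall on 64 ranks should look roughly like the sketch below; the
collective ID and the bruck algorithm ID are my reading of coll_tuned.h,
so treat them as assumptions and check them against your Open MPI version:

  1            # number of collectives described in this file
  3            # collective ID (alltoall in my coll_tuned.h; check yours)
  1            # number of communicator sizes listed
  64           # the rules below apply to communicators of 64 ranks
  2            # number of message-size rules
  0     3 0 0  # from 0 bytes: bruck (3), no topo or segmentation
  8192  2 0 0  # 8k+, pairwise 2, no topo or segmentation

  mpirun --mca coll_tuned_use_dynamic_rules 1 \
         --mca coll_tuned_dynamic_rules_filename ./dyn_rules ...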

On the other hand, with the line

262144 2 0 0 # 256k+, pairwise 2, no topo or segmentation

(which should put the switch at 262144/64 = 4096 bytes) I get far
better numbers:

bw for   100     x 10 B :   1.9 Mbytes/s         time was:  67.0 ms
bw for   100     x 20 B :   3.8 Mbytes/s         time was:  66.7 ms
bw for   100     x 50 B :   9.6 Mbytes/s         time was:  65.9 ms
bw for   100     x 100 B :  19.7 Mbytes/s        time was:  64.1 ms
bw for   100     x 128 B :  23.5 Mbytes/s        time was:  68.6 ms
bw for   100     x 200 B :  36.6 Mbytes/s        time was:  68.8 ms
bw for   100     x 256 B :  43.7 Mbytes/s        time was:  73.7 ms
bw for   100     x 500 B :  31.4 Mbytes/s        time was: 200.7 ms
bw for   100     x 1000 B :  52.7 Mbytes/s       time was: 239.3 ms
bw for   100     x 1500 B :  72.5 Mbytes/s       time was: 260.9 ms
bw for   100     x 2000 B :  75.1 Mbytes/s       time was: 335.8 ms
bw for   100     x 3000 B :  74.8 Mbytes/s       time was: 505.6 ms
bw for   100     x 4096 B :  99.9 Mbytes/s       time was: 516.4 ms
bw for   100     x 5000 B : 119.3 Mbytes/s       time was: 528.1 ms
bw for   100     x 10000 B : 167.5 Mbytes/s      time was: 752.4 ms
bw for   100     x 20000 B :  64.9 Mbytes/s      time was:   3.9 s
bw for   100     x 50000 B : 126.0 Mbytes/s      time was:   5.0 s
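
For completeness, the bw numbers above come from a simple
repeated-alltoall timing loop. Below is a minimal self-contained sketch
of that kind of test; it is a reconstruction rather than the exact
benchmark, and the bandwidth definition (bytes sent plus received per
rank per repetition) is an assumption, although it seems consistent
with the figures above.

  /* Sketch of a repeated-alltoall bandwidth test of the kind that
   * produces the "bw for ..." lines above.  Reconstruction, not the
   * actual benchmark used here. */
  #include <mpi.h>
  #include <stdio.h>
  #include <stdlib.h>

  int main(int argc, char **argv)
  {
      const int repetitions = 100;
      const int sizes[] = { 10, 100, 128, 256, 1000, 4096, 10000, 50000 };
      const int nsizes = (int)(sizeof(sizes) / sizeof(sizes[0]));
      int rank, nranks;

      MPI_Init(&argc, &argv);
      MPI_Comm_rank(MPI_COMM_WORLD, &rank);
      MPI_Comm_size(MPI_COMM_WORLD, &nranks);

      for (int s = 0; s < nsizes; s++) {
          int message_size = sizes[s];
          /* each rank sends/receives message_size bytes to/from every rank */
          char *sbuf = calloc((size_t)message_size * nranks, 1);
          char *rbuf = calloc((size_t)message_size * nranks, 1);

          MPI_Barrier(MPI_COMM_WORLD);
          double time1 = MPI_Wtime();
          for (int i = 0; i < repetitions; i++)
              MPI_Alltoall(sbuf, message_size, MPI_CHAR,
                           rbuf, message_size, MPI_CHAR, MPI_COMM_WORLD);
          double time2 = MPI_Wtime();

          if (rank == 0) {
              double secs = time2 - time1;
              /* bytes sent plus received per rank, in Mbytes */
              double mbytes = 2.0 * repetitions * message_size * nranks / 1e6;
              printf("bw for %d x %d B : %6.1f Mbytes/s   time was: %.1f ms\n",
                     repetitions, message_size, mbytes / secs, secs * 1e3);
          }
          free(sbuf);
          free(rbuf);
      }

      MPI_Finalize();
      return 0;
  }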

As for the 32-water CPMD time per step: with the default settings it
was 3.664 s, with a threshold of 8192 it was about 2.55 s, and with a
threshold of 262144 it is now 1.56 s, the same as with IntelMPI (and
better than the default MVAPICH, which gives 2.55 s).

Roman


On Thu, May 21, 2009 at 9:09 AM, Pavel Shamis (Pasha) <pash...@gmail.com> wrote:
>
>> I tried running with the first dynamic rules file that Pavel proposed,
>> and it works; the time per MD step on 48 cores decreased from 2.8 s to
>> 1.8 s, as expected.
>
> Good news :-)
>
> Pasha.
>>
>> Thanks
>>
>> Roman
>>
>> On Wed, May 20, 2009 at 7:18 PM, Pavel Shamis (Pasha) <pash...@gmail.com>
>> wrote:
>>
>>>
>>> Tomorrow I will add some printfs to the collective code and check
>>> what really happens there...
>>>
>>> Pasha
>>>
>>> Peter Kjellstrom wrote:
>>>
>>>>
>>>> On Wednesday 20 May 2009, Pavel Shamis (Pasha) wrote:
>>>>
>>>>
>>>>>>
>>>>>> Disabling basic_linear seems like a good idea, but your config file
>>>>>> sets the cut-off at 128 bytes for 64 ranks (the field you set to 8192
>>>>>> seems to result in a cut-off message size of that value divided by
>>>>>> the number of ranks).
>>>>>>
>>>>>> In my testing bruck seems to win clearly (at least for 64 ranks on
>>>>>> my IB) up to 2048 bytes. Hence, the following line may be better:
>>>>>>
>>>>>>  131072 2 0 0 # switch to pairwise for size 128K/nranks
>>>>>>
>>>>>> Disclaimer: I guess this could differ quite a bit for nranks!=64 and
>>>>>> different btls.
>>>>>>
>>>>>>
>>>>>
>>>>> Sounds strange to me. From the code it looks like we take the
>>>>> threshold as is, without dividing by the number of ranks.
>>>>>
>>>>>
>>>>
>>>> Interesting. I may have had too little or too much coffee, but the
>>>> figures in my previous e-mail (3rd run, bruckto2k_pair) were from a
>>>> run with the above line. And it very much looks like it switched at
>>>> 128K/64 = 2K, not at 128K (which would have been above my largest
>>>> size of 3000 bytes and as such equivalent to all_bruck).
>>>>
>>>> I also ran tests with:
>>>>  8192 2 0 0 # ...
>>>> and there it seemed to switch somewhere between 10 bytes and 500
>>>> bytes (then most likely at 8192/64 = 128).
>>>>
>>>> My testprogram calls MPI_Alltoall like this:
>>>>  time1 = MPI_Wtime();
>>>>  for (i = 0; i < repetitions; i++) {
>>>>   MPI_Alltoall(sbuf, message_size, MPI_CHAR,
>>>>                rbuf, message_size, MPI_CHAR, MPI_COMM_WORLD);
>>>>  }
>>>>  time2 = MPI_Wtime();
>>>>
>>>> /Peter
>>>>
