The correct MCA parameters are the following:
-mca coll_tuned_use_dynamic_rules 1
-mca coll_tuned_dynamic_rules_filename ./dyn_rules
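For example, appended to mpirun (the binary name and process count here
are just illustrative):

  mpirun -np 48 --mca coll_tuned_use_dynamic_rules 1 \
      --mca coll_tuned_dynamic_rules_filename ./dyn_rules ./your_app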
You can also run the following command:
ompi_info -mca coll_tuned_use_dynamic_rules 1 -param coll tuned
This will give some insight into all the various algorithms that make up
the tuned collectives.
If I am understanding what is happening correctly, the default
MPI_Alltoall decision makes use of three algorithms. (You can look in
coll_tuned_decision_fixed.c)
If message size < 200 and communicator size > 12
    bruck
else if message size < 3000
    basic linear
else
    pairwise
end
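In C terms the fixed decision is roughly this (paraphrased, not copied;
see ompi_coll_tuned_alltoall_intra_dec_fixed() in
coll_tuned_decision_fixed.c for the real thing):

  /* sketch of the default alltoall decision */
  if ((block_size < 200) && (communicator_size > 12)) {
      /* bruck */
  } else if (block_size < 3000) {
      /* basic linear */
  } else {
      /* pairwise */
  }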
With the file Pavel has provided, the decision changes to the following
(maybe someone can confirm):
If message size < 8192
    bruck
else
    pairwise
end
Rolf
On 05/20/09 07:48, Roman Martonak wrote:
Many thanks for the highly helpful analysis. Indeed, what Peter says
seems to be precisely the case here. I tried to run the 32 waters test
on 48 cores now, with the original cutoff of 100 Ry and with a slightly
increased one of 110 Ry. Normally, with a larger cutoff one step should
obviously take more time. Increasing the cutoff, however, also
increases the size of the data buffer, and here it appears to just cross
the packet-size threshold for the different behaviour (the test was run
with openmpi-1.3.2).
--------------------------------------------------------------------------------------------------------------------------------------------------------
cutoff 100Ry
time per 1 step is 2.869 s
= ALL TO ALL COMM 151583. BYTES 2211. =
= ALL TO ALL COMM 16.741 MB/S 20.020 SEC =
--------------------------------------------------------------------------------------------------------------------------------------------------------
cutoff 110 Ry
time per 1 step is 1.879 s
= ALL TO ALL COMM 167057. BYTES 2211. =
= ALL TO ALL COMM 43.920 MB/S 8.410 SEC =
--------------------------------------------------------------------------------------------------------------------------------------------------------
so it actually runs much faster, and ALL TO ALL COMM is 2.6 times
faster. In my case the threshold seems to be somewhere between
151583/48 = 3 157 and 167057/48 = 3 480 bytes.
I saved the text that Pavel suggested
1 # num of collectives
3 # ID = 3 Alltoall collective (ID in coll_tuned.h)
1 # number of comm sizes
64 # comm size
2 # number of msg sizes
0 3 0 0 # from message size 0: bruck (alg 3), topo 0, no segmentation
8192 2 0 0 # from 8k: pairwise (alg 2), no topo or segmentation
# end of first collective
to the file dyn_rules and tried to run with the options
"--mca use_dynamic_rules 1 --mca dynamic_rules_filename ./dyn_rules"
appended to mpirun, but it does not make any change. Is this the correct
syntax to enable the rules? And will the above sample file shift the
threshold to lower values (to what value)?
Best regards
Roman
On Wed, May 20, 2009 at 10:39 AM, Peter Kjellstrom <c...@nsc.liu.se> wrote:
On Tuesday 19 May 2009, Peter Kjellstrom wrote:
On Tuesday 19 May 2009, Roman Martonak wrote:
On Tue, May 19, 2009 at 3:29 PM, Peter Kjellstrom <c...@nsc.liu.se> wrote:
On Tuesday 19 May 2009, Roman Martonak wrote:
...
openmpi-1.3.2 time per one MD step is 3.66 s
ELAPSED TIME : 0 HOURS 1 MINUTES 25.90 SECONDS
= ALL TO ALL COMM 102033. BYTES 4221. =
= ALL TO ALL COMM 7.802 MB/S 55.200 SEC =
...
With TASKGROUP=2 the summary looks as follows
...
= ALL TO ALL COMM 231821. BYTES 4221. =
= ALL TO ALL COMM 82.716 MB/S 11.830 SEC =
Wow, according to this it takes 1/5th the time to do the same number (4221)
of alltoalls if the size is (roughly) doubled... (ten times better
performance with the larger transfer size)
Something is not quite right; could you possibly try to run just the
alltoalls, as I suggested in my previous e-mail?
I was curious so I ran some tests. First, it seems that the size reported
by CPMD is the total size of the data buffer, not the per-rank message
size. Running alltoalls with 231821/64 and 102033/64 bytes gives this (on
a similar setup):
bw for 4221 x 1595 B : 36.5 Mbytes/s time was: 23.3 s
bw for 4221 x 3623 B : 125.4 Mbytes/s time was: 15.4 s
bw for 4221 x 1595 B : 36.4 Mbytes/s time was: 23.3 s
bw for 4221 x 3623 B : 125.6 Mbytes/s time was: 15.3 s
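(The test was essentially a timed loop of MPI_Alltoall calls; a minimal
sketch of such a probe follows, not the exact program I used, with the
argument handling and output format just illustrative:)

  #include <mpi.h>
  #include <stdio.h>
  #include <stdlib.h>

  int main(int argc, char **argv)
  {
      int rank, nprocs, i;
      MPI_Init(&argc, &argv);
      MPI_Comm_rank(MPI_COMM_WORLD, &rank);
      MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

      int msgsize = atoi(argv[1]);   /* bytes sent to each rank, e.g. 1595 */
      int iters   = atoi(argv[2]);   /* number of alltoalls, e.g. 4221 */
      char *sbuf  = malloc((size_t)msgsize * nprocs);
      char *rbuf  = malloc((size_t)msgsize * nprocs);

      MPI_Barrier(MPI_COMM_WORLD);
      double t0 = MPI_Wtime();
      for (i = 0; i < iters; i++)
          MPI_Alltoall(sbuf, msgsize, MPI_BYTE,
                       rbuf, msgsize, MPI_BYTE, MPI_COMM_WORLD);
      double t = MPI_Wtime() - t0;

      /* bandwidth counts bytes both sent and received by one rank */
      if (rank == 0)
          printf("bw for %d x %d B : %.1f Mbytes/s time was: %.1f s\n",
                 iters, msgsize,
                 2.0 * msgsize * nprocs * iters / 1e6 / t, t);

      free(sbuf);
      free(rbuf);
      MPI_Finalize();
      return 0;
  }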
So it does seem that OpenMPI has some problems with small alltoalls. It is
obviously broken when you can get things across faster by sending more...
As a reference I ran with a commercial MPI using the same program and node-set
(I had neither MVAPICH nor IntelMPI on this system):
bw for 4221 x 1595 B : 71.4 Mbytes/s time was: 11.9 s
bw for 4221 x 3623 B : 125.8 Mbytes/s time was: 15.3 s
bw for 4221 x 1595 B : 71.1 Mbytes/s time was: 11.9 s
bw for 4221 x 3623 B : 125.5 Mbytes/s time was: 15.3 s
To see when OpenMPI falls over I ran with an increasing packet size:
bw for 10 x 2900 B : 59.8 Mbytes/s time was: 61.2 ms
bw for 10 x 2925 B : 59.2 Mbytes/s time was: 62.2 ms
bw for 10 x 2950 B : 59.4 Mbytes/s time was: 62.6 ms
bw for 10 x 2975 B : 58.5 Mbytes/s time was: 64.1 ms
bw for 10 x 3000 B : 113.5 Mbytes/s time was: 33.3 ms
bw for 10 x 3100 B : 116.1 Mbytes/s time was: 33.6 ms
The problem seems to be for packets with 1000 bytes < size < 3000 bytes,
with a hard edge at 3000 bytes. Your CPMD was communicating at more or
less the worst-case packet size.
These are the figures for my "reference" MPI:
bw for 10 x 2900 B : 110.3 Mbytes/s time was: 33.1 ms
bw for 10 x 2925 B : 110.4 Mbytes/s time was: 33.4 ms
bw for 10 x 2950 B : 111.5 Mbytes/s time was: 33.3 ms
bw for 10 x 2975 B : 112.4 Mbytes/s time was: 33.4 ms
bw for 10 x 3000 B : 118.2 Mbytes/s time was: 32.0 ms
bw for 10 x 3100 B : 114.1 Mbytes/s time was: 34.2 ms
Setup details:
hw: dual-socket quad-core Harpertowns with ConnectX IB and a 1:1 two-level tree
sw: CentOS-5.3 x86_64 with OpenMPI-1.3b2 (did not have time to try 1.3.2) on
OFED from CentOS (1.3.2-ish, I think).
/Peter
_______________________________________________
users mailing list
us...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/users
--
=========================
rolf.vandeva...@sun.com
781-442-3043
=========================