Hello Konstantin,
By using coll_basic_crossover 8 you are forcing all of your
benchmarks to use the basic collectives, which offer poor
performance. When I run the SKaMPI Alltoall benchmark with the tuned
collectives I get the following results, which seem to scale quite
well (a minimal stand-alone reproducer is sketched after the
basic-collectives numbers below). When I have a bit more time I will
provide comparisons with MPICH.
mpirun -np 8 -mca btl tcp -mca coll self,basic,tuned \
  -mca mpi_paffinity_alone 1 ./skampi
#/*@insyncol_MPI_Alltoall-nodes-short-SM.ski*/
2 47.3 0.4 8 47.3 0.4 8
3 57.9 1.7 40 57.9 1.7 40
4 65.2 1.5 8 65.2 1.5 8
5 74.0 2.1 10 74.0 2.1 10
6 84.3 1.5 8 84.3 1.5 8
7 89.9 0.4 8 89.9 0.4 8
8 107.8 1.9 8 107.8 1.9 8
#/*@insyncol_MPI_Alltoall-nodes-long-SM.ski*/
2 1049.0 29.8 8 1049.0 29.8 8
3 1677.7 49.8 31 1677.7 49.8 31
4 3287.0 96.8 11 3287.0 96.8 11
5 3247.3 57.8 8 3247.3 57.8 8
6 4802.5 98.6 8 4802.5 98.6 8
7 6166.4 70.3 8 6166.4 70.3 8
8 7380.8 116.1 8 7380.8 116.1 8
If I use the basic collectives then things do fall apart with long
messages, but this is expected.
mpirun -np 8 -mca btl tcp -mca coll self,basic \
  -mca mpi_paffinity_alone 1 ./skampi
#/*@insyncol_MPI_Alltoall-nodes-short-SM.ski*/
2 45.7 0.2 8 45.7 0.2 8
3 55.0 0.9 8 55.0 0.9 8
4 64.2 0.4 8 64.2 0.4 8
5 73.4 1.2 8 73.4 1.2 8
6 83.5 0.5 8 83.5 0.5 8
7 92.8 1.4 8 92.8 1.4 8
8 108.1 2.2 8 108.1 2.2 8
#/*@insyncol_MPI_Alltoall-nodes-long-SM.ski*/
2 798.0 1.5 8 798.0 1.5 8
3 1756.0 38.5 8 1756.0 38.5 8
4 99601.8 60958.5 5 99601.8 60958.5 5
5 134846.3 31683.9 11 134846.3 31683.9 11
6 224243.7 6599.1 11 224243.7 6599.1 11
7 230021.1 6788.1 10 230021.1 6788.1 10
8 242596.5 7693.6 6 242596.5 7693.6 6
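In case you want to cross-check outside of SKaMPI, something along
the lines of the sketch below should show the same basic vs. tuned
gap. This is just a rough stand-in for the SKaMPI Alltoall test, not
part of it; the 16384-int message size mirrors the "long" case and
the iteration count is arbitrary.

/* alltoall_bench.c - minimal MPI_Alltoall timing sketch (not SKaMPI).
 * Build: mpicc -O2 alltoall_bench.c -o alltoall_bench
 * Run:   mpirun -np 8 -mca coll self,basic,tuned ./alltoall_bench
 */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    const int count = 16384;   /* ints per destination, ~the "long" case */
    const int iters = 10;
    int rank, nprocs, i;
    int *sendbuf, *recvbuf;
    double t0, usec;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    sendbuf = malloc((size_t)count * nprocs * sizeof(int));
    recvbuf = malloc((size_t)count * nprocs * sizeof(int));
    for (i = 0; i < count * nprocs; i++)
        sendbuf[i] = rank;

    /* one warm-up call so connection setup is not timed */
    MPI_Alltoall(sendbuf, count, MPI_INT, recvbuf, count, MPI_INT,
                 MPI_COMM_WORLD);

    MPI_Barrier(MPI_COMM_WORLD);
    t0 = MPI_Wtime();
    for (i = 0; i < iters; i++)
        MPI_Alltoall(sendbuf, count, MPI_INT, recvbuf, count, MPI_INT,
                     MPI_COMM_WORLD);
    usec = (MPI_Wtime() - t0) / iters * 1.0e6;

    if (rank == 0)
        printf("%d procs: %.1f us per MPI_Alltoall\n", nprocs, usec);

    free(sendbuf);
    free(recvbuf);
    MPI_Finalize();
    return 0;
}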
On Feb 2, 2006, at 5:10 PM, Konstantin Kudin wrote:
Hi all,
There seem to have been problems with the attachment. Here is the
report:
I did some tests with Open-MPI version 1.0.2a4r8848. My motivation was
an extreme degradation of all-to-all MPI performance on 8 cpus (it ran
like 1 cpu). At the same time, MPICH 1.2.7 on 8 cpus runs more like it
does on 4 (not like 1 !!!).
This was done using SKaMPI version 4.1 from:
http://liinwww.ira.uka.de/~skampi/skampi4.1.tar.gz
The system is a bunch of dual Opterons connected by Gigabit Ethernet.
The MPI operation I am most interested in is all-to-all exchange.
First of all, there seem to be some problems with the logarithmic
approach. Here is what I mean. In the following, the first column is
the packet size, the next one is the average time (microseconds), and
then comes the standard deviation. The test was done on 8 cpus
(4 dual nodes).
mpirun -np 8 -mca mpi_paffinity_alone 1 skampi41
#/*@inp2p_MPI_Send-MPI_Iprobe_Recv.ski*/
#Description of the MPI_Send-MPI_Iprobe_Recv measurement:
0 74.3 1.3 8 74.3 1.3 8
16 77.4 2.1 8 77.4 2.1 8 0.0 0.0
32 398.9 323.4 100 398.9 323.4 100 0.0 0.0
64 80.7 2.3 9 80.7 2.3 9 0.0 0.0
80 79.3 2.3 13 79.3 2.3 13 0.0 0.0
mpirun -np 8 -mca mpi_paffinity_alone 1 -mca coll_basic_crossover 8 skampi41
#/*@inp2p_MPI_Send-MPI_Iprobe_Recv.ski*/
#Description of the MPI_Send-MPI_Iprobe_Recv measurement:
0 76.7 2.1 8 76.7 2.1 8
16 75.8 1.5 8 75.8 1.5 8 0.0 0.0
32 74.4 0.6 8 74.4 0.6 8 0.0 0.0
64 76.3 0.4 8 76.3 0.4 8 0.0 0.0
80 76.7 0.5 8 76.7 0.5 8 0.0 0.0
These anomalously large times for certain packet sizes (either 16 or
32), which appear unless coll_basic_crossover is increased to 8, show
up in a whole set of tests, so this is not a fluke.
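For what it's worth, a stripped-down ping-pong that spins on
MPI_Iprobe before the receive (a rough stand-in for the test above,
not SKaMPI itself; the message sizes match the table, the iteration
count is arbitrary) might tell whether the 16/32-byte spike is
reproducible outside the benchmark:

/* iprobe_pingpong.c - minimal Send / Iprobe+Recv round-trip sketch.
 * Build: mpicc -O2 iprobe_pingpong.c -o iprobe_pingpong
 * Run:   mpirun -np 2 ./iprobe_pingpong
 */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    const int sizes[] = {0, 16, 32, 64, 80};
    const int nsizes = sizeof(sizes) / sizeof(sizes[0]);
    const int iters = 100;
    char buf[128] = {0};
    int rank, s, i;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    for (s = 0; s < nsizes; s++) {
        int n = sizes[s];
        double t0, usec;

        MPI_Barrier(MPI_COMM_WORLD);
        t0 = MPI_Wtime();
        for (i = 0; i < iters; i++) {
            if (rank == 0) {
                MPI_Send(buf, n, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
                MPI_Recv(buf, n, MPI_CHAR, 1, 0, MPI_COMM_WORLD,
                         MPI_STATUS_IGNORE);
            } else if (rank == 1) {
                int flag = 0;
                MPI_Status st;
                /* spin on MPI_Iprobe before receiving, as in the test name */
                while (!flag)
                    MPI_Iprobe(0, 0, MPI_COMM_WORLD, &flag, &st);
                MPI_Recv(buf, n, MPI_CHAR, 0, 0, MPI_COMM_WORLD,
                         MPI_STATUS_IGNORE);
                MPI_Send(buf, n, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
            }
        }
        usec = (MPI_Wtime() - t0) / iters * 1.0e6;
        if (rank == 0)
            printf("%3d bytes: %.1f us round trip\n", n, usec);
    }
    MPI_Finalize();
    return 0;
}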
Next, the all-to-all thing. The short test used 64x4 byte messages
(256 bytes); the long one used 16384x4 byte messages (64 KB).
mpirun -np 8 -mca mpi_paffinity_alone 1 -mca coll_basic_crossover 8 skampi41
#/*@insyncol_MPI_Alltoall-nodes-short-SM.ski*/
2 12.7 0.2 8 12.7 0.2 8
3 56.1 0.3 8 56.1 0.3 8
4 69.9 1.8 8 69.9 1.8 8
5 87.0 2.2 8 87.0 2.2 8
6 99.7 1.5 8 99.7 1.5 8
7 122.5 2.2 8 122.5 2.2 8
8 147.5 2.5 8 147.5 2.5 8
#/*@insyncol_MPI_Alltoall-nodes-long-SM.ski*/
2 188.5 0.3 8 188.5 0.3 8
3 1680.5 16.6 8 1680.5 16.6 8
4 2759.0 15.5 8 2759.0 15.5 8
5 4110.2 34.0 8 4110.2 34.0 8
6 75443.5 44383.9 6 75443.5 44383.9 6
7 242133.4 870.5 2 242133.4 870.5 2
8 252436.7 4016.8 8 252436.7 4016.8 8
mpirun -np 8 -mca mpi_paffinity_alone 1 -mca coll_basic_crossover 8 \
  -mca coll_sm_info_num_procs 8 -mca btl_tcp_sndbuf 8388608 \
  -mca btl_tcp_rcvbuf 8388608 skampi41
#/*@insyncol_MPI_Alltoall-nodes-short-SM.ski*/
2 13.1 0.1 8 13.1 0.1 8
3 57.4 0.3 8 57.4 0.3 8
4 73.7 1.6 8 73.7 1.6 8
5 87.1 2.0 8 87.1 2.0 8
6 103.7 2.0 8 103.7 2.0 8
7 118.3 2.4 8 118.3 2.4 8
8 146.7 3.1 8 146.7 3.1 8
#/*@insyncol_MPI_Alltoall-nodes-long-SM.ski*/
2 185.8 0.6 8 185.8 0.6 8
3 1760.4 17.3 8 1760.4 17.3 8
4 2916.8 52.1 8 2916.8 52.1 8
5 106993.4 102562.4 2 106993.4 102562.4 2
6 260723.1 6679.1 2 260723.1 6679.1 2
7 240225.2 6369.8 6 240225.2 6369.8 6
8 250848.1 4863.2 6 250848.1 4863.2 6
mpirun -np 8 -mca mpi_paffinity_alone 1 -mca coll_basic_crossover 8 \
  -mca coll_sm_info_num_procs 8 -mca btl_tcp_sndbuf 8388608 \
  -mca btl_tcp_rcvbuf 8388608 -mca btl_tcp_min_send_size 32768 \
  -mca btl_tcp_max_send_size 65536 skampi41
#/*@insyncol_MPI_Alltoall-nodes-short-SM.ski*/
2 13.5 0.2 8 13.5 0.2 8
3 57.3 1.8 8 57.3 1.8 8
4 68.8 0.5 8 68.8 0.5 8
5 83.2 0.6 8 83.2 0.6 8
6 102.9 1.8 8 102.9 1.8 8
7 117.4 2.3 8 117.4 2.3 8
8 149.3 2.1 8 149.3 2.1 8
#/*@insyncol_MPI_Alltoall-nodes-long-SM.ski*/
2 187.5 0.5 8 187.5 0.5 8
3 1661.1 33.4 8 1661.1 33.4 8
4 2715.9 6.9 8 2715.9 6.9 8
5 116805.2 43036.4 8 116805.2 43036.4 8
6 163177.7 41363.4 7 163177.7 41363.4 7
7 233105.5 20621.4 2 233105.5 20621.4 2
8 332049.5 83860.5 2 332049.5 83860.5 2
Same stuff for MPICH 1.2.7 (sockets, no shared memory):
#/*@insyncol_MPI_Alltoall-nodes-short-SM.ski*/
2 312.5 106.5 100 312.5 106.5 100
3 546.9 136.2 100 546.9 136.2 100
4 2929.7 195.3 100 2929.7 195.3 100
5 2070.3 203.7 100 2070.3 203.7 100
6 2929.7 170.0 100 2929.7 170.0 100
7 1328.1 186.0 100 1328.1 186.0 100
8 3203.1 244.4 100 3203.1 244.4 100
#/*@insyncol_MPI_Alltoall-nodes-long-SM.ski*/
2 390.6 117.8 100 390.6 117.8 100
3 3164.1 252.6 100 3164.1 252.6 100
4 5859.4 196.3 100 5859.4 196.3 100
5 15234.4 6895.1 30 15234.4 6895.1 30
6 18136.2 5563.7 14 18136.2 5563.7 14
7 14204.5 2898.0 11 14204.5 2898.0 11
8 11718.8 1594.7 4 11718.8 1594.7 4
So, as one can see, MPICH latencies are much higher for small packets,
yet things are far more consistent for larger ones. Depending on the
settings, Open-MPI degrades at either 5 or 6 cpus.
Konstantin