Hello Konstantin,
By using coll_basic_crossover 8 you are forcing all of your
benchmarks to use the basic collectives, which offer poor
performance. When I run the SKaMPI Alltoall benchmark with the tuned
collectives I get the following results, which seem to scale quite
well (a minimal stand-alone reproducer is sketched after the
basic-collectives numbers below). When I have a bit more time I will
provide comparisons with MPICH.
mpirun -np 8 -mca btl tcp -mca coll self,basic,tuned \
  -mca mpi_paffinity_alone 1 ./skampi
#/*@insyncol_MPI_Alltoall-nodes-short-SM.ski*/
2 47.3 0.4 8 47.3 0.4 8
3 57.9 1.7 40 57.9 1.7 40
4 65.2 1.5 8 65.2 1.5 8
5 74.0 2.1 10 74.0 2.1 10
6 84.3 1.5 8 84.3 1.5 8
7 89.9 0.4 8 89.9 0.4 8
8 107.8 1.9 8 107.8 1.9 8
#/*@insyncol_MPI_Alltoall-nodes-long-SM.ski*/
2 1049.0 29.8 8 1049.0 29.8 8
3 1677.7 49.8 31 1677.7 49.8 31
4 3287.0 96.8 11 3287.0 96.8 11
5 3247.3 57.8 8 3247.3 57.8 8
6 4802.5 98.6 8 4802.5 98.6 8
7 6166.4 70.3 8 6166.4 70.3 8
8 7380.8 116.1 8 7380.8 116.1 8
If I use the basic collectives then things do fall apart with long
messages, but this is expected.
mpirun -np 8 -mca btl tcp -mca coll self,basic \
  -mca mpi_paffinity_alone 1 ./skampi
#/*@insyncol_MPI_Alltoall-nodes-short-SM.ski*/
2 45.7 0.2 8 45.7 0.2 8
3 55.0 0.9 8 55.0 0.9 8
4 64.2 0.4 8 64.2 0.4 8
5 73.4 1.2 8 73.4 1.2 8
6 83.5 0.5 8 83.5 0.5 8
7 92.8 1.4 8 92.8 1.4 8
8 108.1 2.2 8 108.1 2.2 8
#/*@insyncol_MPI_Alltoall-nodes-long-SM.ski*/
2 798.0 1.5 8 798.0 1.5 8
3 1756.0 38.5 8 1756.0 38.5 8
4 99601.8 60958.5 5 99601.8 60958.5 5
5 134846.3 31683.9 11 134846.3 31683.9 11
6 224243.7 6599.1 11 224243.7 6599.1 11
7 230021.1 6788.1 10 230021.1 6788.1 10
8 242596.5 7693.6 6 242596.5 7693.6 6
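In case you want to cross-check outside of SKaMPI, something along
the lines of the sketch below should show the same basic vs. tuned
gap. This is just a rough stand-in for the SKaMPI Alltoall test, not
part of it; the 16384-int message size mirrors the "long" case and
the iteration count is arbitrary.

/* alltoall_bench.c - minimal MPI_Alltoall timing sketch (not SKaMPI).
 * Build: mpicc -O2 alltoall_bench.c -o alltoall_bench
 * Run:   mpirun -np 8 -mca coll self,basic,tuned ./alltoall_bench
 */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    const int count = 16384;   /* ints per destination, ~the "long" case */
    const int iters = 10;
    int rank, nprocs, i;
    int *sendbuf, *recvbuf;
    double t0, usec;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    sendbuf = malloc((size_t)count * nprocs * sizeof(int));
    recvbuf = malloc((size_t)count * nprocs * sizeof(int));
    for (i = 0; i < count * nprocs; i++)
        sendbuf[i] = rank;

    /* one warm-up call so connection setup is not timed */
    MPI_Alltoall(sendbuf, count, MPI_INT, recvbuf, count, MPI_INT,
                 MPI_COMM_WORLD);

    MPI_Barrier(MPI_COMM_WORLD);
    t0 = MPI_Wtime();
    for (i = 0; i < iters; i++)
        MPI_Alltoall(sendbuf, count, MPI_INT, recvbuf, count, MPI_INT,
                     MPI_COMM_WORLD);
    usec = (MPI_Wtime() - t0) / iters * 1.0e6;

    if (rank == 0)
        printf("%d procs: %.1f us per MPI_Alltoall\n", nprocs, usec);

    free(sendbuf);
    free(recvbuf);
    MPI_Finalize();
    return 0;
}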
On Feb 2, 2006, at 5:10 PM, Konstantin Kudin wrote:
Hi all,
There seem to have been problems with the attachment. Here is the
report:
I did some tests with Open-MPI version 1.0.2a4r8848. My motivation was
an extreme degradation of all-to-all MPI performance on 8 cpus (it ran
like 1 cpu). At the same time, MPICH 1.2.7 on 8 cpus runs more like it
does on 4 (not like 1 !!!).
This was done using SKaMPI version 4.1 from:
http://liinwww.ira.uka.de/~skampi/skampi4.1.tar.gz
The system is a bunch of dual Opterons connected by Gigabit Ethernet.
The MPI operation I am most interested in is all-to-all exchange.
First of all, there seem to be some problems with the logarithmic
approach. Here is what I mean. In the following, the first column is
the packet size, the next one is the average time (microseconds), and
then comes the standard deviation. The test was done on 8 cpus
(4 dual nodes).
mpirun -np 8 -mca mpi_paffinity_alone 1 skampi41
#/*@inp2p_MPI_Send-MPI_Iprobe_Recv.ski*/
#Description of the MPI_Send-MPI_Iprobe_Recv measurement:
0 74.3 1.3 8 74.3 1.3 8
16 77.4 2.1 8 77.4 2.1 8 0.0 0.0
32 398.9 323.4 100 398.9 323.4 100 0.0 0.0
64 80.7 2.3 9 80.7 2.3 9 0.0 0.0
80 79.3 2.3 13 79.3 2.3 13 0.0 0.0
mpirun -np 8 -mca mpi_paffinity_alone 1 -mca coll_basic_crossover 8 skampi41
#/*@inp2p_MPI_Send-MPI_Iprobe_Recv.ski*/
#Description of the MPI_Send-MPI_Iprobe_Recv measurement:
0 76.7 2.1 8 76.7 2.1 8
16 75.8 1.5 8 75.8 1.5 8 0.0 0.0
32 74.4 0.6 8 74.4 0.6 8 0.0 0.0
64 76.3 0.4 8 76.3 0.4 8 0.0 0.0
80 76.7 0.5 8 76.7 0.5 8 0.0 0.0
These anomalously large times for certain packet sizes (either 16 or
32), which appear unless coll_basic_crossover is increased to 8, show
up in a whole set of tests, so this is not a fluke.
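For what it's worth, a stripped-down ping-pong that spins on
MPI_Iprobe before the receive (a rough stand-in for the test above,
not SKaMPI itself; the message sizes match the table, the iteration
count is arbitrary) might tell whether the 16/32-byte spike is
reproducible outside the benchmark:

/* iprobe_pingpong.c - minimal Send / Iprobe+Recv round-trip sketch.
 * Build: mpicc -O2 iprobe_pingpong.c -o iprobe_pingpong
 * Run:   mpirun -np 2 ./iprobe_pingpong
 */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    const int sizes[] = {0, 16, 32, 64, 80};
    const int nsizes = sizeof(sizes) / sizeof(sizes[0]);
    const int iters = 100;
    char buf[128] = {0};
    int rank, s, i;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    for (s = 0; s < nsizes; s++) {
        int n = sizes[s];
        double t0, usec;

        MPI_Barrier(MPI_COMM_WORLD);
        t0 = MPI_Wtime();
        for (i = 0; i < iters; i++) {
            if (rank == 0) {
                MPI_Send(buf, n, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
                MPI_Recv(buf, n, MPI_CHAR, 1, 0, MPI_COMM_WORLD,
                         MPI_STATUS_IGNORE);
            } else if (rank == 1) {
                int flag = 0;
                MPI_Status st;
                /* spin on MPI_Iprobe before receiving, as in the test name */
                while (!flag)
                    MPI_Iprobe(0, 0, MPI_COMM_WORLD, &flag, &st);
                MPI_Recv(buf, n, MPI_CHAR, 0, 0, MPI_COMM_WORLD,
                         MPI_STATUS_IGNORE);
                MPI_Send(buf, n, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
            }
        }
        usec = (MPI_Wtime() - t0) / iters * 1.0e6;
        if (rank == 0)
            printf("%3d bytes: %.1f us round trip\n", n, usec);
    }
    MPI_Finalize();
    return 0;
}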
Next, the all-to-all thing. The short test used 64x4 byte messages
(256 bytes); the long one used 16384x4 byte messages (64 KB).
mpirun -np 8 -mca mpi_paffinity_alone 1 -mca coll_basic_crossover 8 skampi41
#/*@insyncol_MPI_Alltoall-nodes-short-SM.ski*/
2 12.7 0.2 8 12.7 0.2 8
3 56.1 0.3 8 56.1 0.3 8
4 69.9 1.8 8 69.9 1.8 8
5 87.0 2.2 8 87.0 2.2 8
6 99.7 1.5 8 99.7 1.5 8
7 122.5 2.2 8 122.5 2.2 8
8 147.5 2.5 8 147.5 2.5 8
#/*@insyncol_MPI_Alltoall-nodes-long-SM.ski*/
2 188.5 0.3 8 188.5 0.3 8
3 1680.5 16.6 8 1680.5 16.6 8
4 2759.0 15.5 8 2759.0 15.5 8
5 4110.2 34.0 8 4110.2 34.0 8
6 75443.5 44383.9 6 75443.5 44383.9 6
7 242133.4 870.5 2 242133.4 870.5 2
8 252436.7 4016.8 8 252436.7 4016.8 8
mpirun -np 8 -mca mpi_paffinity_alone 1 -mca coll_basic_crossover 8 \
  -mca coll_sm_info_num_procs 8 -mca btl_tcp_sndbuf 8388608 \
  -mca btl_tcp_rcvbuf 8388608 skampi41
#/*@insyncol_MPI_Alltoall-nodes-short-SM.ski*/
2 13.1 0.1 8 13.1 0.1 8
3 57.4 0.3 8 57.4 0.3 8
4 73.7 1.6 8 73.7 1.6 8
5 87.1 2.0 8 87.1 2.0 8
6 103.7 2.0 8 103.7 2.0 8
7 118.3 2.4 8 118.3 2.4 8
8 146.7 3.1 8 146.7 3.1 8
#/*@insyncol_MPI_Alltoall-nodes-long-SM.ski*/
2 185.8 0.6 8 185.8 0.6 8
3 1760.4 17.3 8 1760.4 17.3 8
4 2916.8 52.1 8 2916.8 52.1 8
5 106993.4 102562.4 2 106993.4 102562.4 2
6 260723.1 6679.1 2 260723.1 6679.1 2
7 240225.2 6369.8 6 240225.2 6369.8 6
8 250848.1 4863.2 6 250848.1 4863.2 6
mpirun -np 8 -mca mpi_paffinity_alone 1 -mca coll_basic_crossover 8 \
  -mca coll_sm_info_num_procs 8 -mca btl_tcp_sndbuf 8388608 \
  -mca btl_tcp_rcvbuf 8388608 -mca btl_tcp_min_send_size 32768 \
  -mca btl_tcp_max_send_size 65536 skampi41
#/*@insyncol_MPI_Alltoall-nodes-short-SM.ski*/
2 13.5 0.2 8 13.5 0.2 8
3 57.3 1.8 8 57.3 1.8 8
4 68.8 0.5 8 68.8 0.5 8
5 83.2 0.6 8 83.2 0.6 8
6 102.9 1.8 8 102.9 1.8 8
7 117.4 2.3 8 117.4 2.3 8
8 149.3 2.1 8 149.3 2.1 8
#/*@insyncol_MPI_Alltoall-nodes-long-SM.ski*/
2 187.5 0.5 8 187.5 0.5 8
3 1661.1 33.4 8 1661.1 33.4 8
4 2715.9 6.9 8 2715.9 6.9 8
5 116805.2 43036.4 8 116805.2 43036.4 8
6 163177.7 41363.4 7 163177.7 41363.4 7
7 233105.5 20621.4 2 233105.5 20621.4 2
8 332049.5 83860.5 2 332049.5 83860.5 2
Same stuff for MPICH 1.2.7 (sockets, no shared memory):
#/*@insyncol_MPI_Alltoall-nodes-short-SM.ski*/
2 312.5 106.5 100 312.5 106.5 100
3 546.9 136.2 100 546.9 136.2 100
4 2929.7 195.3 100 2929.7 195.3 100
5 2070.3 203.7 100 2070.3 203.7 100
6 2929.7 170.0 100 2929.7 170.0 100
7 1328.1 186.0 100 1328.1 186.0 100
8 3203.1 244.4 100 3203.1 244.4 100
#/*@insyncol_MPI_Alltoall-nodes-long-SM.ski*/
2 390.6 117.8 100 390.6 117.8 100
3 3164.1 252.6 100 3164.1 252.6 100
4 5859.4 196.3 100 5859.4 196.3 100
5 15234.4 6895.1 30 15234.4 6895.1 30
6 18136.2 5563.7 14 18136.2 5563.7 14
7 14204.5 2898.0 11 14204.5 2898.0 11
8 11718.8 1594.7 4 11718.8 1594.7 4
So, as one can see, MPICH latencies are much higher for small packets,
yet things are far more consistent for larger ones. Depending on the
settings, Open-MPI degrades at either 5 or 6 cpus.
Konstantin