I did some tests of Open MPI version 1.0.2a4r8848. My motivation was an extreme degradation of all-to-all MPI performance on 8 CPUs (it ran like 1 CPU). At the same time, MPICH 1.2.7 on 8 CPUs runs more like it does on 4 (not like 1!). The tests were done with SKaMPI 4.1, from http://liinwww.ira.uka.de/~skampi/skampi4.1.tar.gz. The system is a bunch of dual-Opteron nodes connected by Gigabit Ethernet. The MPI operation I am most interested in is the all-to-all exchange.

First of all, there seem to be some problems with the logarithmic approach (the algorithms used once the communicator size reaches coll_basic_crossover). Here is what I mean. In the output below, the first column is the packet size, the next one is the average time (microseconds), and then comes the standard deviation. The test was run on 8 CPUs (4 dual nodes).

> mpirun -np 8 -mca mpi_paffinity_alone 1 skampi41

#/*@inp2p_MPI_Send-MPI_Iprobe_Recv.ski*/
#Description of the MPI_Send-MPI_Iprobe_Recv measurement:
  0     74.3    1.3    8     74.3    1.3    8
 16     77.4    2.1    8     77.4    2.1    8    0.0   0.0
 32    398.9  323.4  100    398.9  323.4  100    0.0   0.0
 64     80.7    2.3    9     80.7    2.3    9    0.0   0.0
 80     79.3    2.3   13     79.3    2.3   13    0.0   0.0

> mpirun -np 8 -mca mpi_paffinity_alone 1 -mca coll_basic_crossover 8 skampi41

#/*@inp2p_MPI_Send-MPI_Iprobe_Recv.ski*/
#Description of the MPI_Send-MPI_Iprobe_Recv measurement:
  0     76.7    2.1    8     76.7    2.1    8
 16     75.8    1.5    8     75.8    1.5    8    0.0   0.0
 32     74.4    0.6    8     74.4    0.6    8    0.0   0.0
 64     76.3    0.4    8     76.3    0.4    8    0.0   0.0
 80     76.7    0.5    8     76.7    0.5    8    0.0   0.0

These anomalously large times for certain packet sizes (either 16 or 32 bytes) show up in a whole set of tests unless coll_basic_crossover is raised to 8, so this is not a fluke.
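For what it is worth, my reading of the inp2p_MPI_Send-MPI_Iprobe_Recv pattern is a ping-pong where the receiving side polls with MPI_Iprobe before posting the receive. The sketch below is only my own approximation of it (not SKaMPI's code; SKaMPI also picks the repetition counts adaptively), just to make explicit what kind of operation the numbers above refer to:

/* Approximate MPI_Send-MPI_Iprobe_Recv ping-pong (my sketch, not SKaMPI's):
 * rank 0 sends n bytes and waits for the echo; rank 1 polls with MPI_Iprobe
 * until the message has arrived, then receives it and echoes it back. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    char buf[128];
    int sizes[5] = { 0, 16, 32, 64, 80 };    /* the packet sizes shown above */
    int rank, flag, i, rep, n;
    int reps = 100;
    double t0;
    MPI_Status st;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    for (i = 0; i < 5; i++) {
        n = sizes[i];
        MPI_Barrier(MPI_COMM_WORLD);
        t0 = MPI_Wtime();
        for (rep = 0; rep < reps; rep++) {
            if (rank == 0) {                 /* sender: send, then wait for the echo */
                MPI_Send(buf, n, MPI_BYTE, 1, 0, MPI_COMM_WORLD);
                MPI_Recv(buf, n, MPI_BYTE, 1, 0, MPI_COMM_WORLD, &st);
            } else if (rank == 1) {          /* receiver: poll, receive, echo back */
                flag = 0;
                while (!flag)
                    MPI_Iprobe(0, 0, MPI_COMM_WORLD, &flag, &st);
                MPI_Recv(buf, n, MPI_BYTE, 0, 0, MPI_COMM_WORLD, &st);
                MPI_Send(buf, n, MPI_BYTE, 0, 0, MPI_COMM_WORLD);
            }
        }
        if (rank == 0)
            printf("%3d bytes: %.1f us per round trip\n",
                   n, (MPI_Wtime() - t0) / reps * 1e6);
    }
    MPI_Finalize();
    return 0;
}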
Next, the all-to-all results. The short test used 64 x 4-byte messages; the long one used 16384 x 4-byte messages. In these tables the first column is the number of CPUs involved.

> mpirun -np 8 -mca mpi_paffinity_alone 1 -mca coll_basic_crossover 8 skampi41

#/*@insyncol_MPI_Alltoall-nodes-short-SM.ski*/
2      12.7   0.2   8     12.7   0.2   8
3      56.1   0.3   8     56.1   0.3   8
4      69.9   1.8   8     69.9   1.8   8
5      87.0   2.2   8     87.0   2.2   8
6      99.7   1.5   8     99.7   1.5   8
7     122.5   2.2   8    122.5   2.2   8
8     147.5   2.5   8    147.5   2.5   8

#/*@insyncol_MPI_Alltoall-nodes-long-SM.ski*/
2       188.5      0.3   8       188.5      0.3   8
3      1680.5     16.6   8      1680.5     16.6   8
4      2759.0     15.5   8      2759.0     15.5   8
5      4110.2     34.0   8      4110.2     34.0   8
6     75443.5  44383.9   6     75443.5  44383.9   6
7    242133.4    870.5   2    242133.4    870.5   2
8    252436.7   4016.8   8    252436.7   4016.8   8

> mpirun -np 8 -mca mpi_paffinity_alone 1 -mca coll_basic_crossover 8 \
    -mca coll_sm_info_num_procs 8 -mca btl_tcp_sndbuf 8388608 \
    -mca btl_tcp_rcvbuf 8388608 skampi41

#/*@insyncol_MPI_Alltoall-nodes-short-SM.ski*/
2      13.1   0.1   8     13.1   0.1   8
3      57.4   0.3   8     57.4   0.3   8
4      73.7   1.6   8     73.7   1.6   8
5      87.1   2.0   8     87.1   2.0   8
6     103.7   2.0   8    103.7   2.0   8
7     118.3   2.4   8    118.3   2.4   8
8     146.7   3.1   8    146.7   3.1   8

#/*@insyncol_MPI_Alltoall-nodes-long-SM.ski*/
2       185.8       0.6   8       185.8       0.6   8
3      1760.4      17.3   8      1760.4      17.3   8
4      2916.8      52.1   8      2916.8      52.1   8
5    106993.4  102562.4   2    106993.4  102562.4   2
6    260723.1    6679.1   2    260723.1    6679.1   2
7    240225.2    6369.8   6    240225.2    6369.8   6
8    250848.1    4863.2   6    250848.1    4863.2   6

> mpirun -np 8 -mca mpi_paffinity_alone 1 -mca coll_basic_crossover 8 \
    -mca coll_sm_info_num_procs 8 -mca btl_tcp_sndbuf 8388608 \
    -mca btl_tcp_rcvbuf 8388608 -mca btl_tcp_min_send_size 32768 \
    -mca btl_tcp_max_send_size 65536 skampi41

#/*@insyncol_MPI_Alltoall-nodes-short-SM.ski*/
2      13.5   0.2   8     13.5   0.2   8
3      57.3   1.8   8     57.3   1.8   8
4      68.8   0.5   8     68.8   0.5   8
5      83.2   0.6   8     83.2   0.6   8
6     102.9   1.8   8    102.9   1.8   8
7     117.4   2.3   8    117.4   2.3   8
8     149.3   2.1   8    149.3   2.1   8

#/*@insyncol_MPI_Alltoall-nodes-long-SM.ski*/
2       187.5      0.5   8       187.5      0.5   8
3      1661.1     33.4   8      1661.1     33.4   8
4      2715.9      6.9   8      2715.9      6.9   8
5    116805.2  43036.4   8    116805.2  43036.4   8
6    163177.7  41363.4   7    163177.7  41363.4   7
7    233105.5  20621.4   2    233105.5  20621.4   2
8    332049.5  83860.5   2    332049.5  83860.5   2

The same tests with MPICH 1.2.7 (sockets, no shared memory):

#/*@insyncol_MPI_Alltoall-nodes-short-SM.ski*/
2     312.5   106.5   100     312.5   106.5   100
3     546.9   136.2   100     546.9   136.2   100
4    2929.7   195.3   100    2929.7   195.3   100
5    2070.3   203.7   100    2070.3   203.7   100
6    2929.7   170.0   100    2929.7   170.0   100
7    1328.1   186.0   100    1328.1   186.0   100
8    3203.1   244.4   100    3203.1   244.4   100

#/*@insyncol_MPI_Alltoall-nodes-long-SM.ski*/
2      390.6    117.8   100      390.6    117.8   100
3     3164.1    252.6   100     3164.1    252.6   100
4     5859.4    196.3   100     5859.4    196.3   100
5    15234.4   6895.1    30    15234.4   6895.1    30
6    18136.2   5563.7    14    18136.2   5563.7    14
7    14204.5   2898.0    11    14204.5   2898.0    11
8    11718.8   1594.7     4    11718.8   1594.7     4

So, as one can see, MPICH latencies are much higher for small packets, yet things are far more consistent for large ones. Depending on the settings, Open MPI degrades at either 5 or 6 CPUs.

Konstantin
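P.S. In case someone wants to poke at the all-to-all behaviour without building SKaMPI, the pattern those tests exercise is roughly the sketch below. It is my own approximation, not SKaMPI's harness (which synchronizes the processes and repeats the measurements much more carefully); the per-destination element count is taken from the command line, e.g. 64 ints for something like the short case and 16384 for the long one. Compile with mpicc and launch with the same mpirun/MCA options as above, varying -np to see where it falls over.

/* Minimal MPI_Alltoall timing sketch (an approximation, not SKaMPI's code):
 * every rank sends "count" 4-byte ints to every other rank, and the slowest
 * rank's average time per call is reported. */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    int rank, nprocs, i, rep;
    int reps = 50;
    int count;                          /* ints per destination: 64 ~ short, 16384 ~ long */
    int *sendbuf, *recvbuf;
    double t0, t, tmax;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    count = (argc > 1) ? atoi(argv[1]) : 64;
    sendbuf = malloc((size_t)count * nprocs * sizeof(int));
    recvbuf = malloc((size_t)count * nprocs * sizeof(int));
    for (i = 0; i < count * nprocs; i++)
        sendbuf[i] = rank;

    MPI_Barrier(MPI_COMM_WORLD);
    t0 = MPI_Wtime();
    for (rep = 0; rep < reps; rep++)
        MPI_Alltoall(sendbuf, count, MPI_INT, recvbuf, count, MPI_INT,
                     MPI_COMM_WORLD);
    t = (MPI_Wtime() - t0) / reps * 1e6;    /* average microseconds per call */

    /* the slowest rank is what limits the exchange, so report its average */
    MPI_Reduce(&t, &tmax, 1, MPI_DOUBLE, MPI_MAX, 0, MPI_COMM_WORLD);
    if (rank == 0)
        printf("%d procs, %d ints per destination: %.1f us per MPI_Alltoall\n",
               nprocs, count, tmax);

    free(sendbuf);
    free(recvbuf);
    MPI_Finalize();
    return 0;
}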