I did some tests of Open MPI version 1.0.2a4r8848. My motivation was an extreme degradation of all-to-all MPI performance on 8 CPUs (it ran like 1 CPU). At the same time, MPICH 1.2.7 on 8 CPUs runs more like it does on 4 (not like 1!). The tests were done with SKaMPI 4.1, from http://liinwww.ira.uka.de/~skampi/skampi4.1.tar.gz. The system is a bunch of dual-Opteron nodes connected by Gigabit Ethernet. The MPI operation I am most interested in is the all-to-all exchange.

First of all, there seem to be some problems with the logarithmic approach (the algorithms used once the communicator size reaches coll_basic_crossover). Here is what I mean. In the output below, the first column is the packet size, the next one is the average time (microseconds), and then comes the standard deviation. The test was run on 8 CPUs (4 dual nodes).

> mpirun -np 8 -mca mpi_paffinity_alone 1 skampi41

#/*@inp2p_MPI_Send-MPI_Iprobe_Recv.ski*/
#Description of the MPI_Send-MPI_Iprobe_Recv measurement:
  0     74.3    1.3    8     74.3    1.3    8
 16     77.4    2.1    8     77.4    2.1    8    0.0   0.0
 32    398.9  323.4  100    398.9  323.4  100    0.0   0.0
 64     80.7    2.3    9     80.7    2.3    9    0.0   0.0
 80     79.3    2.3   13     79.3    2.3   13    0.0   0.0

> mpirun -np 8 -mca mpi_paffinity_alone 1 -mca coll_basic_crossover 8 skampi41

#/*@inp2p_MPI_Send-MPI_Iprobe_Recv.ski*/
#Description of the MPI_Send-MPI_Iprobe_Recv measurement:
  0     76.7    2.1    8     76.7    2.1    8
 16     75.8    1.5    8     75.8    1.5    8    0.0   0.0
 32     74.4    0.6    8     74.4    0.6    8    0.0   0.0
 64     76.3    0.4    8     76.3    0.4    8    0.0   0.0
 80     76.7    0.5    8     76.7    0.5    8    0.0   0.0

These anomalously large times for certain packet sizes (either 16 or 32 bytes) show up in a whole set of tests unless coll_basic_crossover is raised to 8, so this is not a fluke.
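For what it is worth, my reading of the inp2p_MPI_Send-MPI_Iprobe_Recv pattern is a ping-pong where the receiving side polls with MPI_Iprobe before posting the receive. The sketch below is only my own approximation of it (not SKaMPI's code; SKaMPI also picks the repetition counts adaptively), just to make explicit what kind of operation the numbers above refer to:

/* Approximate MPI_Send-MPI_Iprobe_Recv ping-pong (my sketch, not SKaMPI's):
 * rank 0 sends n bytes and waits for the echo; rank 1 polls with MPI_Iprobe
 * until the message has arrived, then receives it and echoes it back. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    char buf[128];
    int sizes[5] = { 0, 16, 32, 64, 80 };    /* the packet sizes shown above */
    int rank, flag, i, rep, n;
    int reps = 100;
    double t0;
    MPI_Status st;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    for (i = 0; i < 5; i++) {
        n = sizes[i];
        MPI_Barrier(MPI_COMM_WORLD);
        t0 = MPI_Wtime();
        for (rep = 0; rep < reps; rep++) {
            if (rank == 0) {                 /* sender: send, then wait for the echo */
                MPI_Send(buf, n, MPI_BYTE, 1, 0, MPI_COMM_WORLD);
                MPI_Recv(buf, n, MPI_BYTE, 1, 0, MPI_COMM_WORLD, &st);
            } else if (rank == 1) {          /* receiver: poll, receive, echo back */
                flag = 0;
                while (!flag)
                    MPI_Iprobe(0, 0, MPI_COMM_WORLD, &flag, &st);
                MPI_Recv(buf, n, MPI_BYTE, 0, 0, MPI_COMM_WORLD, &st);
                MPI_Send(buf, n, MPI_BYTE, 0, 0, MPI_COMM_WORLD);
            }
        }
        if (rank == 0)
            printf("%3d bytes: %.1f us per round trip\n",
                   n, (MPI_Wtime() - t0) / reps * 1e6);
    }
    MPI_Finalize();
    return 0;
}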
Next, the all-to-all results. The short test used 64 x 4-byte messages; the long one used 16384 x 4-byte messages. In these tables the first column is the number of CPUs involved.

> mpirun -np 8 -mca mpi_paffinity_alone 1 -mca coll_basic_crossover 8 skampi41

#/*@insyncol_MPI_Alltoall-nodes-short-SM.ski*/
2      12.7   0.2   8     12.7   0.2   8
3      56.1   0.3   8     56.1   0.3   8
4      69.9   1.8   8     69.9   1.8   8
5      87.0   2.2   8     87.0   2.2   8
6      99.7   1.5   8     99.7   1.5   8
7     122.5   2.2   8    122.5   2.2   8
8     147.5   2.5   8    147.5   2.5   8

#/*@insyncol_MPI_Alltoall-nodes-long-SM.ski*/
2       188.5      0.3   8       188.5      0.3   8
3      1680.5     16.6   8      1680.5     16.6   8
4      2759.0     15.5   8      2759.0     15.5   8
5      4110.2     34.0   8      4110.2     34.0   8
6     75443.5  44383.9   6     75443.5  44383.9   6
7    242133.4    870.5   2    242133.4    870.5   2
8    252436.7   4016.8   8    252436.7   4016.8   8

> mpirun -np 8 -mca mpi_paffinity_alone 1 -mca coll_basic_crossover 8 \
    -mca coll_sm_info_num_procs 8 -mca btl_tcp_sndbuf 8388608 \
    -mca btl_tcp_rcvbuf 8388608 skampi41

#/*@insyncol_MPI_Alltoall-nodes-short-SM.ski*/
2      13.1   0.1   8     13.1   0.1   8
3      57.4   0.3   8     57.4   0.3   8
4      73.7   1.6   8     73.7   1.6   8
5      87.1   2.0   8     87.1   2.0   8
6     103.7   2.0   8    103.7   2.0   8
7     118.3   2.4   8    118.3   2.4   8
8     146.7   3.1   8    146.7   3.1   8

#/*@insyncol_MPI_Alltoall-nodes-long-SM.ski*/
2       185.8       0.6   8       185.8       0.6   8
3      1760.4      17.3   8      1760.4      17.3   8
4      2916.8      52.1   8      2916.8      52.1   8
5    106993.4  102562.4   2    106993.4  102562.4   2
6    260723.1    6679.1   2    260723.1    6679.1   2
7    240225.2    6369.8   6    240225.2    6369.8   6
8    250848.1    4863.2   6    250848.1    4863.2   6

> mpirun -np 8 -mca mpi_paffinity_alone 1 -mca coll_basic_crossover 8 \
    -mca coll_sm_info_num_procs 8 -mca btl_tcp_sndbuf 8388608 \
    -mca btl_tcp_rcvbuf 8388608 -mca btl_tcp_min_send_size 32768 \
    -mca btl_tcp_max_send_size 65536 skampi41

#/*@insyncol_MPI_Alltoall-nodes-short-SM.ski*/
2      13.5   0.2   8     13.5   0.2   8
3      57.3   1.8   8     57.3   1.8   8
4      68.8   0.5   8     68.8   0.5   8
5      83.2   0.6   8     83.2   0.6   8
6     102.9   1.8   8    102.9   1.8   8
7     117.4   2.3   8    117.4   2.3   8
8     149.3   2.1   8    149.3   2.1   8

#/*@insyncol_MPI_Alltoall-nodes-long-SM.ski*/
2       187.5      0.5   8       187.5      0.5   8
3      1661.1     33.4   8      1661.1     33.4   8
4      2715.9      6.9   8      2715.9      6.9   8
5    116805.2  43036.4   8    116805.2  43036.4   8
6    163177.7  41363.4   7    163177.7  41363.4   7
7    233105.5  20621.4   2    233105.5  20621.4   2
8    332049.5  83860.5   2    332049.5  83860.5   2

The same tests with MPICH 1.2.7 (sockets, no shared memory):

#/*@insyncol_MPI_Alltoall-nodes-short-SM.ski*/
2     312.5   106.5   100     312.5   106.5   100
3     546.9   136.2   100     546.9   136.2   100
4    2929.7   195.3   100    2929.7   195.3   100
5    2070.3   203.7   100    2070.3   203.7   100
6    2929.7   170.0   100    2929.7   170.0   100
7    1328.1   186.0   100    1328.1   186.0   100
8    3203.1   244.4   100    3203.1   244.4   100

#/*@insyncol_MPI_Alltoall-nodes-long-SM.ski*/
2      390.6    117.8   100      390.6    117.8   100
3     3164.1    252.6   100     3164.1    252.6   100
4     5859.4    196.3   100     5859.4    196.3   100
5    15234.4   6895.1    30    15234.4   6895.1    30
6    18136.2   5563.7    14    18136.2   5563.7    14
7    14204.5   2898.0    11    14204.5   2898.0    11
8    11718.8   1594.7     4    11718.8   1594.7     4

So, as one can see, MPICH latencies are much higher for small packets, yet things are far more consistent for large ones. Depending on the settings, Open MPI degrades at either 5 or 6 CPUs.

Konstantin
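P.S. In case someone wants to poke at the all-to-all behaviour without building SKaMPI, the pattern those tests exercise is roughly the sketch below. It is my own approximation, not SKaMPI's harness (which synchronizes the processes and repeats the measurements much more carefully); the per-destination element count is taken from the command line, e.g. 64 ints for something like the short case and 16384 for the long one. Compile with mpicc and launch with the same mpirun/MCA options as above, varying -np to see where it falls over.

/* Minimal MPI_Alltoall timing sketch (an approximation, not SKaMPI's code):
 * every rank sends "count" 4-byte ints to every other rank, and the slowest
 * rank's average time per call is reported. */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    int rank, nprocs, i, rep;
    int reps = 50;
    int count;                          /* ints per destination: 64 ~ short, 16384 ~ long */
    int *sendbuf, *recvbuf;
    double t0, t, tmax;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    count = (argc > 1) ? atoi(argv[1]) : 64;
    sendbuf = malloc((size_t)count * nprocs * sizeof(int));
    recvbuf = malloc((size_t)count * nprocs * sizeof(int));
    for (i = 0; i < count * nprocs; i++)
        sendbuf[i] = rank;

    MPI_Barrier(MPI_COMM_WORLD);
    t0 = MPI_Wtime();
    for (rep = 0; rep < reps; rep++)
        MPI_Alltoall(sendbuf, count, MPI_INT, recvbuf, count, MPI_INT,
                     MPI_COMM_WORLD);
    t = (MPI_Wtime() - t0) / reps * 1e6;    /* average microseconds per call */

    /* the slowest rank is what limits the exchange, so report its average */
    MPI_Reduce(&t, &tmax, 1, MPI_DOUBLE, MPI_MAX, 0, MPI_COMM_WORLD);
    if (rank == 0)
        printf("%d procs, %d ints per destination: %.1f us per MPI_Alltoall\n",
               nprocs, count, tmax);

    free(sendbuf);
    free(recvbuf);
    MPI_Finalize();
    return 0;
}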