1) I think OpenMPI does not use optimal algorithms for collectives, but neither does LAM. For example, MPI_Allreduce time scales as log_2 N, where N is the number of processors, whereas MPICH uses optimized collectives and its MPI_Allreduce time is essentially independent of N. Unfortunately, MPICH has never had a good TCP interface, so it's typically slower overall than LAM or OpenMPI. Are there plans to develop optimized collectives for OpenMPI, and if so, is there a timeline?
2) I have found an additional problem with OpenMPI over TCP: MPI_Allreduce can run extremely slowly on large numbers of processors. Measuring throughput (message size / time) on 48 nodes with 16KByte messages, for example, I get only 0.12MBytes/sec. The same code with LAM gets 5.3MBytes/sec, which is more reasonable. The problem seems to arise for a) more than 16 nodes and b) message sizes in the range 16-32KBytes. Normally this is the optimum size, so it's odd. Other message sizes are closer to LAM (though typically a little slower). I have run these tests with my own network test, but I can run IMB if necessary.

Tony