1) I think OpenMPI does not use optimal algorithms for collectives, but
neither does LAM. For example, the time for MPI_Allreduce scales as log_2 N,
where N is the number of processors. MPICH uses optimized collectives, and its
MPI_Allreduce time is essentially independent of N. Unfortunately, MPICH has
never had a good TCP interface, so it's typically slower overall than LAM or
OpenMPI. Are there plans to develop optimized collectives for OpenMPI? If
so, is there a timeline?
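
For illustration only: one common way an allreduce ends up with log_2 N
behaviour is recursive doubling, where each rank exchanges partial results
with a partner at distance 1, 2, 4, ... until every rank holds the full
result. The sketch below simulates that pattern in a single process. It is
an assumption made for the example, not a statement of what OpenMPI or LAM
actually implement, and simulate_allreduce_sum and MAX_RANKS are
hypothetical names.

```c
/* Hypothetical upper bound on simulated ranks, chosen for this sketch. */
#define MAX_RANKS 1024

/*
 * Simulate a recursive-doubling allreduce (sum) over n ranks in one process.
 * values[r] is the contribution of rank r on entry and the reduced result
 * on exit. n must be a power of two and no larger than MAX_RANKS.
 * Returns the number of exchange rounds, or -1 on invalid input.
 */
int simulate_allreduce_sum(double *values, int n)
{
    double next[MAX_RANKS];
    int rounds = 0;

    if (n <= 0 || n > MAX_RANKS || (n & (n - 1)) != 0)
        return -1;

    /* In each round, every rank pairs with the rank whose index differs
     * in exactly one bit, and both keep the sum of the two values. */
    for (int dist = 1; dist < n; dist <<= 1) {
        for (int rank = 0; rank < n; rank++) {
            int partner = rank ^ dist;
            next[rank] = values[rank] + values[partner];
        }
        for (int rank = 0; rank < n; rank++)
            values[rank] = next[rank];
        rounds++;
    }
    return rounds;
}
```

Called with 16 simulated ranks, this takes 4 rounds, and doubling the rank
count adds one more round. If each round costs roughly the same time, total
time grows as log_2 N.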

2) I have found an additional problem in OpenMPI over TCP. MPI_Allreduce can
run extremely slowly on large numbers of processors. Measuring throughput
(message size / time) for 48 nodes with 16 KByte messages, for example, I get
only 0.12 MBytes/sec. The same code with LAM gets 5.3 MBytes/sec, which is more
reasonable. The problem seems to arise for a) more than 16 nodes and b)
message sizes in the range 16-32 KBytes. Normally this is the optimum size, so
it's odd. Other message sizes are closer to LAM (though typically a little
slower). I have run these tests with my own network test, but I can run IMB
if necessary.
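
To put the throughput figures above in terms of time per call, here is a
small sketch of the arithmetic, using the "message size / time" definition
given in the text. It assumes 16 KBytes = 16384 bytes and 1 MByte = 10^6
bytes, since the units convention is not stated. The function names are
hypothetical.

```c
/* Assumed convention for this sketch: 1 MByte = 1e6 bytes. */
#define BYTES_PER_MBYTE 1.0e6

/* Throughput in MBytes/sec, defined as message size divided by time.
 * Returns 0.0 if the time is not positive. */
double throughput_mbytes_per_sec(double message_bytes, double seconds)
{
    if (seconds <= 0.0)
        return 0.0;
    return message_bytes / seconds / BYTES_PER_MBYTE;
}

/* Inverse of the above: time per call implied by a measured throughput.
 * Returns 0.0 if the throughput is not positive. */
double seconds_per_message(double message_bytes, double mbytes_per_sec)
{
    if (mbytes_per_sec <= 0.0)
        return 0.0;
    return message_bytes / (mbytes_per_sec * BYTES_PER_MBYTE);
}
```

Under those assumptions, seconds_per_message(16384.0, 0.12) comes to roughly
137 ms per MPI_Allreduce, against roughly 3 ms for
seconds_per_message(16384.0, 5.3), a factor of about 44.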

Tony

