Hello, I am desperately trying to get better all-to-all performance on Gbit Ethernet (flow control is enabled). I have been playing around with several all-to-all schemes and have been able to reduce congestion by communicating in an ordered fashion.
E.g. the simplest scheme looks like this:

    for (i = 0; i < ncpu; i++)
    {
        /* send to dest */
        dest = (cpuid + i) % ncpu;
        /* receive from source */
        source = (ncpu + cpuid - i) % ncpu;

        MPI_Sendrecv(sendbuf + dest*sendcount,   sendcount, sendtype, dest,   0,
                     recvbuf + source*recvcount, recvcount, recvtype, source, 0,
                     comm, &status);
    }

For sendcount=32768 and sendtype=float (yields 131072 bytes) the time such an all-to-all takes is (average over 100 runs, std deviation in parentheses):

    SENDRECV ALLTOALL on 16 PROCS: 32768 floats took 0.036783 (0.008798) seconds. Min: 0.034175  Max: 0.123684
    SENDRECV ALLTOALL on 32 PROCS: 32768 floats took 0.082687 (0.035920) seconds. Min: 0.071915  Max: 0.285299

For comparison:

    MPI_Alltoall on 16 PROCS:      32768 floats took 0.057936 (0.073605) seconds. Min: 0.027218  Max: 0.275988
    MPI_Alltoall on 32 PROCS:      32768 floats took 0.137835 (0.100580) seconds. Min: 0.055607  Max: 0.412144

The sendrecv all-to-all performs better for these message sizes, but on 32 CPUs (on 32 nodes) there is still congestion. When I try to separate the communication phases by putting an MPI_Barrier(MPI_COMM_WORLD) after the sendrecv, this makes the congestion problem even worse:

    SENDRECV ALLTOALL on 32 PROCS, with barrier: 32768 floats took 0.179162 (0.136885) seconds. Min: 0.091028  Max: 0.729049

How can a barrier lead to more congestion???

Thanks in advance for helpful comments,
  Carsten

---------------------------------------------------
Dr. Carsten Kutzner
Max Planck Institute for Biophysical Chemistry
Theoretical and Computational Biophysics Department
Am Fassberg 11
37077 Goettingen, Germany
Tel. +49-551-2012313, Fax: +49-551-2012302
eMail ckut...@gwdg.de
http://www.gwdg.de/~ckutzne
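PS: In case it helps to see the exact setup, here is a stripped-down, self-contained sketch of the benchmark. It is only a sketch: the real code also records the standard deviation, minimum and maximum over the runs, and the names NFLOATS, NRUNS and USE_BARRIER (as well as the extra barrier that gives all ranks a common start for each timed run) are just illustrative.

    /* Sketch of the ordered sendrecv all-to-all benchmark.
     * Set USE_BARRIER to 1 for the variant with one
     * MPI_Barrier(MPI_COMM_WORLD) per communication phase. */
    #include <stdio.h>
    #include <stdlib.h>
    #include <mpi.h>

    #define NFLOATS     32768   /* floats sent to each rank        */
    #define NRUNS       100     /* repetitions for the average     */
    #define USE_BARRIER 0       /* 1 = barrier after each sendrecv */

    int main(int argc, char *argv[])
    {
        int         ncpu, cpuid, i, run;
        float      *sendbuf, *recvbuf;
        double      t0, tsum = 0.0;
        MPI_Status  status;
        MPI_Comm    comm = MPI_COMM_WORLD;

        MPI_Init(&argc, &argv);
        MPI_Comm_size(comm, &ncpu);
        MPI_Comm_rank(comm, &cpuid);

        sendbuf = malloc(ncpu * NFLOATS * sizeof(float));
        recvbuf = malloc(ncpu * NFLOATS * sizeof(float));

        for (run = 0; run < NRUNS; run++)
        {
            MPI_Barrier(comm);               /* common start for the timing */
            t0 = MPI_Wtime();

            for (i = 0; i < ncpu; i++)
            {
                int dest   = (cpuid + i) % ncpu;
                int source = (ncpu + cpuid - i) % ncpu;

                MPI_Sendrecv(sendbuf + dest*NFLOATS,   NFLOATS, MPI_FLOAT, dest,   0,
                             recvbuf + source*NFLOATS, NFLOATS, MPI_FLOAT, source, 0,
                             comm, &status);
    #if USE_BARRIER
                MPI_Barrier(MPI_COMM_WORLD); /* separate the communication phases */
    #endif
            }
            tsum += MPI_Wtime() - t0;
        }

        if (cpuid == 0)
            printf("%d floats took %f seconds (average of %d runs on %d procs)\n",
                   NFLOATS, tsum / NRUNS, NRUNS, ncpu);

        free(sendbuf);
        free(recvbuf);
        MPI_Finalize();
        return 0;
    }

With USE_BARRIER set to 1 this corresponds to the "with barrier" runs above, i.e. one MPI_Barrier after every MPI_Sendrecv, so the communication phases should not overlap.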