Hi Simon,

The pairwise algorithm passes messages in a synchronised ring-like fashion with increasing stride, so it works best when independent communication paths can be established between several ports of the network switch/router. Some 1 Gbps Ethernet equipment is capable of that and some is not - it usually depends on the price. That said, not all algorithms perform the same on a given type of network interconnect; on our fat-tree InfiniBand network, for example, the pairwise algorithm performs better.
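To give you an idea of what the pairwise exchange does, here is a rough sketch (simplified C, not the actual Open MPI implementation; it assumes MPI_INT data as in your test program, ignores error handling, and the function name is just illustrative):

  #include <mpi.h>

  /* Rough sketch of a pairwise all-to-all(v) exchange.  Every rank walks
   * through size-1 steps; in step s it sends to (rank + s) and receives
   * from (rank - s), so the whole communicator moves through the schedule
   * more or less in lock-step. */
  void pairwise_alltoallv_sketch(const int *sendbuf, const int *sendcounts,
                                 const int *sdispls, int *recvbuf,
                                 const int *recvcounts, const int *rdispls,
                                 MPI_Comm comm)
  {
      int rank, size;
      MPI_Comm_rank(comm, &rank);
      MPI_Comm_size(comm, &size);

      for (int step = 1; step < size; step++) {
          int sendto   = (rank + step) % size;
          int recvfrom = (rank - step + size) % size;

          /* One partner per step; a step finishes only when both sides of
           * each pair have exchanged their (possibly empty) pieces. */
          MPI_Sendrecv(sendbuf + sdispls[sendto], sendcounts[sendto], MPI_INT,
                       sendto, 0,
                       recvbuf + rdispls[recvfrom], recvcounts[recvfrom], MPI_INT,
                       recvfrom, 0, comm, MPI_STATUS_IGNORE);
      }
      /* The local piece (rank -> rank) would simply be copied directly. */
  }

The basic linear algorithm, by contrast, posts non-blocking receives and sends to all peers at once and waits for them all to complete, so it imposes no such step-by-step schedule.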
You can switch back to the basic linear algorithm by providing the following MCA parameters:

mpiexec --mca coll_tuned_use_dynamic_rules 1 --mca coll_tuned_alltoallv_algorithm 1 ...

Algorithm 1 is the basic linear one, which used to be the default; algorithm 2 is the pairwise one.

You can also set these values as exported environment variables:

export OMPI_MCA_coll_tuned_use_dynamic_rules=1
export OMPI_MCA_coll_tuned_alltoallv_algorithm=1
mpiexec ...

You can also put this in $HOME/.openmpi/mca-params.conf or (to give it global effect) in $OPAL_PREFIX/etc/openmpi-mca-params.conf:

coll_tuned_use_dynamic_rules=1
coll_tuned_alltoallv_algorithm=1

A gratuitous hint: dual-Opteron systems are NUMA machines, so it makes sense to activate process binding with --bind-to-core if you haven't already done so. It prevents MPI processes from being migrated to other NUMA nodes while running.

Kind regards,
Hristo

--
Hristo Iliev, Ph.D. -- High Performance Computing
RWTH Aachen University, Center for Computing and Communication
Rechen- und Kommunikationszentrum der RWTH Aachen
Seffenter Weg 23, D 52074 Aachen (Germany)

> -----Original Message-----
> From: users-boun...@open-mpi.org [mailto:users-boun...@open-mpi.org]
> On Behalf Of Number Cruncher
> Sent: Thursday, November 15, 2012 5:37 PM
> To: Open MPI Users
> Subject: [OMPI users] MPI_Alltoallv performance regression 1.6.0 to 1.6.1
>
> I've noticed a very significant (100%) slowdown for MPI_Alltoallv calls as of
> version 1.6.1.
> * This is most noticeable for high-frequency exchanges over 1Gb Ethernet,
>   where process-to-process message sizes are fairly small (e.g. 100 kbyte)
>   and much of the exchange matrix is sparse.
> * The 1.6.1 release notes mention "Switch the MPI_ALLTOALLV default
>   algorithm to a pairwise exchange", but I'm not clear what this means or
>   how to switch back to the old "non-default algorithm".
>
> I attach a test program which illustrates the sort of usage in our MPI
> application. I have run this as 32 processes on four nodes over 1Gb
> Ethernet, each node with 2x Opteron 4180 (dual hex-core); ranks 0, 4, 8, ...
> on node 1, ranks 1, 5, 9, ... on node 2, etc.
>
> It constructs an array of integers and an nProcess x nProcess exchange
> matrix typical of part of our application. This is then exchanged several
> thousand times. Output from "mpicc -O3" runs is shown below.
>
> My guess is that 1.6.1 is hitting additional latency not present in 1.6.0.
> I also attach a plot showing network throughput on our actual mesh
> generation application. Nodes cfsc01-04 are running 1.6.0 and finish within
> 35 minutes. Nodes cfsc05-08 are running 1.6.2 (started 10 minutes later)
> and take over an hour to run. There seems to be a much greater network
> demand in the 1.6.1 version, despite the user code and input data being
> identical.
> Thanks for any help you can give,
> Simon
>
> For 1.6.0:
>
> Open MPI 1.6.0
> [32 x 32 sparse exchange matrix printed here; per-process totals run from
> 198 x 100 int for rank 0 down to 0 for the middle ranks and back up to
> 198 x 100 int for rank 31, identical in both runs]
> ....................................................................
> Total time = 15.443502 seconds
>
> For 1.6.1:
>
> Open MPI 1.6.1
> [same exchange matrix as above]
> ....................................................................
> Total time = 25.549821 seconds