Hi, Simon,
The pairwise algorithm passes messages in a synchronised, ring-like fashion
with increasing stride, so it works best when independent communication
paths can be established between several ports of the network
switch/router. Some 1 Gbps Ethernet equipment is not capable of doing so,
some is - it depends (usually on the price). That said, not all algorithms
perform the same on a given type of network interconnect. For example, on
our fat-tree InfiniBand network the pairwise algorithm performs better.
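To give you an idea of the communication pattern, here is a rough sketch in
plain C/MPI of what a pairwise all-to-all-v exchange looks like. It is only
an illustration (simplified to MPI_INT data), not the actual Open MPI
implementation:

#include <mpi.h>

/* Sketch of the pairwise exchange pattern (simplified, MPI_INT only). */
void pairwise_alltoallv_sketch(int *sendbuf, int *sendcounts, int *sdispls,
                               int *recvbuf, int *recvcounts, int *rdispls,
                               MPI_Comm comm)
{
    int rank, size, step, sendto, recvfrom;
    MPI_Comm_rank(comm, &rank);
    MPI_Comm_size(comm, &size);

    for (step = 0; step < size; step++) {
        /* At step k every rank exchanges one block with the rank k away,
           so all ranks communicate in synchronised pairs with growing
           stride. */
        sendto   = (rank + step) % size;
        recvfrom = (rank - step + size) % size;

        MPI_Sendrecv(sendbuf + sdispls[sendto], sendcounts[sendto], MPI_INT,
                     sendto, 0,
                     recvbuf + rdispls[recvfrom], recvcounts[recvfrom],
                     MPI_INT, recvfrom, 0, comm, MPI_STATUS_IGNORE);
    }
}

Each step is a lock-step pairwise exchange involving all ranks at once,
which is why independent switch paths matter so much here.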
You can switch back to the basic linear algorithm by providing the
following MCA parameters:
mpiexec --mca coll_tuned_use_dynamic_rules 1 --mca coll_tuned_alltoallv_algorithm 1 ...
Algorithm 1 is the basic linear one, which used to be the default;
algorithm 2 is the pairwise one.
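(If you want to see which algorithm IDs your build knows about, something
like
ompi_info --param coll tuned | grep alltoallv
should list coll_tuned_alltoallv_algorithm together with its allowed
values - at least that is the 1.6.x syntax.)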
You can also set these values as exported environment variables:
export OMPI_MCA_coll_tuned_use_dynamic_rules=1
export OMPI_MCA_coll_tuned_alltoallv_algorithm=1
mpiexec ...
You can also put these values in $HOME/.openmpi/mca-params.conf or (to give
them global effect) in $OPAL_PREFIX/etc/openmpi-mca-params.conf:
coll_tuned_use_dynamic_rules=1
coll_tuned_alltoallv_algorithm=1
A gratuitous hint: dual-Opteron systems are NUMA machines, so it makes
sense to activate process binding with --bind-to-core if you haven't
already done so. Binding prevents MPI processes from being migrated to
other NUMA nodes while running.
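For example (purely as an illustration of where the option goes - adjust
the rest of the command line to your setup):
mpiexec --mca coll_tuned_use_dynamic_rules 1 --mca coll_tuned_alltoallv_algorithm 1 --bind-to-core ...
Adding --report-bindings as well should print where each process ends up,
so you can check that the binding actually took effect.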
Kind regards,
Hristo
--
Hristo Iliev, Ph.D. -- High Performance Computing
RWTH Aachen University, Center for Computing and Communication
Rechen- und Kommunikationszentrum der RWTH Aachen
Seffenter Weg 23, D 52074 Aachen (Germany)
-----Original Message-----
From: users-boun...@open-mpi.org [mailto:users-boun...@open-mpi.org]
On Behalf Of Number Cruncher
Sent: Thursday, November 15, 2012 5:37 PM
To: Open MPI Users
Subject: [OMPI users] MPI_Alltoallv performance regression 1.6.0 to 1.6.1
I've noticed a very significant (100%) slowdown for MPI_Alltoallv calls as
of version 1.6.1.
* This is most noticeable for high-frequency exchanges over 1 Gb Ethernet
where process-to-process message sizes are fairly small (e.g. 100 kbyte)
and much of the exchange matrix is sparse.
* The 1.6.1 release notes mention "Switch the MPI_ALLTOALLV default
algorithm to a pairwise exchange", but I'm not clear what this means or how
to switch back to the old "non-default algorithm".
I attach a test program which illustrates the sort of usage in our MPI
application. I have run this as 32 processes on four nodes, over 1 Gb
Ethernet, each node with 2x Opteron 4180 (dual hex-core); ranks 0, 4, 8,
... on node 1, ranks 1, 5, 9, ... on node 2, etc.
It constructs an array of integers and an nProcess x nProcess exchange
matrix typical of part of our application. This is then exchanged several
thousand times. Output from "mpicc -O3" runs is shown below.
My guess is that 1.6.1 is hitting additional latency not present in 1.6.0.
I also attach a plot showing network throughput on our actual mesh
generation application. Nodes cfsc01-04 are running 1.6.0 and finish within
35 minutes. Nodes cfsc05-08 are running 1.6.2 (started 10 minutes later)
and take over an hour to run. There seems to be much greater network demand
in the 1.6.1 version, despite the user code and input data being identical.
Thanks for any help you can give,
Simon