Hi Simon,

The pairwise algorithm passes messages in a synchronised ring-like fashion with increasing stride, so it works best when independent communication paths can be established between several ports of the network switch/router. Some 1 Gbps Ethernet equipment is capable of that and some is not - it usually depends on the price. That said, not all algorithms perform the same on a given type of network interconnect; on our fat-tree InfiniBand network, for example, the pairwise algorithm performs better.
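To give you an idea of what the pairwise exchange does, here is a rough sketch (simplified C, not the actual Open MPI implementation; it assumes MPI_INT data as in your test program, ignores error handling, and the function name is just illustrative):

  #include <mpi.h>

  /* Rough sketch of a pairwise all-to-all(v) exchange.  Every rank walks
   * through size-1 steps; in step s it sends to (rank + s) and receives
   * from (rank - s), so the whole communicator moves through the schedule
   * more or less in lock-step. */
  void pairwise_alltoallv_sketch(const int *sendbuf, const int *sendcounts,
                                 const int *sdispls, int *recvbuf,
                                 const int *recvcounts, const int *rdispls,
                                 MPI_Comm comm)
  {
      int rank, size;
      MPI_Comm_rank(comm, &rank);
      MPI_Comm_size(comm, &size);

      for (int step = 1; step < size; step++) {
          int sendto   = (rank + step) % size;
          int recvfrom = (rank - step + size) % size;

          /* One partner per step; a step finishes only when both sides of
           * each pair have exchanged their (possibly empty) pieces. */
          MPI_Sendrecv(sendbuf + sdispls[sendto], sendcounts[sendto], MPI_INT,
                       sendto, 0,
                       recvbuf + rdispls[recvfrom], recvcounts[recvfrom], MPI_INT,
                       recvfrom, 0, comm, MPI_STATUS_IGNORE);
      }
      /* The local piece (rank -> rank) would simply be copied directly. */
  }

The basic linear algorithm, by contrast, posts non-blocking receives and sends to all peers at once and waits for them all to complete, so it imposes no such step-by-step schedule.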
You can switch back to the basic linear algorithm by providing the following MCA parameters:

mpiexec --mca coll_tuned_use_dynamic_rules 1 --mca coll_tuned_alltoallv_algorithm 1 ...

Algorithm 1 is the basic linear one, which used to be the default; algorithm 2 is the pairwise one.

You can also set these values as exported environment variables:

export OMPI_MCA_coll_tuned_use_dynamic_rules=1
export OMPI_MCA_coll_tuned_alltoallv_algorithm=1
mpiexec ...

You can also put this in $HOME/.openmpi/mca-params.conf or (to give it global effect) in $OPAL_PREFIX/etc/openmpi-mca-params.conf:

coll_tuned_use_dynamic_rules=1
coll_tuned_alltoallv_algorithm=1

A gratuitous hint: dual-Opteron systems are NUMA machines, so it makes sense to activate process binding with --bind-to-core if you haven't already done so. It prevents MPI processes from being migrated to other NUMA nodes while running.

Kind regards,
Hristo

--
Hristo Iliev, Ph.D. -- High Performance Computing
RWTH Aachen University, Center for Computing and Communication
Rechen- und Kommunikationszentrum der RWTH Aachen
Seffenter Weg 23, D 52074 Aachen (Germany)

> -----Original Message-----
> From: users-boun...@open-mpi.org [mailto:users-boun...@open-mpi.org]
> On Behalf Of Number Cruncher
> Sent: Thursday, November 15, 2012 5:37 PM
> To: Open MPI Users
> Subject: [OMPI users] MPI_Alltoallv performance regression 1.6.0 to 1.6.1
>
> I've noticed a very significant (100%) slowdown for MPI_Alltoallv calls as of
> version 1.6.1.
> * This is most noticeable for high-frequency exchanges over 1Gb Ethernet,
>   where process-to-process message sizes are fairly small (e.g. 100 kbyte)
>   and much of the exchange matrix is sparse.
> * The 1.6.1 release notes mention "Switch the MPI_ALLTOALLV default
>   algorithm to a pairwise exchange", but I'm not clear what this means or
>   how to switch back to the old "non-default algorithm".
>
> I attach a test program which illustrates the sort of usage in our MPI
> application. I have run this as 32 processes on four nodes over 1Gb
> Ethernet, each node with 2x Opteron 4180 (dual hex-core); ranks 0, 4, 8, ...
> on node 1, ranks 1, 5, 9, ... on node 2, etc.
>
> It constructs an array of integers and an nProcess x nProcess exchange
> matrix typical of part of our application. This is then exchanged several
> thousand times. Output from "mpicc -O3" runs is shown below.
>
> My guess is that 1.6.1 is hitting additional latency not present in 1.6.0.
> I also attach a plot showing network throughput on our actual mesh
> generation application. Nodes cfsc01-04 are running 1.6.0 and finish within
> 35 minutes. Nodes cfsc05-08 are running 1.6.2 (started 10 minutes later)
> and take over an hour to run. There seems to be a much greater network
> demand in the 1.6.1 version, despite the user code and input data being
> identical.
> Thanks for any help you can give,
> Simon
>
> For 1.6.0:
>
> Open MPI 1.6.0
> [32 x 32 sparse exchange matrix printed here; per-process totals run from
> 198 x 100 int for rank 0 down to 0 for the middle ranks and back up to
> 198 x 100 int for rank 31, identical in both runs]
> ....................................................................
> Total time = 15.443502 seconds
>
> For 1.6.1:
>
> Open MPI 1.6.1
> [same exchange matrix as above]
> ....................................................................
> Total time = 25.549821 seconds