On 19/12/12 11:08, Paul Kapinos wrote:
Did you *really* want to dig into the code just in order to switch a default communication algorithm?

No, I didn't want to, but with a huge change in performance, I'm forced to do something! And having looked at the different algorithms, I think there's a problem with the new default whenever message sizes are small enough that connection latency will dominate. We're not all running InfiniBand, and having to wait for each pairwise exchange to complete before initiating another seems wrong if the latency in waiting for completion dominates the transmission time.

E.g. if I have 10 small pairwise exchanges to perform, isn't it better to put all 10 outbound messages on the wire and then wait for the 10 matching inbound messages, in any order? The new algorithm must wait for the first exchange to complete, then the second, then the third. And unlike before, doesn't it have to wait even for the matching *zero-sized* requests to complete? I don't see why this temporal ordering matters.
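
For what it's worth, the pattern I have in mind looks something like the sketch below; the peer list, buffers and counts are just placeholders, not anything from the Open MPI source:

/* Sketch only: put every small exchange on the wire at once and complete
   them in any order.  The peer list, buffers and counts are placeholders. */
#include <mpi.h>

void exchange_all_at_once(int npeers, const int *peers,
                          char *const *sendbufs, const int *sendcounts,
                          char *const *recvbufs, const int *recvcounts,
                          MPI_Comm comm)
{
    MPI_Request reqs[2 * npeers];          /* C99 variable-length array */
    int nreq = 0;

    for (int i = 0; i < npeers; ++i) {
        MPI_Irecv(recvbufs[i], recvcounts[i], MPI_BYTE, peers[i], 0, comm,
                  &reqs[nreq++]);
        MPI_Isend(sendbufs[i], sendcounts[i], MPI_BYTE, peers[i], 0, comm,
                  &reqs[nreq++]);
    }
    /* One wait for everything: completions can arrive in any order, so a
       slow peer doesn't serialise the other exchanges. */
    MPI_Waitall(nreq, reqs, MPI_STATUSES_IGNORE);
}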

Thanks for your help,
Simon





Note there are several ways to set the parameters; --mca on the command line is just one of them (handy for quick tests).

http://www.open-mpi.org/faq/?category=tuning#setting-mca-params

We 'tune' our Open MPI by setting environment variables....

Best
Paul Kapinos



On 12/19/12 11:44, Number Cruncher wrote:
Having run some more benchmarks, the new default is *really* bad for our
application (2-10x slower), so I've been looking at the source to try and figure
out why.

It seems that the biggest difference will occur when the all-to-all is actually sparse (e.g. our application). If most N-to-M process exchanges are zero-sized, the 1.6.0 ompi_coll_tuned_alltoallv_intra_basic_linear algorithm only posts irecv/isend for the non-zero exchanges; any zero-sized exchanges are skipped, and it then waits once for all the requests to complete. In contrast, the new ompi_coll_tuned_alltoallv_intra_pairwise posts the exchange for *every* N-M pair, zero-sized or not, and waits for each pairwise exchange in turn. This is O(comm_size) waits, many of them for zero-sized messages. I'm not clear what optimizations exist for zero-size isend/irecv, but surely there's a great deal more latency if each pairwise exchange has to be confirmed complete before the next one is started?
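
To make the difference concrete, this is roughly how I read the two structures. It's a deliberately simplified sketch, not the actual Open MPI source; I've used MPI_BYTE with byte displacements to keep the pointer arithmetic simple, and the function names are my own:

#include <mpi.h>
#include <stdlib.h>

/* 1.6.0-style basic linear, as I read it: only non-zero exchanges are
   posted, and there is a single wait for all outstanding requests. */
void alltoallv_linear_sketch(const char *sbuf, const int *scounts, const int *sdispls,
                             char *rbuf, const int *rcounts, const int *rdispls,
                             MPI_Comm comm)
{
    int size, nreq = 0;
    MPI_Comm_size(comm, &size);
    MPI_Request *reqs = malloc(2 * (size_t)size * sizeof(*reqs));

    for (int peer = 0; peer < size; ++peer)
        if (rcounts[peer] > 0)
            MPI_Irecv(rbuf + rdispls[peer], rcounts[peer], MPI_BYTE,
                      peer, 0, comm, &reqs[nreq++]);
    for (int peer = 0; peer < size; ++peer)
        if (scounts[peer] > 0)
            MPI_Isend((char *)sbuf + sdispls[peer], scounts[peer], MPI_BYTE,
                      peer, 0, comm, &reqs[nreq++]);

    MPI_Waitall(nreq, reqs, MPI_STATUSES_IGNORE);  /* one wait, any completion order */
    free(reqs);
}

/* 1.6.1-style pairwise, as I read it: one partner per step, and each step
   must complete (even when both counts are zero) before the next starts. */
void alltoallv_pairwise_sketch(const char *sbuf, const int *scounts, const int *sdispls,
                               char *rbuf, const int *rcounts, const int *rdispls,
                               MPI_Comm comm)
{
    int rank, size;
    MPI_Comm_rank(comm, &rank);
    MPI_Comm_size(comm, &size);

    for (int step = 0; step < size; ++step) {
        int sendto   = (rank + step) % size;
        int recvfrom = (rank - step + size) % size;
        MPI_Sendrecv((char *)sbuf + sdispls[sendto], scounts[sendto], MPI_BYTE, sendto, 0,
                     rbuf + rdispls[recvfrom], rcounts[recvfrom], MPI_BYTE, recvfrom, 0,
                     comm, MPI_STATUS_IGNORE);     /* comm_size serialised waits */
    }
}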

Relatedly, how would I direct Open MPI to use the older algorithm programmatically? I don't want the user to have to pass "--mca" options to their "mpiexec". Is there a C API?

Thanks,
Simon


On 16/11/12 10:15, Iliev, Hristo wrote:
Hi, Simon,

The pairwise algorithm passes messages in a synchronised ring-like fashion with increasing stride, so it works best when independent communication paths can be established between several ports of the network switch/router. Some 1 Gbps Ethernet equipment is not capable of doing so, some is; it depends (usually on the price). That said, not all algorithms perform the same on a given type of network interconnect. For example, on our fat-tree InfiniBand network the pairwise algorithm performs better.
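
To illustrate the stride pattern: as far as I recall the schedule, at step s each rank sends to (rank + s) mod P and receives from (rank - s) mod P. With P = 4 processes that gives:

step 1: 0->1, 1->2, 2->3, 3->0
step 2: 0->2, 1->3, 2->0, 3->1
step 3: 0->3, 1->0, 2->1, 3->2

Within each step the point-to-point paths are all distinct, which is why a switch that can carry them concurrently makes such a difference.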

You can switch back to the basic linear algorithm by providing the following
MCA parameters:

mpiexec --mca coll_tuned_use_dynamic_rules 1 --mca coll_tuned_alltoallv_algorithm 1 ...

Algorithm 1 is the basic linear, which used to be the default. Algorithm 2
is the pairwise one.
You can also set these values as exported environment variables:

export OMPI_MCA_coll_tuned_use_dynamic_rules=1
export OMPI_MCA_coll_tuned_alltoallv_algorithm=1
mpiexec ...

You can also put this in $HOME/.openmpi/mca-params.conf or (to make it take effect globally) in $OPAL_PREFIX/etc/openmpi-mca-params.conf:

coll_tuned_use_dynamic_rules=1
coll_tuned_alltoallv_algorithm=1

A gratuitous hint: dual-Opteron systems are NUMA machines, so it makes sense to activate process binding with --bind-to-core if you haven't already done so. It prevents MPI processes from being migrated between NUMA nodes while running.
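
For example (the executable name and process count are placeholders):

mpiexec --bind-to-core -np 32 ./your_app ...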

Kind regards,
Hristo
--
Hristo Iliev, Ph.D. -- High Performance Computing
RWTH Aachen University, Center for Computing and Communication
Rechen- und Kommunikationszentrum der RWTH Aachen
Seffenter Weg 23, D 52074 Aachen (Germany)


-----Original Message-----
From: users-boun...@open-mpi.org [mailto:users-boun...@open-mpi.org]
On Behalf Of Number Cruncher
Sent: Thursday, November 15, 2012 5:37 PM
To: Open MPI Users
Subject: [OMPI users] MPI_Alltoallv performance regression 1.6.0 to 1.6.1

I've noticed a very significant (100%) slowdown for MPI_Alltoallv calls as of version 1.6.1.
* This is most noticeable for high-frequency exchanges over 1 Gb Ethernet where process-to-process message sizes are fairly small (e.g. 100 kbyte) and much of the exchange matrix is sparse.
* The 1.6.1 release notes mention "Switch the MPI_ALLTOALLV default algorithm to a pairwise exchange", but I'm not clear what this means or how to switch back to the old (now non-default) algorithm.

I attach a test program which illustrates the sort of usage in our MPI application. I have run this as 32 processes on four nodes, over 1 Gb Ethernet, each node with 2x Opteron 4180 (dual hex-core); ranks 0, 4, 8, ... on node 1, ranks 1, 5, 9, ... on node 2, etc.

It constructs an array of integers and an nProcess x nProcess exchange matrix typical of part of our application. This is then exchanged several thousand times. Output from "mpicc -O3" runs is shown below.

My guess is that 1.6.1 is hitting additional latency not present in 1.6.0. I also attach a plot showing network throughput on our actual mesh generation application. Nodes cfsc01-04 are running 1.6.0 and finish within 35 minutes. Nodes cfsc05-08 are running 1.6.2 (started 10 minutes later) and take over an hour to run. There seems to be a much greater network demand with the newer version, despite the user code and input data being identical.

Thanks for any help you can give,
Simon

