I have no definitive argument. In general, I have to admit that we envision the worst-case scenario and try to come up with a solution that solves it, when possible with minimal overhead for the other cases. Keeping every process in sync sounded like a good approach for minimizing the burden of unexpected messages. However, this will [again] depend on the communication pattern exhibited by the alltoallv operation.
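Any pattern-aware choice runs into the problem George describes in the quoted exchange below: each rank sees only its own row of the exchange matrix. Purely as an illustration (this is not Open MPI code; the helper name and the 25% threshold are invented), here is the kind of extra collective that would be needed for all ranks to agree that a given alltoallv is sparse:

    /* Hypothetical helper -- not Open MPI code.  Each rank only knows its
     * own row of the exchange matrix (its sendcounts), so agreeing that a
     * given alltoallv is sparse costs an extra collective, which is
     * precisely the overhead the library avoids by deciding on the
     * communicator size alone.  The 25% threshold is arbitrary. */
    #include <mpi.h>

    static int alltoallv_is_sparse(const int *sendcounts, MPI_Comm comm)
    {
        int size, nonzero = 0, total_nonzero;
        MPI_Comm_size(comm, &size);
        for (int i = 0; i < size; i++)
            if (sendcounts[i] > 0)
                nonzero++;
        /* One extra reduction just to learn the global pattern. */
        MPI_Allreduce(&nonzero, &total_nonzero, 1, MPI_INT, MPI_SUM, comm);
        return total_nonzero * 4 < size * size;   /* sparse if < 25% of pairs */
    }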
  George.

On Dec 22, 2012, at 12:47, Number Cruncher <number.crunc...@ntlworld.com> wrote:

> Thanks for the informative response. What I'm still not clear about is whether there isn't a very simple optimization for the zero-size case. If two processes know they aren't exchanging *any* data (which is known from the argument list of all_to_allv), isn't there still network latency or overhead in the sendrecv exchanges for this zero-exchange? The previous algorithm just skipped this case; couldn't the pairwise one also?
>
> Simon
>
> On 21/12/2012 18:53, George Bosilca wrote:
>> I can argue the opposite: in the most general case, each process will exchange data with all other processes, so a blocking approach as implemented in the current version makes sense. As you noticed, this leads to poor results when the exchange pattern is sparse. We took what we believed to be the most common usage of the alltoallv collective, and provided a default algorithm we consider the best for it (pairwise, owing to its tightly coupled communication structure).
>>
>> However, as one of the main developers of the collective module, I'm not insensitive to your argument. I would have loved to be able to alter the behavior of the alltoallv to adapt more carefully to the collective pattern itself. Unfortunately, this is very difficult, as alltoallv is one of the few collectives where knowledge about the communication pattern is not evenly distributed among the peers (every rank knows only about the communications in which it is involved). Thus, without requiring extra communications, the only valid parameter that can affect the selection of one of the underlying implementations is the number of participants in the collective (not even the number of participants exchanging real data, but the number of participants in the communicator). Not enough to make a smarter decision.
>>
>> As suggested several times already in this thread, there are quite a few MCA parameters that allow specialized behaviors for applications with communication patterns we did not consider mainstream. You should definitely take advantage of these to further optimize your applications.
>>
>> George.
>>
>> On Dec 21, 2012, at 13:25, Number Cruncher <number.crunc...@ntlworld.com> wrote:
>>
>>> I completely understand that there's no "one size fits all", and I appreciate that there are workarounds to the change in behaviour. I'm only trying to make what little contribution I can by questioning the architecture of the pairwise algorithm.
>>>
>>> I know that for every user you please, there will be some who aren't happy when a default changes (Windows 8, anyone?), but I'm trying to provide some real-world data. If 90% of apps are 10% faster and 10% are 1000% slower, should the default change?
>>>
>>> all_to_all is a really nice primitive from a developer's point of view. Every process's code is symmetric and identical. Maybe I should have to worry that most of the exchange matrix is sparse; I probably could calculate an optimal exchange pattern. But I think this is the implementation's job, and I will continue to argue that *waiting* for each pairwise exchange is (a) unnecessary, (b) doesn't improve performance for *any* application, and (c) at worst causes huge slowdown over the previous algorithm for sparse cases.
>>>
>>> In summary: I'm arguing that there's a problem with the pairwise implementation as it stands. It doesn't have any optimization for sparse all_to_all, and it imposes unnecessary synchronisation barriers in all cases.
>>>
>>> Simon
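To make the design under discussion concrete: below is a rough sketch of a pairwise exchange of the kind being criticized, assuming a simple MPI_Sendrecv ring with increasing stride (the actual Open MPI implementation may differ; datatypes are reduced to MPI_BYTE for brevity):

    /* Rough sketch of a pairwise exchange -- not the actual Open MPI
     * implementation.  Datatypes are reduced to bytes for brevity. */
    #include <mpi.h>

    int pairwise_alltoallv(char *sbuf, const int *scounts, const int *sdispls,
                           char *rbuf, const int *rcounts, const int *rdispls,
                           MPI_Comm comm)
    {
        int rank, size;
        MPI_Comm_rank(comm, &rank);
        MPI_Comm_size(comm, &size);

        for (int step = 0; step < size; step++) {
            int sendto   = (rank + step) % size;          /* increasing stride */
            int recvfrom = (rank - step + size) % size;
            /* Blocks here: step N+1 cannot start before step N has
             * completed, even when both counts are zero.  This is the
             * per-step synchronisation being objected to. */
            MPI_Sendrecv(sbuf + sdispls[sendto], scounts[sendto], MPI_BYTE,
                         sendto, 0,
                         rbuf + rdispls[recvfrom], rcounts[recvfrom], MPI_BYTE,
                         recvfrom, 0, comm, MPI_STATUS_IGNORE);
        }
        return MPI_SUCCESS;
    }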
>>> On 20/12/2012 14:42, Iliev, Hristo wrote:
>>>> Simon,
>>>>
>>>> The goal of any MPI implementation is to be as fast as possible. Unfortunately, there is no "one size fits all" algorithm that works on all networks and for every peculiarity your specific communication scheme may have. That's why there are different algorithms, and why you are given the option to select them dynamically at run time without the need to recompile the code. I don't think the change of the default algorithm (note that the pairwise algorithm has been there for many years - it is not new, it is simply the new default) was introduced in order to piss users off.
>>>>
>>>> If you want OMPI to default to the previous algorithm:
>>>>
>>>> 1) Add this to the system-wide OMPI configuration file $sysconf/openmpi-mca-params.conf (where $sysconf would most likely be $PREFIX/etc, with $PREFIX being the OMPI installation directory):
>>>>
>>>> coll_tuned_use_dynamic_rules = 1
>>>> coll_tuned_alltoallv_algorithm = 1
>>>>
>>>> 2) The settings from (1) can be overridden on a per-user basis by similar settings in $HOME/.openmpi/mca-params.conf.
>>>>
>>>> 3) The settings from (1) and (2) can be overridden on a per-job basis by exporting the MCA parameters as environment variables:
>>>>
>>>> export OMPI_MCA_coll_tuned_use_dynamic_rules=1
>>>> export OMPI_MCA_coll_tuned_alltoallv_algorithm=1
>>>>
>>>> 4) Finally, the settings from (1), (2), and (3) can be overridden for an individual MPI program launch by supplying the appropriate MCA parameters to orterun (a.k.a. mpirun and mpiexec).
>>>>
>>>> There is also a largely undocumented feature of the "tuned" collective component whereby a dynamic rules file can be supplied. In the file, a series of cases tells the library which implementation to use based on the communicator and message sizes. No idea if it works with ALLTOALLV.
>>>>
>>>> Kind regards,
>>>> Hristo
>>>>
>>>> (sorry for top posting - damn you, Outlook!)
>>>> --
>>>> Hristo Iliev, Ph.D. -- High Performance Computing
>>>> RWTH Aachen University, Center for Computing and Communication
>>>> Rechen- und Kommunikationszentrum der RWTH Aachen
>>>> Seffenter Weg 23, D 52074 Aachen (Germany)
>>>>
>>>>> -----Original Message-----
>>>>> From: users-boun...@open-mpi.org [mailto:users-boun...@open-mpi.org] On Behalf Of Number Cruncher
>>>>> Sent: Wednesday, December 19, 2012 5:31 PM
>>>>> To: Open MPI Users
>>>>> Subject: Re: [OMPI users] MPI_Alltoallv performance regression 1.6.0 to 1.6.1
>>>>>
>>>>> On 19/12/12 11:08, Paul Kapinos wrote:
>>>>>> Did you *really* want to dig into the code just in order to switch a default communication algorithm?
>>>>>
>>>>> No, I didn't want to, but with a huge change in performance I'm forced to do something! And having looked at the different algorithms, I think there's a problem with the new default whenever message sizes are small enough that connection latency dominates. We're not all running InfiniBand, and having to wait for each pairwise exchange to complete before initiating another seems wrong if the latency in waiting for completion dominates the transmission time.
>>>>>
>>>>> E.g. if I have 10 small pairwise exchanges to perform, isn't it better to put all 10 outbound messages on the wire and wait for the 10 matching inbound messages, in any order? The new algorithm must wait for the first exchange to complete, then the second, then the third. Unlike before, doesn't it also have to wait for the matching *zero-sized* requests to complete? I don't see why this temporal ordering matters.
>>>>>
>>>>> Thanks for your help,
>>>>> Simon
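For comparison, a sketch of the "post everything, wait once" strategy of the old basic linear algorithm, including the zero-size skip Simon is asking for (again an illustration under the same simplifications, not the Open MPI source):

    /* Rough sketch of the "post everything, wait once" strategy of the
     * old basic linear algorithm, with the zero-size skip -- again an
     * illustration, not the Open MPI source. */
    #include <mpi.h>
    #include <stdlib.h>

    int linear_alltoallv(char *sbuf, const int *scounts, const int *sdispls,
                         char *rbuf, const int *rcounts, const int *rdispls,
                         MPI_Comm comm)
    {
        int size, nreq = 0;
        MPI_Comm_size(comm, &size);
        MPI_Request *reqs = malloc(2 * size * sizeof(MPI_Request));

        for (int peer = 0; peer < size; peer++)      /* receives first */
            if (rcounts[peer] > 0)
                MPI_Irecv(rbuf + rdispls[peer], rcounts[peer], MPI_BYTE,
                          peer, 0, comm, &reqs[nreq++]);
        for (int peer = 0; peer < size; peer++)      /* then sends */
            if (scounts[peer] > 0)
                MPI_Isend(sbuf + sdispls[peer], scounts[peer], MPI_BYTE,
                          peer, 0, comm, &reqs[nreq++]);

        /* A single wait; messages complete in whatever order the network
         * delivers them, and zero-size pairs cost nothing at all. */
        int err = MPI_Waitall(nreq, reqs, MPI_STATUSES_IGNORE);
        free(reqs);
        return err;
    }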
>>>>>> Note there are several ways to set the parameters; --mca on the command line is just one of them (suitable for quick online tests):
>>>>>>
>>>>>> http://www.open-mpi.org/faq/?category=tuning#setting-mca-params
>>>>>>
>>>>>> We 'tune' our Open MPI by setting environment variables.
>>>>>>
>>>>>> Best,
>>>>>> Paul Kapinos
>>>>>>
>>>>>> On 12/19/12 11:44, Number Cruncher wrote:
>>>>>>> Having run some more benchmarks, the new default is *really* bad for our application (2-10x slower), so I've been looking at the source to try to figure out why.
>>>>>>>
>>>>>>> It seems that the biggest difference occurs when the all_to_all is actually sparse (e.g. in our application). If most N-M process exchanges are zero in size, the 1.6 ompi_coll_tuned_alltoallv_intra_basic_linear algorithm will only post irecv/isend for the non-zero exchanges; any zero-size exchanges are skipped. It then waits once for all requests to complete. In contrast, the new ompi_coll_tuned_alltoallv_intra_pairwise posts the zero-size exchanges for *every* N-M pair, and waits for each pairwise exchange. This is O(comm_size) waits, many of which are for zero-size exchanges. I'm not clear what optimizations there are for zero-size isend/irecv, but surely there's a great deal more latency if each pairwise exchange has to be confirmed complete before executing the next?
>>>>>>>
>>>>>>> Relatedly, how would I direct OpenMPI to use the older algorithm programmatically? I don't want the user to have to use "--mca" in their "mpiexec". Is there a C API?
>>>>>>>
>>>>>>> Thanks,
>>>>>>> Simon
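One possible workaround for the "C API" question (an illustration, not an official Open MPI interface) relies on the fact that Open MPI reads OMPI_MCA_* environment variables during startup, so each process can set them for itself before calling MPI_Init:

    /* Illustrative workaround, not an official Open MPI C API: MCA
     * parameters can be supplied through the environment, and every rank
     * can set them for itself as long as this happens before MPI_Init
     * reads them. */
    #include <mpi.h>
    #include <stdlib.h>

    int main(int argc, char **argv)
    {
        /* Equivalent to the two --mca arguments on the mpiexec line. */
        setenv("OMPI_MCA_coll_tuned_use_dynamic_rules", "1", 1);
        setenv("OMPI_MCA_coll_tuned_alltoallv_algorithm", "1", 1);

        MPI_Init(&argc, &argv);
        /* ... application code using MPI_Alltoallv ... */
        MPI_Finalize();
        return 0;
    }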
>>>>>>> On 16/11/12 10:15, Iliev, Hristo wrote:
>>>>>>>> Hi Simon,
>>>>>>>>
>>>>>>>> The pairwise algorithm passes messages in a synchronised ring-like fashion with increasing stride, so it works best when independent communication paths can be established between several ports of the network switch/router. Some 1 Gbps Ethernet equipment is not capable of doing so, some is - it depends (usually on the price). That said, not all algorithms perform the same on a given type of network interconnect. For example, on our fat-tree InfiniBand network the pairwise algorithm performs better.
>>>>>>>>
>>>>>>>> You can switch back to the basic linear algorithm by providing the following MCA parameters:
>>>>>>>>
>>>>>>>> mpiexec --mca coll_tuned_use_dynamic_rules 1 --mca coll_tuned_alltoallv_algorithm 1 ...
>>>>>>>>
>>>>>>>> Algorithm 1 is the basic linear one, which used to be the default; algorithm 2 is the pairwise one. You can also set these values as exported environment variables:
>>>>>>>>
>>>>>>>> export OMPI_MCA_coll_tuned_use_dynamic_rules=1
>>>>>>>> export OMPI_MCA_coll_tuned_alltoallv_algorithm=1
>>>>>>>> mpiexec ...
>>>>>>>>
>>>>>>>> You can also put this in $HOME/.openmpi/mca-params.conf or (to make it have global effect) in $OPAL_PREFIX/etc/openmpi-mca-params.conf:
>>>>>>>>
>>>>>>>> coll_tuned_use_dynamic_rules=1
>>>>>>>> coll_tuned_alltoallv_algorithm=1
>>>>>>>>
>>>>>>>> A gratuitous hint: dual-Opteron systems are NUMA machines, so it makes sense to activate process binding with --bind-to-core if you haven't already done so. It prevents MPI processes from being migrated to other NUMA nodes while running.
>>>>>>>>
>>>>>>>> Kind regards,
>>>>>>>> Hristo
>>>>>>>> --
>>>>>>>> Hristo Iliev, Ph.D. -- High Performance Computing
>>>>>>>> RWTH Aachen University, Center for Computing and Communication
>>>>>>>> Rechen- und Kommunikationszentrum der RWTH Aachen
>>>>>>>> Seffenter Weg 23, D 52074 Aachen (Germany)
>>>>>>>>
>>>>>>>>> -----Original Message-----
>>>>>>>>> From: users-boun...@open-mpi.org [mailto:users-boun...@open-mpi.org] On Behalf Of Number Cruncher
>>>>>>>>> Sent: Thursday, November 15, 2012 5:37 PM
>>>>>>>>> To: Open MPI Users
>>>>>>>>> Subject: [OMPI users] MPI_Alltoallv performance regression 1.6.0 to 1.6.1
>>>>>>>>>
>>>>>>>>> I've noticed a very significant (100%) slowdown for MPI_Alltoallv calls as of version 1.6.1.
>>>>>>>>>
>>>>>>>>> * This is most noticeable for high-frequency exchanges over 1Gb ethernet, where process-to-process message sizes are fairly small (e.g. 100 kbyte) and much of the exchange matrix is sparse.
>>>>>>>>> * The 1.6.1 release notes mention "Switch the MPI_ALLTOALLV default algorithm to a pairwise exchange", but I'm not clear what this means or how to switch back to the old "non-default algorithm".
>>>>>>>>>
>>>>>>>>> I attach a test program which illustrates the sort of usage in our MPI application. I have run this as 32 processes on four nodes, over 1Gb ethernet, each node with 2x Opteron 4180 (dual hex-core); ranks 0,4,8,... on node 1, ranks 1,5,9,... on node 2, etc.
>>>>>>>>>
>>>>>>>>> It constructs an array of integers and an nProcess x nProcess exchange pattern typical of part of our application. This is then exchanged several thousand times. Output from "mpicc -O3" runs is shown below.
>>>>>>>>>
>>>>>>>>> My guess is that 1.6.1 is hitting additional latency not present in 1.6.0. I also attach a plot showing network throughput on our actual mesh generation application. Nodes cfsc01-04 are running 1.6.0 and finish within 35 minutes. Nodes cfsc05-08 are running 1.6.2 (started 10 minutes later) and take over an hour to run. There seems to be much greater network demand in the 1.6.1 version, despite the user code and input data being identical.
>>>>>>>>>
>>>>>>>>> Thanks for any help you can give,
>>>>>>>>> Simon
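The test program and plot attached to the original message are not preserved here. Purely as an illustration of the usage pattern it describes (all names, sizes and the neighbour-only sparsity are invented), a minimal benchmark might look like this:

    /* Not the attachment from the original post -- just a minimal sketch
     * of the pattern described: a sparse nProcess x nProcess exchange,
     * ~100 kbyte per non-zero pair, repeated many times. */
    #include <mpi.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>

    #define NITER 2000
    #define CHUNK 25000                 /* ints per non-zero exchange */

    int main(int argc, char **argv)
    {
        int rank, size;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        int *scounts = calloc(size, sizeof(int));
        int *rcounts = calloc(size, sizeof(int));
        int *sdispls = calloc(size, sizeof(int));
        int *rdispls = calloc(size, sizeof(int));

        /* Sparse, symmetric pattern: exchange only with the two ring
         * neighbours; every other entry of the matrix stays zero. */
        scounts[(rank + 1) % size] = CHUNK;
        scounts[(rank - 1 + size) % size] = CHUNK;
        memcpy(rcounts, scounts, size * sizeof(int));

        int stot = 0, rtot = 0;
        for (int i = 0; i < size; i++) {
            sdispls[i] = stot; stot += scounts[i];
            rdispls[i] = rtot; rtot += rcounts[i];
        }
        int *sbuf = malloc(stot * sizeof(int));
        int *rbuf = malloc(rtot * sizeof(int));
        for (int i = 0; i < stot; i++) sbuf[i] = rank;

        double t0 = MPI_Wtime();
        for (int it = 0; it < NITER; it++)
            MPI_Alltoallv(sbuf, scounts, sdispls, MPI_INT,
                          rbuf, rcounts, rdispls, MPI_INT, MPI_COMM_WORLD);
        if (rank == 0)
            printf("%d exchanges in %.2f s\n", NITER, MPI_Wtime() - t0);

        MPI_Finalize();
        return 0;
    }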