I have no definitive argument. In general, I have to admit that we envision the worst-case scenario and try to come up with a solution that solves it, when possible with minimal overhead for the other cases. Keeping every process in sync sounded like a good approach for minimizing the burden of unexpected messages. However, this will [again] depend on the communication pattern exhibited by the alltoallv operation.
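Any pattern-aware choice runs into the problem George describes in the quoted exchange below: each rank sees only its own row of the exchange matrix. Purely as an illustration (this is not Open MPI code; the helper name and the 25% threshold are invented), here is the kind of extra collective that would be needed for all ranks to agree that a given alltoallv is sparse:

    /* Hypothetical helper -- not Open MPI code.  Each rank only knows its
     * own row of the exchange matrix (its sendcounts), so agreeing that a
     * given alltoallv is sparse costs an extra collective, which is
     * precisely the overhead the library avoids by deciding on the
     * communicator size alone.  The 25% threshold is arbitrary. */
    #include <mpi.h>

    static int alltoallv_is_sparse(const int *sendcounts, MPI_Comm comm)
    {
        int size, nonzero = 0, total_nonzero;
        MPI_Comm_size(comm, &size);
        for (int i = 0; i < size; i++)
            if (sendcounts[i] > 0)
                nonzero++;
        /* One extra reduction just to learn the global pattern. */
        MPI_Allreduce(&nonzero, &total_nonzero, 1, MPI_INT, MPI_SUM, comm);
        return total_nonzero * 4 < size * size;   /* sparse if < 25% of pairs */
    }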
  George.

On Dec 22, 2012, at 12:47, Number Cruncher <number.crunc...@ntlworld.com> wrote:

> Thanks for the informative response. What I'm still not clear about is whether there isn't a very simple optimization for the zero-size case. If two processes know they aren't exchanging *any* data (which is known from the argument list of all_to_allv), isn't there still network latency or overhead in the sendrecv exchanges for this zero-exchange? The previous algorithm just skipped this case; couldn't the pairwise one also?
>
> Simon
>
> On 21/12/2012 18:53, George Bosilca wrote:
>> I can argue the opposite: in the most general case, each process will exchange data with all other processes, so a blocking approach as implemented in the current version makes sense. As you noticed, this leads to poor results when the exchange pattern is sparse. We took what we believed to be the most common usage of the alltoallv collective, and provided a default algorithm we consider the best for it (pairwise, owing to its tightly coupled communication structure).
>>
>> However, as one of the main developers of the collective module, I'm not insensitive to your argument. I would have loved to be able to alter the behavior of the alltoallv to adapt more carefully to the collective pattern itself. Unfortunately, this is very difficult, as alltoallv is one of the few collectives where knowledge about the communication pattern is not evenly distributed among the peers (every rank knows only about the communications in which it is involved). Thus, without requiring extra communications, the only valid parameter that can affect the selection of one of the underlying implementations is the number of participants in the collective (not even the number of participants exchanging real data, but the number of participants in the communicator). Not enough to make a smarter decision.
>>
>> As suggested several times already in this thread, there are quite a few MCA parameters that allow specialized behaviors for applications with communication patterns we did not consider mainstream. You should definitely take advantage of these to further optimize your applications.
>>
>> George.
>>
>> On Dec 21, 2012, at 13:25, Number Cruncher <number.crunc...@ntlworld.com> wrote:
>>
>>> I completely understand that there's no "one size fits all", and I appreciate that there are workarounds to the change in behaviour. I'm only trying to make what little contribution I can by questioning the architecture of the pairwise algorithm.
>>>
>>> I know that for every user you please, there will be some who aren't happy when a default changes (Windows 8, anyone?), but I'm trying to provide some real-world data. If 90% of apps are 10% faster and 10% are 1000% slower, should the default change?
>>>
>>> all_to_all is a really nice primitive from a developer's point of view. Every process's code is symmetric and identical. Maybe I should have to worry that most of the exchange matrix is sparse; I probably could calculate an optimal exchange pattern. But I think this is the implementation's job, and I will continue to argue that *waiting* for each pairwise exchange is (a) unnecessary, (b) doesn't improve performance for *any* application, and (c) at worst causes huge slowdown over the previous algorithm for sparse cases.
>>>
>>> In summary: I'm arguing that there's a problem with the pairwise implementation as it stands. It doesn't have any optimization for sparse all_to_all, and it imposes unnecessary synchronisation barriers in all cases.
>>>
>>> Simon
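To make the design under discussion concrete: below is a rough sketch of a pairwise exchange of the kind being criticized, assuming a simple MPI_Sendrecv ring with increasing stride (the actual Open MPI implementation may differ; datatypes are reduced to MPI_BYTE for brevity):

    /* Rough sketch of a pairwise exchange -- not the actual Open MPI
     * implementation.  Datatypes are reduced to bytes for brevity. */
    #include <mpi.h>

    int pairwise_alltoallv(char *sbuf, const int *scounts, const int *sdispls,
                           char *rbuf, const int *rcounts, const int *rdispls,
                           MPI_Comm comm)
    {
        int rank, size;
        MPI_Comm_rank(comm, &rank);
        MPI_Comm_size(comm, &size);

        for (int step = 0; step < size; step++) {
            int sendto   = (rank + step) % size;          /* increasing stride */
            int recvfrom = (rank - step + size) % size;
            /* Blocks here: step N+1 cannot start before step N has
             * completed, even when both counts are zero.  This is the
             * per-step synchronisation being objected to. */
            MPI_Sendrecv(sbuf + sdispls[sendto], scounts[sendto], MPI_BYTE,
                         sendto, 0,
                         rbuf + rdispls[recvfrom], rcounts[recvfrom], MPI_BYTE,
                         recvfrom, 0, comm, MPI_STATUS_IGNORE);
        }
        return MPI_SUCCESS;
    }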
>>> On 20/12/2012 14:42, Iliev, Hristo wrote:
>>>> Simon,
>>>>
>>>> The goal of any MPI implementation is to be as fast as possible. Unfortunately, there is no "one size fits all" algorithm that works on all networks and for every peculiarity your specific communication scheme may have. That's why there are different algorithms, and why you are given the option to select them dynamically at run time without the need to recompile the code. I don't think the change of the default algorithm (note that the pairwise algorithm has been there for many years - it is not new, it is simply the new default) was introduced in order to piss users off.
>>>>
>>>> If you want OMPI to default to the previous algorithm:
>>>>
>>>> 1) Add this to the system-wide OMPI configuration file $sysconf/openmpi-mca-params.conf (where $sysconf would most likely be $PREFIX/etc, with $PREFIX being the OMPI installation directory):
>>>>
>>>> coll_tuned_use_dynamic_rules = 1
>>>> coll_tuned_alltoallv_algorithm = 1
>>>>
>>>> 2) The settings from (1) can be overridden on a per-user basis by similar settings in $HOME/.openmpi/mca-params.conf.
>>>>
>>>> 3) The settings from (1) and (2) can be overridden on a per-job basis by exporting the MCA parameters as environment variables:
>>>>
>>>> export OMPI_MCA_coll_tuned_use_dynamic_rules=1
>>>> export OMPI_MCA_coll_tuned_alltoallv_algorithm=1
>>>>
>>>> 4) Finally, the settings from (1), (2), and (3) can be overridden for an individual MPI program launch by supplying the appropriate MCA parameters to orterun (a.k.a. mpirun and mpiexec).
>>>>
>>>> There is also a largely undocumented feature of the "tuned" collective component whereby a dynamic rules file can be supplied. In the file, a series of cases tells the library which implementation to use based on the communicator and message sizes. No idea if it works with ALLTOALLV.
>>>>
>>>> Kind regards,
>>>> Hristo
>>>>
>>>> (sorry for top posting - damn you, Outlook!)
>>>> --
>>>> Hristo Iliev, Ph.D. -- High Performance Computing
>>>> RWTH Aachen University, Center for Computing and Communication
>>>> Rechen- und Kommunikationszentrum der RWTH Aachen
>>>> Seffenter Weg 23, D 52074 Aachen (Germany)
>>>>
>>>>> -----Original Message-----
>>>>> From: users-boun...@open-mpi.org [mailto:users-boun...@open-mpi.org] On Behalf Of Number Cruncher
>>>>> Sent: Wednesday, December 19, 2012 5:31 PM
>>>>> To: Open MPI Users
>>>>> Subject: Re: [OMPI users] MPI_Alltoallv performance regression 1.6.0 to 1.6.1
>>>>>
>>>>> On 19/12/12 11:08, Paul Kapinos wrote:
>>>>>> Did you *really* want to dig into the code just in order to switch a default communication algorithm?
>>>>>
>>>>> No, I didn't want to, but with a huge change in performance I'm forced to do something! And having looked at the different algorithms, I think there's a problem with the new default whenever message sizes are small enough that connection latency dominates. We're not all running InfiniBand, and having to wait for each pairwise exchange to complete before initiating another seems wrong if the latency in waiting for completion dominates the transmission time.
>>>>>
>>>>> E.g. if I have 10 small pairwise exchanges to perform, isn't it better to put all 10 outbound messages on the wire and wait for the 10 matching inbound messages, in any order? The new algorithm must wait for the first exchange to complete, then the second, then the third. Unlike before, doesn't it also have to wait for the matching *zero-sized* requests to complete? I don't see why this temporal ordering matters.
>>>>>
>>>>> Thanks for your help,
>>>>> Simon
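For comparison, a sketch of the "post everything, wait once" strategy of the old basic linear algorithm, including the zero-size skip Simon is asking for (again an illustration under the same simplifications, not the Open MPI source):

    /* Rough sketch of the "post everything, wait once" strategy of the
     * old basic linear algorithm, with the zero-size skip -- again an
     * illustration, not the Open MPI source. */
    #include <mpi.h>
    #include <stdlib.h>

    int linear_alltoallv(char *sbuf, const int *scounts, const int *sdispls,
                         char *rbuf, const int *rcounts, const int *rdispls,
                         MPI_Comm comm)
    {
        int size, nreq = 0;
        MPI_Comm_size(comm, &size);
        MPI_Request *reqs = malloc(2 * size * sizeof(MPI_Request));

        for (int peer = 0; peer < size; peer++)      /* receives first */
            if (rcounts[peer] > 0)
                MPI_Irecv(rbuf + rdispls[peer], rcounts[peer], MPI_BYTE,
                          peer, 0, comm, &reqs[nreq++]);
        for (int peer = 0; peer < size; peer++)      /* then sends */
            if (scounts[peer] > 0)
                MPI_Isend(sbuf + sdispls[peer], scounts[peer], MPI_BYTE,
                          peer, 0, comm, &reqs[nreq++]);

        /* A single wait; messages complete in whatever order the network
         * delivers them, and zero-size pairs cost nothing at all. */
        int err = MPI_Waitall(nreq, reqs, MPI_STATUSES_IGNORE);
        free(reqs);
        return err;
    }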
>>>>>> Note there are several ways to set the parameters; --mca on the command line is just one of them (suitable for quick online tests):
>>>>>>
>>>>>> http://www.open-mpi.org/faq/?category=tuning#setting-mca-params
>>>>>>
>>>>>> We 'tune' our Open MPI by setting environment variables.
>>>>>>
>>>>>> Best,
>>>>>> Paul Kapinos
>>>>>>
>>>>>> On 12/19/12 11:44, Number Cruncher wrote:
>>>>>>> Having run some more benchmarks, the new default is *really* bad for our application (2-10x slower), so I've been looking at the source to try to figure out why.
>>>>>>>
>>>>>>> It seems that the biggest difference occurs when the all_to_all is actually sparse (e.g. in our application). If most N-M process exchanges are zero in size, the 1.6 ompi_coll_tuned_alltoallv_intra_basic_linear algorithm will only post irecv/isend for the non-zero exchanges; any zero-size exchanges are skipped. It then waits once for all requests to complete. In contrast, the new ompi_coll_tuned_alltoallv_intra_pairwise posts the zero-size exchanges for *every* N-M pair, and waits for each pairwise exchange. This is O(comm_size) waits, many of which are for zero-size exchanges. I'm not clear what optimizations there are for zero-size isend/irecv, but surely there's a great deal more latency if each pairwise exchange has to be confirmed complete before executing the next?
>>>>>>>
>>>>>>> Relatedly, how would I direct OpenMPI to use the older algorithm programmatically? I don't want the user to have to use "--mca" in their "mpiexec". Is there a C API?
>>>>>>>
>>>>>>> Thanks,
>>>>>>> Simon
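One possible workaround for the "C API" question (an illustration, not an official Open MPI interface) relies on the fact that Open MPI reads OMPI_MCA_* environment variables during startup, so each process can set them for itself before calling MPI_Init:

    /* Illustrative workaround, not an official Open MPI C API: MCA
     * parameters can be supplied through the environment, and every rank
     * can set them for itself as long as this happens before MPI_Init
     * reads them. */
    #include <mpi.h>
    #include <stdlib.h>

    int main(int argc, char **argv)
    {
        /* Equivalent to the two --mca arguments on the mpiexec line. */
        setenv("OMPI_MCA_coll_tuned_use_dynamic_rules", "1", 1);
        setenv("OMPI_MCA_coll_tuned_alltoallv_algorithm", "1", 1);

        MPI_Init(&argc, &argv);
        /* ... application code using MPI_Alltoallv ... */
        MPI_Finalize();
        return 0;
    }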
>>>>>>> On 16/11/12 10:15, Iliev, Hristo wrote:
>>>>>>>> Hi Simon,
>>>>>>>>
>>>>>>>> The pairwise algorithm passes messages in a synchronised ring-like fashion with increasing stride, so it works best when independent communication paths can be established between several ports of the network switch/router. Some 1 Gbps Ethernet equipment is not capable of doing so, some is - it depends (usually on the price). That said, not all algorithms perform the same on a given type of network interconnect. For example, on our fat-tree InfiniBand network the pairwise algorithm performs better.
>>>>>>>>
>>>>>>>> You can switch back to the basic linear algorithm by providing the following MCA parameters:
>>>>>>>>
>>>>>>>> mpiexec --mca coll_tuned_use_dynamic_rules 1 --mca coll_tuned_alltoallv_algorithm 1 ...
>>>>>>>>
>>>>>>>> Algorithm 1 is the basic linear one, which used to be the default; algorithm 2 is the pairwise one. You can also set these values as exported environment variables:
>>>>>>>>
>>>>>>>> export OMPI_MCA_coll_tuned_use_dynamic_rules=1
>>>>>>>> export OMPI_MCA_coll_tuned_alltoallv_algorithm=1
>>>>>>>> mpiexec ...
>>>>>>>>
>>>>>>>> You can also put this in $HOME/.openmpi/mca-params.conf or (to make it have global effect) in $OPAL_PREFIX/etc/openmpi-mca-params.conf:
>>>>>>>>
>>>>>>>> coll_tuned_use_dynamic_rules=1
>>>>>>>> coll_tuned_alltoallv_algorithm=1
>>>>>>>>
>>>>>>>> A gratuitous hint: dual-Opteron systems are NUMA machines, so it makes sense to activate process binding with --bind-to-core if you haven't already done so. It prevents MPI processes from being migrated to other NUMA nodes while running.
>>>>>>>>
>>>>>>>> Kind regards,
>>>>>>>> Hristo
>>>>>>>> --
>>>>>>>> Hristo Iliev, Ph.D. -- High Performance Computing
>>>>>>>> RWTH Aachen University, Center for Computing and Communication
>>>>>>>> Rechen- und Kommunikationszentrum der RWTH Aachen
>>>>>>>> Seffenter Weg 23, D 52074 Aachen (Germany)
>>>>>>>>
>>>>>>>>> -----Original Message-----
>>>>>>>>> From: users-boun...@open-mpi.org [mailto:users-boun...@open-mpi.org] On Behalf Of Number Cruncher
>>>>>>>>> Sent: Thursday, November 15, 2012 5:37 PM
>>>>>>>>> To: Open MPI Users
>>>>>>>>> Subject: [OMPI users] MPI_Alltoallv performance regression 1.6.0 to 1.6.1
>>>>>>>>>
>>>>>>>>> I've noticed a very significant (100%) slowdown for MPI_Alltoallv calls as of version 1.6.1.
>>>>>>>>>
>>>>>>>>> * This is most noticeable for high-frequency exchanges over 1Gb ethernet, where process-to-process message sizes are fairly small (e.g. 100 kbyte) and much of the exchange matrix is sparse.
>>>>>>>>> * The 1.6.1 release notes mention "Switch the MPI_ALLTOALLV default algorithm to a pairwise exchange", but I'm not clear what this means or how to switch back to the old "non-default algorithm".
>>>>>>>>>
>>>>>>>>> I attach a test program which illustrates the sort of usage in our MPI application. I have run this as 32 processes on four nodes, over 1Gb ethernet, each node with 2x Opteron 4180 (dual hex-core); ranks 0,4,8,... on node 1, ranks 1,5,9,... on node 2, etc.
>>>>>>>>>
>>>>>>>>> It constructs an array of integers and an nProcess x nProcess exchange pattern typical of part of our application. This is then exchanged several thousand times. Output from "mpicc -O3" runs is shown below.
>>>>>>>>>
>>>>>>>>> My guess is that 1.6.1 is hitting additional latency not present in 1.6.0. I also attach a plot showing network throughput on our actual mesh generation application. Nodes cfsc01-04 are running 1.6.0 and finish within 35 minutes. Nodes cfsc05-08 are running 1.6.2 (started 10 minutes later) and take over an hour to run. There seems to be much greater network demand in the 1.6.1 version, despite the user code and input data being identical.
>>>>>>>>>
>>>>>>>>> Thanks for any help you can give,
>>>>>>>>> Simon
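The test program and plot attached to the original message are not preserved here. Purely as an illustration of the usage pattern it describes (all names, sizes and the neighbour-only sparsity are invented), a minimal benchmark might look like this:

    /* Not the attachment from the original post -- just a minimal sketch
     * of the pattern described: a sparse nProcess x nProcess exchange,
     * ~100 kbyte per non-zero pair, repeated many times. */
    #include <mpi.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>

    #define NITER 2000
    #define CHUNK 25000                 /* ints per non-zero exchange */

    int main(int argc, char **argv)
    {
        int rank, size;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        int *scounts = calloc(size, sizeof(int));
        int *rcounts = calloc(size, sizeof(int));
        int *sdispls = calloc(size, sizeof(int));
        int *rdispls = calloc(size, sizeof(int));

        /* Sparse, symmetric pattern: exchange only with the two ring
         * neighbours; every other entry of the matrix stays zero. */
        scounts[(rank + 1) % size] = CHUNK;
        scounts[(rank - 1 + size) % size] = CHUNK;
        memcpy(rcounts, scounts, size * sizeof(int));

        int stot = 0, rtot = 0;
        for (int i = 0; i < size; i++) {
            sdispls[i] = stot; stot += scounts[i];
            rdispls[i] = rtot; rtot += rcounts[i];
        }
        int *sbuf = malloc(stot * sizeof(int));
        int *rbuf = malloc(rtot * sizeof(int));
        for (int i = 0; i < stot; i++) sbuf[i] = rank;

        double t0 = MPI_Wtime();
        for (int it = 0; it < NITER; it++)
            MPI_Alltoallv(sbuf, scounts, sdispls, MPI_INT,
                          rbuf, rcounts, rdispls, MPI_INT, MPI_COMM_WORLD);
        if (rank == 0)
            printf("%d exchanges in %.2f s\n", NITER, MPI_Wtime() - t0);

        MPI_Finalize();
        return 0;
    }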