Also, this presentation might be useful: http://extremecomputingtraining.anl.gov/files/2013/07/tuesday-slides2.pdf
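Jeff's bounce-buffer suggestion for short messages (quoted below) might look roughly like the following sketch. BOUNCE_SIZE, the tag, and the restriction to a single outstanding send are my assumptions, not something from his mail:

    #include <mpi.h>
    #include <string.h>

    #define BOUNCE_SIZE 1024          /* assumed "short" threshold; platform-dependent */

    static char bounce[BOUNCE_SIZE];  /* pre-allocated staging buffer */

    /* Copy a short message into the bounce buffer, then send it without
       tying up the caller's buffer. The caller may modify buf as soon as
       this returns, but must complete *req before the next call reuses
       the bounce buffer. */
    void send_short(const void *buf, int nbytes, int dest,
                    MPI_Comm comm, MPI_Request *req)
    {
        memcpy(bounce, buf, nbytes);  /* cheap when nbytes is small */
        MPI_Isend(bounce, nbytes, MPI_BYTE, dest, 0 /* tag */, comm, req);
    }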
Thank you,
Saliya

On Mar 17, 2014 2:18 PM, "christophe petit" <christophe.peti...@gmail.com> wrote:

> Thanks Jeff, I understand the different cases better now, and how to
> choose depending on the situation.
>
> 2014-03-17 16:31 GMT+01:00 Jeff Squyres (jsquyres) <jsquy...@cisco.com>:
>
>> On Mar 16, 2014, at 10:24 PM, christophe petit
>> <christophe.peti...@gmail.com> wrote:
>>
>> > I am studying optimization strategies for code where the number of
>> > communication calls is high.
>> >
>> > My courses on MPI say two things about optimization which are
>> > contradictory:
>> >
>> > 1*) You have to use a temporary message copy to allow non-blocking
>> > sending and to decouple the sending and receiving.
>>
>> There are a lot of schools of thought here, and the real answer is going
>> to depend on your application.
>>
>> If the message is "short" (and the exact definition of "short" depends on
>> your platform -- it varies depending on your CPU, your memory, your
>> CPU/memory interconnect, ...etc.), then copying to a pre-allocated bounce
>> buffer is typically a good idea. That lets you keep using your "real"
>> buffer and not have to wait until communication is done.
>>
>> For "long" messages, the equation is a bit different. If "long" isn't
>> "enormous", you might be able to have N buffers available, and simply work
>> on one of them at a time in your main application while using the others
>> for ongoing non-blocking communication. These are sometimes called
>> "shadow" copies, or "ghost" copies.
>>
>> Such shadow copies are most useful when you receive something each
>> iteration. For example, something like this:
>>
>>     buffer[0] = malloc(...);
>>     buffer[1] = malloc(...);
>>     current = 0;
>>     while (still_doing_iterations) {
>>         MPI_Irecv(buffer[current], ..., &req);
>>         /* work on buffer[1 - current] */
>>         MPI_Wait(&req, MPI_STATUS_IGNORE);
>>         current = 1 - current;
>>     }
>>
>> You get the idea.
>>
>> > 2*) Avoid using a temporary message copy because the copy will add
>> > extra cost to the execution time.
>>
>> It will, if the memcpy cost is significant (especially compared to the
>> network time to send it). If the memcpy is small/insignificant, then don't
>> worry about it.
>>
>> You'll need to determine where this crossover point is, however.
>>
>> Also keep in mind that MPI and/or the underlying network stack will
>> likely be doing these kinds of things under the covers for you. Indeed, if
>> you send short messages -- even via MPI_SEND -- it may return
>> "immediately", indicating that MPI says it's safe for you to use the send
>> buffer. But that doesn't mean that the message has actually left the
>> current server and gone out onto the network yet (i.e., some other layer
>> below you may have just done a memcpy because it was a short message, and
>> the processing/sending of that message is still ongoing).
>>
>> > And then, we are advised to:
>> >
>> > - replace MPI_SEND with MPI_SSEND (synchronous blocking send): it is
>> > said that execution time is divided by a factor of 2
>>
>> This very, very much depends on your application.
>>
>> MPI_SSEND won't return until the receiver has started to receive the
>> message.
>>
>> For some communication patterns, putting in this additional level of
>> synchronization is helpful -- it keeps all MPI processes in tighter
>> synchronization and you might experience less jitter, etc. And therefore
>> overall execution time is faster.
>>
>> But for others, it adds unnecessary delay.
>>
>> I'd say it's an over-generalization that simply replacing MPI_SEND with
>> MPI_SSEND always reduces execution time by a factor of 2.
>>
>> > - use MPI_ISSEND and MPI_IRECV with the MPI_WAIT function to
>> > synchronize (synchronous non-blocking send): it is said that execution
>> > time is divided by a factor of 3
>>
>> Again, it depends on the app. Generally, non-blocking communication is
>> better -- *if your app can effectively overlap communication and
>> computation*.
>>
>> If your app doesn't take advantage of this overlap, then you won't see
>> such performance benefits. For example:
>>
>>     MPI_Isend(buffer, ..., &req);
>>     MPI_Wait(&req, ...);
>>
>> Technically, the above uses ISEND and WAIT... but it's actually probably
>> going to be *slower* than using MPI_SEND, because you've made multiple
>> function calls with no additional work between the two -- so the app
>> didn't effectively overlap the communication with any local computation.
>> Hence: no performance benefit.
>>
>> > So what's the best optimization? Do we have to use a temporary message
>> > copy or not, and if so, in which cases?
>>
>> As you can probably see from my text above, the answer is: it depends.
>> :-)
>>
>> --
>> Jeff Squyres
>> jsquy...@cisco.com
>> For corporate legal information go to:
>> http://www.cisco.com/web/about/doing_business/legal/cri/
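To make the overlap point above concrete, here is a rough sketch of the post-compute-wait pattern. The halo-exchange framing and the compute_* helpers are illustrative assumptions, not from Jeff's mail:

    #include <mpi.h>

    static void compute_interior(void)         { /* work that does not need the halo */ }
    static void compute_boundary(double *halo) { (void)halo; /* work that does */ }

    /* Post the non-blocking calls first, do independent local work, and
       only then wait; ideally the wait finds the transfers already done. */
    void exchange_and_compute(double *recv_halo, double *send_halo, int n,
                              int left, int right, MPI_Comm comm)
    {
        MPI_Request reqs[2];

        MPI_Irecv(recv_halo, n, MPI_DOUBLE, left,  0, comm, &reqs[0]);
        MPI_Isend(send_halo, n, MPI_DOUBLE, right, 0, comm, &reqs[1]);

        compute_interior();              /* overlapped with the transfers above */

        MPI_Waitall(2, reqs, MPI_STATUSES_IGNORE);

        compute_boundary(recv_halo);     /* needs the received data */
    }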