Also, this presentation might be useful: http://extremecomputingtraining.anl.gov/files/2013/07/tuesday-slides2.pdf
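Jeff's bounce-buffer suggestion for short messages (quoted below) might look roughly like the following sketch. BOUNCE_SIZE, the tag, and the restriction to a single outstanding send are my assumptions, not something from his mail:

    #include <mpi.h>
    #include <string.h>

    #define BOUNCE_SIZE 1024          /* assumed "short" threshold; platform-dependent */

    static char bounce[BOUNCE_SIZE];  /* pre-allocated staging buffer */

    /* Copy a short message into the bounce buffer, then send it without
       tying up the caller's buffer. The caller may modify buf as soon as
       this returns, but must complete *req before the next call reuses
       the bounce buffer. */
    void send_short(const void *buf, int nbytes, int dest,
                    MPI_Comm comm, MPI_Request *req)
    {
        memcpy(bounce, buf, nbytes);  /* cheap when nbytes is small */
        MPI_Isend(bounce, nbytes, MPI_BYTE, dest, 0 /* tag */, comm, req);
    }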
Thank you,
Saliya

On Mar 17, 2014 2:18 PM, "christophe petit" <christophe.peti...@gmail.com> wrote:

> Thanks Jeff, I understand the different cases better now, and how to
> choose depending on the situation.
>
> 2014-03-17 16:31 GMT+01:00 Jeff Squyres (jsquyres) <jsquy...@cisco.com>:
>
>> On Mar 16, 2014, at 10:24 PM, christophe petit
>> <christophe.peti...@gmail.com> wrote:
>>
>> > I am studying optimization strategies for code where the number of
>> > communication calls is high.
>> >
>> > My courses on MPI say two things about optimization which are
>> > contradictory:
>> >
>> > 1*) You have to use a temporary message copy to allow non-blocking
>> > sending and to decouple the sending and receiving.
>>
>> There are a lot of schools of thought here, and the real answer is going
>> to depend on your application.
>>
>> If the message is "short" (and the exact definition of "short" depends on
>> your platform -- it varies depending on your CPU, your memory, your
>> CPU/memory interconnect, ...etc.), then copying to a pre-allocated bounce
>> buffer is typically a good idea. That lets you keep using your "real"
>> buffer and not have to wait until communication is done.
>>
>> For "long" messages, the equation is a bit different. If "long" isn't
>> "enormous", you might be able to have N buffers available, and simply work
>> on one of them at a time in your main application while using the others
>> for ongoing non-blocking communication. These are sometimes called
>> "shadow" copies, or "ghost" copies.
>>
>> Such shadow copies are most useful when you receive something each
>> iteration. For example, something like this:
>>
>>     buffer[0] = malloc(...);
>>     buffer[1] = malloc(...);
>>     current = 0;
>>     while (still_doing_iterations) {
>>         MPI_Irecv(buffer[current], ..., &req);
>>         /* work on buffer[1 - current] */
>>         MPI_Wait(&req, MPI_STATUS_IGNORE);
>>         current = 1 - current;
>>     }
>>
>> You get the idea.
>>
>> > 2*) Avoid using a temporary message copy because the copy will add
>> > extra cost to the execution time.
>>
>> It will, if the memcpy cost is significant (especially compared to the
>> network time to send it). If the memcpy is small/insignificant, then don't
>> worry about it.
>>
>> You'll need to determine where this crossover point is, however.
>>
>> Also keep in mind that MPI and/or the underlying network stack will
>> likely be doing these kinds of things under the covers for you. Indeed, if
>> you send short messages -- even via MPI_SEND -- it may return
>> "immediately", indicating that MPI says it's safe for you to use the send
>> buffer. But that doesn't mean that the message has actually left the
>> current server and gone out onto the network yet (i.e., some other layer
>> below you may have just done a memcpy because it was a short message, and
>> the processing/sending of that message is still ongoing).
>>
>> > And then, we are advised to:
>> >
>> > - replace MPI_SEND with MPI_SSEND (synchronous blocking send): it is
>> > said that execution time is divided by a factor of 2
>>
>> This very, very much depends on your application.
>>
>> MPI_SSEND won't return until the receiver has started to receive the
>> message.
>>
>> For some communication patterns, putting in this additional level of
>> synchronization is helpful -- it keeps all MPI processes in tighter
>> synchronization and you might experience less jitter, etc. And therefore
>> overall execution time is faster.
>>
>> But for others, it adds unnecessary delay.
>>
>> I'd say it's an over-generalization that simply replacing MPI_SEND with
>> MPI_SSEND always reduces execution time by a factor of 2.
>>
>> > - use MPI_ISSEND and MPI_IRECV with the MPI_WAIT function to
>> > synchronize (synchronous non-blocking send): it is said that execution
>> > time is divided by a factor of 3
>>
>> Again, it depends on the app. Generally, non-blocking communication is
>> better -- *if your app can effectively overlap communication and
>> computation*.
>>
>> If your app doesn't take advantage of this overlap, then you won't see
>> such performance benefits. For example:
>>
>>     MPI_Isend(buffer, ..., &req);
>>     MPI_Wait(&req, ...);
>>
>> Technically, the above uses ISEND and WAIT... but it's actually probably
>> going to be *slower* than using MPI_SEND, because you've made multiple
>> function calls with no additional work between the two -- so the app
>> didn't effectively overlap the communication with any local computation.
>> Hence: no performance benefit.
>>
>> > So what's the best optimization? Do we have to use a temporary message
>> > copy or not, and if so, in which cases?
>>
>> As you can probably see from my text above, the answer is: it depends.
>> :-)
>>
>> --
>> Jeff Squyres
>> jsquy...@cisco.com
>> For corporate legal information go to:
>> http://www.cisco.com/web/about/doing_business/legal/cri/
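To make the overlap point above concrete, here is a rough sketch of the post-compute-wait pattern. The halo-exchange framing and the compute_* helpers are illustrative assumptions, not from Jeff's mail:

    #include <mpi.h>

    static void compute_interior(void)         { /* work that does not need the halo */ }
    static void compute_boundary(double *halo) { (void)halo; /* work that does */ }

    /* Post the non-blocking calls first, do independent local work, and
       only then wait; ideally the wait finds the transfers already done. */
    void exchange_and_compute(double *recv_halo, double *send_halo, int n,
                              int left, int right, MPI_Comm comm)
    {
        MPI_Request reqs[2];

        MPI_Irecv(recv_halo, n, MPI_DOUBLE, left,  0, comm, &reqs[0]);
        MPI_Isend(send_halo, n, MPI_DOUBLE, right, 0, comm, &reqs[1]);

        compute_interior();              /* overlapped with the transfers above */

        MPI_Waitall(2, reqs, MPI_STATUSES_IGNORE);

        compute_boundary(recv_halo);     /* needs the received data */
    }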