Hi, Amjad:

[...]
What I do is start non-blocking MPI communication on the partition boundary faces (faces shared between any two processors), and then start computing values on the internal/non-shared faces. When I finish that computation, I call WAITALL to ensure the MPI communication has completed. Then I do the computation on the partition boundary faces (the shared ones). This way I try to hide the communication behind computation. Is this correct?

As long as your numerical method allows you to do this (that is, you definitely don't need those boundary values to compute the internal values), then yes, this approach can hide some of the communication costs very effectively. If I were doing it from scratch, I'd build it up in stages: first do the usual blocking approach (no one computes anything until all the faces are exchanged) and get that working; then break the computation step into internal and boundary computations and make sure it still works; then change the messaging to isends/irecvs/waitalls and make sure it still works; and only then interleave the communication with the internal computation.
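To make that last stage concrete, here is a minimal C sketch of the interleaved structure. The routines compute_internal_faces() and compute_boundary_faces(), and the per-neighbour send/receive buffers, are stand-ins for whatever your code actually does -- the only point being made is where the posts, the internal work, and the Waitall sit relative to each other:

    /* Sketch only: hypothetical helpers, not from any real code. */
    #include <mpi.h>

    void compute_internal_faces(void);              /* hypothetical */
    void compute_boundary_faces(double **recvbuf);  /* hypothetical */

    void exchange_and_compute(int nneigh, const int *neighbour,
                              const int *counts,
                              double **sendbuf, double **recvbuf)
    {
        MPI_Request reqs[2 * nneigh];   /* C99 VLA; fine for a handful of neighbours */

        /* Post all non-blocking receives and sends first ... */
        for (int i = 0; i < nneigh; i++)
            MPI_Irecv(recvbuf[i], counts[i], MPI_DOUBLE, neighbour[i],
                      0, MPI_COMM_WORLD, &reqs[i]);
        for (int i = 0; i < nneigh; i++)
            MPI_Isend(sendbuf[i], counts[i], MPI_DOUBLE, neighbour[i],
                      0, MPI_COMM_WORLD, &reqs[nneigh + i]);

        /* ... then do the work that needs no remote data ... */
        compute_internal_faces();

        /* ... wait for all the exchanges to complete ... */
        MPI_Waitall(2 * nneigh, reqs, MPI_STATUSES_IGNORE);

        /* ... and only now touch the partition-boundary faces. */
        compute_boundary_faces(recvbuf);
    }

The structural point is simply that both the Irecvs and Isends are posted before any computation starts, and the Waitall sits between the internal work and the boundary work.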

IMPORTANT: Secondly, if processor A shares 50 faces (on 50 or fewer elements) with another processor B, then it sends/receives 50 different messages. So in general, if a processor has X faces shared with any number of other processors, it sends/receives that many messages. Does this approach have much worse performance than the alternative, in which processor A sends/receives a single bundled message (containing all 50 faces' data) to/from processor B? In that case a processor would only send/receive as many messages as it has neighbouring processors, one packed bundle per neighbour.
Is there much of a difference between these two approaches?

Your individual element faces that are being communicated are likely quite small. It is quite generally the case that bundling many small messages into large messages can significantly improve performance, as you avoid incurring the repeated latency costs of sending many messages.

As always, though, the answer is `it depends', and the only way to know is to try it both ways. If you really do hide most of the communications cost with your non-blocking communications, then it may not matter too much. In addition, if you don't know beforehand how much data you need to send/receive, then you'll need a handshaking step which introduces more synchronization and may actually hurt performance, or you'll have to use MPI2 one-sided communications. On the other hand, if this shared boundary doesn't change through the simulation, you could just figure out at start-up time how big the messages will be between neighbours and use that as the basis for the usual two-sided messages.
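As a rough illustration (again with made-up names, and assuming the set of shared faces per neighbour is fixed for the whole run), the bundled version might look something like this in C: at start-up you record which local faces belong to each neighbour, and every step you pack those values into one buffer and post a single send/receive per neighbour rather than one per face:

    /* Sketch only: your data structures will differ. */
    #include <mpi.h>
    #include <stdlib.h>

    typedef struct {
        int rank;         /* neighbouring processor */
        int nfaces;       /* number of faces shared with that neighbour */
        int *face_ids;    /* local indices of those faces */
        double *sendbuf;  /* one packed buffer per neighbour */
        double *recvbuf;
    } Neighbour;

    /* Start-up: message sizes are known once, since the partition
     * boundary doesn't change during the simulation. */
    void setup_neighbour(Neighbour *nb, int rank, int nfaces, const int *face_ids)
    {
        nb->rank     = rank;
        nb->nfaces   = nfaces;
        nb->face_ids = malloc(nfaces * sizeof(int));
        for (int f = 0; f < nfaces; f++) nb->face_ids[f] = face_ids[f];
        nb->sendbuf  = malloc(nfaces * sizeof(double));
        nb->recvbuf  = malloc(nfaces * sizeof(double));
    }

    /* Each step: pack, then one Isend/Irecv pair per neighbour. */
    void post_exchange(Neighbour *nb, const double *face_values,
                       MPI_Request *send_req, MPI_Request *recv_req)
    {
        for (int f = 0; f < nb->nfaces; f++)
            nb->sendbuf[f] = face_values[nb->face_ids[f]];

        MPI_Irecv(nb->recvbuf, nb->nfaces, MPI_DOUBLE, nb->rank,
                  0, MPI_COMM_WORLD, recv_req);
        MPI_Isend(nb->sendbuf, nb->nfaces, MPI_DOUBLE, nb->rank,
                  0, MPI_COMM_WORLD, send_req);
    }

The one thing both sides have to agree on is the ordering of faces within the packed buffer: as long as rank A packs in the same order rank B expects to unpack, a single message per neighbour pair is all you need.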

My experience is that there's an excellent chance you'll improve the performance by packing the little messages into fewer larger messages.

   Jonathan
--
Jonathan Dursi     <ljdu...@scinet.utoronto.ca>
