George Bosilca wrote:

MPI does not impose any global order on messages. The only requirement is that, between two peers on the same communicator, the messages (or at least the part required for the matching) are delivered in order. This makes both execution traces you sent with your original email (shared memory and TCP) valid from the MPI perspective.
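To make that pairwise guarantee concrete, here is a minimal sketch (the ranks, tag, and values are my own illustration, not from your code):

    /* Two sends from rank 1 to rank 0 on the same communicator and tag
       must be matched in the order they were sent. */
    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        int rank;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        if (rank == 1) {
            int a = 1, b = 2;
            MPI_Send(&a, 1, MPI_INT, 0, 0, MPI_COMM_WORLD);   /* sent first  */
            MPI_Send(&b, 1, MPI_INT, 0, 0, MPI_COMM_WORLD);   /* sent second */
        } else if (rank == 0) {
            int x, y;
            MPI_Recv(&x, 1, MPI_INT, 1, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            MPI_Recv(&y, 1, MPI_INT, 1, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            printf("x=%d y=%d\n", x, y);   /* guaranteed: x=1, y=2 */
            /* A message from any *other* rank carries no ordering
               guarantee relative to these two. */
        }
        MPI_Finalize();
        return 0;
    }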

Moreover, MPI doesn't impose any order on the matching when ANY_SOURCE is used. In Open MPI we do the matching _ALWAYS_ starting from rank 0 to n in the specified communicator. BEWARE: the remainder of this paragraph is deep black magic of an MPI implementation's internals. The main difference between the behavior of SM and TCP here directly reflects their eager sizes: 4K for SM and 64K for TCP. Therefore, in your example, over TCP all your messages are eager messages (i.e., they are completely transferred to the destination process in one go), while over SM they all require a rendezvous. This directly impacts the ordering of the messages on the receiver, and therefore the order of the matching. However, I have to insist on this: this behavior is correct based on the MPI standard specifications.
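If you want to experiment with the protocol switch-over, the eager limits are exposed as MCA parameters (names and syntax from the Open MPI 1.x series; check ompi_info for your build, and note that "./your_app" is just a placeholder):

    # Inspect the current eager limits:
    ompi_info --param btl sm  | grep eager_limit
    ompi_info --param btl tcp | grep eager_limit

    # For instance, raise the shared-memory eager limit to match TCP's 64K:
    mpirun --mca btl_sm_eager_limit 65536 -np 5 ./your_app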

I'm going to try a technical explanation of what's going on inside OMPI, and then offer some words of advice to Mark.

First, the technical explanation. As George says, what's going on is legal. The "servers" all queue up sends to the "compositor". These are long, rendezvous sends (at least when they're on-node). So, none of these sends completes immediately. The compositor looks for an incoming message. It gets the header of the message and sends back an acknowledgement that the rest of the message can be sent. The "server" gets the acknowledgement and starts sending more of the message. The compositor, in order to get to the remainder of that message, keeps draining all the other traffic the servers are sending it. Once the first message is completely received, the compositor looks for the next message to process and happens to pick up the first server again. It won't go to anyone else until server 1 is exhausted. Legal, but from Mark's point of view not desirable. The compositor is busy all the time; Mark just wants it to employ a different order.
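For the curious, here is a stripped-down sketch of that pattern (the sizes and counts are invented; only the shape of the communication matters; run with at least two ranks):

    #include <mpi.h>
    #include <stdlib.h>

    #define MSG_COUNT 16
    #define MSG_SIZE  (1 << 20)   /* 1 MB: forces the rendezvous path */

    int main(int argc, char **argv)
    {
        int rank, nprocs;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &nprocs);
        char *buf = calloc(1, MSG_SIZE);

        if (rank == 0) {                       /* the compositor */
            MPI_Status st;
            for (int i = 0; i < MSG_COUNT * (nprocs - 1); i++)
                /* Which pending send this matches is up to the library;
                   with rendezvous traffic it may keep picking one server. */
                MPI_Recv(buf, MSG_SIZE, MPI_CHAR, MPI_ANY_SOURCE, 0,
                         MPI_COMM_WORLD, &st);
        } else {                               /* a server */
            for (int i = 0; i < MSG_COUNT; i++)
                MPI_Send(buf, MSG_SIZE, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
        }
        free(buf);
        MPI_Finalize();
        return 0;
    }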

The receives are "serialized". Of course they must be, since the receiver is a single process. But Mark's performance issue is that the servers aren't being serviced equally. So, they back up while one server unfairly gets all the attention.

Mark, your test code has a set of buffers it cycles through on each server. Could you do something similar on the compositor side? Have a set of receive resources for each server, as in the sketch below. If you want the compositor to service all servers equally/fairly, you're going to have to prescribe this behavior in your MPI code; the MPI implementation can't be relied on to do it for you.
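Here's a sketch of what I mean (not your actual code; the buffer sizes and counts are placeholders): the compositor pre-posts one receive per server, so every server's rendezvous can make progress concurrently, and the compositor decides the servicing order instead of the matching engine.

    #include <mpi.h>
    #include <stdlib.h>

    #define MSG_COUNT 16              /* messages per server: placeholder */
    #define MSG_SIZE  (1 << 20)       /* 1 MB, above any eager limit */

    int main(int argc, char **argv)
    {
        int rank, nprocs;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

        if (rank == 0) {                       /* compositor */
            int nservers = nprocs - 1;
            char **bufs = malloc(nservers * sizeof(char *));
            MPI_Request *reqs = malloc(nservers * sizeof(MPI_Request));
            int *recvd = calloc(nservers, sizeof(int));

            /* One dedicated buffer and one pending receive per server. */
            for (int s = 0; s < nservers; s++) {
                bufs[s] = malloc(MSG_SIZE);
                MPI_Irecv(bufs[s], MSG_SIZE, MPI_CHAR, s + 1, 0,
                          MPI_COMM_WORLD, &reqs[s]);
            }

            for (int done = 0; done < nservers * MSG_COUNT; done++) {
                int s;
                MPI_Status st;
                MPI_Waitany(nservers, reqs, &s, &st);
                /* ... composite bufs[s] here ... */
                if (++recvd[s] < MSG_COUNT)    /* re-arm this server's slot */
                    MPI_Irecv(bufs[s], MSG_SIZE, MPI_CHAR, s + 1, 0,
                              MPI_COMM_WORLD, &reqs[s]);
            }
        } else {                               /* server */
            char *buf = calloc(1, MSG_SIZE);
            for (int i = 0; i < MSG_COUNT; i++)
                MPI_Send(buf, MSG_SIZE, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
            free(buf);
        }
        MPI_Finalize();
        return 0;
    }

MPI_Waitany returns *some* completed request; if you want strict round-robin fairness, you could instead walk the request array with MPI_Test yourself.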

If this doesn't make sense, let me know and I'll try to sketch it out more explicitly.
