George Bosilca wrote:

MPI does not impose any global order on messages. The only requirement is that, between two peers on the same communicator, the messages (or at least the part required for the matching) are delivered in order. This makes both execution traces you sent with your original email (shared memory and TCP) valid from the MPI perspective.
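To make that pairwise guarantee concrete, here is a minimal sketch (the ranks, tag, and values are my own illustration, not from your code):

    /* Two sends from rank 1 to rank 0 on the same communicator and tag
       must be matched in the order they were sent. */
    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        int rank;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        if (rank == 1) {
            int a = 1, b = 2;
            MPI_Send(&a, 1, MPI_INT, 0, 0, MPI_COMM_WORLD);   /* sent first  */
            MPI_Send(&b, 1, MPI_INT, 0, 0, MPI_COMM_WORLD);   /* sent second */
        } else if (rank == 0) {
            int x, y;
            MPI_Recv(&x, 1, MPI_INT, 1, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            MPI_Recv(&y, 1, MPI_INT, 1, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            printf("x=%d y=%d\n", x, y);   /* guaranteed: x=1, y=2 */
            /* A message from any *other* rank carries no ordering
               guarantee relative to these two. */
        }
        MPI_Finalize();
        return 0;
    }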

Moreover, MPI doesn't impose any order on the matching when ANY_SOURCE is used. In Open MPI we do the matching _ALWAYS_ starting from rank 0 to n in the specified communicator. BEWARE: the remainder of this paragraph is deep black magic of an MPI implementation's internals. The main difference between the behavior of SM and TCP here directly reflects their eager sizes: 4K for SM and 64K for TCP. Therefore, in your example, over TCP all your messages are eager messages (i.e., they are completely transferred to the destination process in one go), while over SM they all require a rendezvous. This directly impacts the ordering of the messages on the receiver, and therefore the order of the matching. However, I have to insist on this: this behavior is correct based on the MPI standard specifications.
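If you want to experiment with the protocol switch-over, the eager limits are exposed as MCA parameters (names and syntax from the Open MPI 1.x series; check ompi_info for your build, and note that "./your_app" is just a placeholder):

    # Inspect the current eager limits:
    ompi_info --param btl sm  | grep eager_limit
    ompi_info --param btl tcp | grep eager_limit

    # For instance, raise the shared-memory eager limit to match TCP's 64K:
    mpirun --mca btl_sm_eager_limit 65536 -np 5 ./your_app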

I'm going to try a technical explanation of what's going on inside OMPI, and then offer some words of advice to Mark.

First, the technical explanation. As George says, what's going on is legal. The "servers" all queue up sends to the "compositor". These are long, rendezvous sends (at least when they're on-node). So, none of these sends completes immediately. The compositor looks for an incoming message. It gets the header of the message and sends back an acknowledgement that the rest of the message can be sent. The "server" gets the acknowledgement and starts sending more of the message. The compositor, in order to get to the remainder of that message, keeps draining all the other traffic the servers are sending it. Once the first message is completely received, the compositor looks for the next message to process and happens to pick up the first server again. It won't go to anyone else until server 1 is exhausted. Legal, but from Mark's point of view not desirable. The compositor is busy all the time; Mark just wants it to employ a different order.
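For the curious, here is a stripped-down sketch of that pattern (the sizes and counts are invented; only the shape of the communication matters; run with at least two ranks):

    #include <mpi.h>
    #include <stdlib.h>

    #define MSG_COUNT 16
    #define MSG_SIZE  (1 << 20)   /* 1 MB: forces the rendezvous path */

    int main(int argc, char **argv)
    {
        int rank, nprocs;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &nprocs);
        char *buf = calloc(1, MSG_SIZE);

        if (rank == 0) {                       /* the compositor */
            MPI_Status st;
            for (int i = 0; i < MSG_COUNT * (nprocs - 1); i++)
                /* Which pending send this matches is up to the library;
                   with rendezvous traffic it may keep picking one server. */
                MPI_Recv(buf, MSG_SIZE, MPI_CHAR, MPI_ANY_SOURCE, 0,
                         MPI_COMM_WORLD, &st);
        } else {                               /* a server */
            for (int i = 0; i < MSG_COUNT; i++)
                MPI_Send(buf, MSG_SIZE, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
        }
        free(buf);
        MPI_Finalize();
        return 0;
    }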

The receives are "serialized". Of course they must be, since the receiver is a single process. But Mark's performance issue is that the servers aren't being serviced equally. So, they back up while one server unfairly gets all the attention.

Mark, your test code has a set of buffers it cycles through on each server. Could you do something similar on the compositor side? Have a set of receive resources for each server, as in the sketch below. If you want the compositor to service all servers equally/fairly, you're going to have to prescribe this behavior in your MPI code; the MPI implementation can't be relied on to do it for you.
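Here's a sketch of what I mean (not your actual code; the buffer sizes and counts are placeholders): the compositor pre-posts one receive per server, so every server's rendezvous can make progress concurrently, and the compositor decides the servicing order instead of the matching engine.

    #include <mpi.h>
    #include <stdlib.h>

    #define MSG_COUNT 16              /* messages per server: placeholder */
    #define MSG_SIZE  (1 << 20)       /* 1 MB, above any eager limit */

    int main(int argc, char **argv)
    {
        int rank, nprocs;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

        if (rank == 0) {                       /* compositor */
            int nservers = nprocs - 1;
            char **bufs = malloc(nservers * sizeof(char *));
            MPI_Request *reqs = malloc(nservers * sizeof(MPI_Request));
            int *recvd = calloc(nservers, sizeof(int));

            /* One dedicated buffer and one pending receive per server. */
            for (int s = 0; s < nservers; s++) {
                bufs[s] = malloc(MSG_SIZE);
                MPI_Irecv(bufs[s], MSG_SIZE, MPI_CHAR, s + 1, 0,
                          MPI_COMM_WORLD, &reqs[s]);
            }

            for (int done = 0; done < nservers * MSG_COUNT; done++) {
                int s;
                MPI_Status st;
                MPI_Waitany(nservers, reqs, &s, &st);
                /* ... composite bufs[s] here ... */
                if (++recvd[s] < MSG_COUNT)    /* re-arm this server's slot */
                    MPI_Irecv(bufs[s], MSG_SIZE, MPI_CHAR, s + 1, 0,
                              MPI_COMM_WORLD, &reqs[s]);
            }
        } else {                               /* server */
            char *buf = calloc(1, MSG_SIZE);
            for (int i = 0; i < MSG_COUNT; i++)
                MPI_Send(buf, MSG_SIZE, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
            free(buf);
        }
        MPI_Finalize();
        return 0;
    }

MPI_Waitany returns *some* completed request; if you want strict round-robin fairness, you could instead walk the request array with MPI_Test yourself.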

If this doesn't make sense, let me know and I'll try to sketch it out more explicitly.
