Thanks, that at least explains what is going on. Because I have an unbalanced workload (at least for now), I assume that I'll need to poll. If I replace the compositor loop with the replacement code shown below, it appears that I prevent the serialization/starvation and service the servers equally. I can think of edge cases where it isn't very efficient, so I'll explore different options (perhaps instead of looping over MPI_Iprobe I can probe one rank higher than the last sender and increment on each receive).
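As a rough, untested sketch, that incremental variant might look something
like this (it assumes the same comp_comm, buffer, BUFLEN, LOOPS, size, i,
and status variables as the replacement code further down):

int next = 1;
for (i = 0; i < LOOPS * ( size - 1 ); i++) {
    int flag;

    /* First see whether the next server in round-robin order already has
       a message waiting; if not, fall back to whoever has one ready. */
    MPI_Iprobe( next, MPI_ANY_TAG, comp_comm, &flag, &status );
    if ( !flag )
        MPI_Probe( MPI_ANY_SOURCE, MPI_ANY_TAG, comp_comm, &status );

    printf( "Receiving buffer from %d, buffer = ", status.MPI_SOURCE );
    MPI_Recv( buffer, BUFLEN, MPI_CHAR, status.MPI_SOURCE, status.MPI_TAG,
              comp_comm, &status );
    printf( "%s\n", buffer );

    /* Prefer the rank one higher than the server just serviced, wrapping
       back around to 1. */
    next = ( status.MPI_SOURCE % ( size - 1 ) ) + 1;
}

This still blocks in MPI_Probe when nothing has arrived yet, so it only
changes which server gets first pick, not the basic polling behavior.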
Thanks again. Here's the new output:

...
Sending buffer 3 from 3
Sending buffer 3 from 2
Sending buffer 4 from 1
Receiving buffer from 1, buffer = hello from 1 for the 0 time
 -- Probing for 2
 -- Found a message
Sending buffer 4 from 3
Sending buffer 4 from 2
Receiving buffer from 2, buffer = hello from 2 for the 0 time
 -- Probing for 3
 -- Found a message
Receiving buffer from 3, buffer = hello from 3 for the 0 time
 -- Probing for 1
 -- Found a message
Sending buffer 5 from 1
Receiving buffer from 1, buffer = hello from 1 for the 1 time
 -- Probing for 2
 -- Found a message
Sending buffer 5 from 2
Sending buffer 5 from 3
Receiving buffer from 2, buffer = hello from 2 for the 1 time
 -- Probing for 3
 -- Found a message
Receiving buffer from 3, buffer = hello from 3 for the 1 time
...

and the replacement code:

int last = 0;
for (i = 0; i < LOOPS * ( size - 1 ); i++) {
    int which_source, which_tag, flag;

    /* Block until some server has a message, and note who it is from. */
    MPI_Probe( MPI_ANY_SOURCE, MPI_ANY_TAG, comp_comm, &status );
    which_source = status.MPI_SOURCE;
    which_tag = status.MPI_TAG;

    /* If matching picked the same server (or an earlier one) again, scan
       the servers round-robin, starting after the last one serviced. */
    if ( which_source <= last ) {
        MPI_Status probe_status;
        for (j = 0; j < size - 1; j++) {
            int probe_id = ( ( last + j ) % ( size - 1 ) ) + 1;
            printf( " -- Probing for %d\n", probe_id );
            MPI_Iprobe( probe_id, MPI_ANY_TAG, comp_comm, &flag, &probe_status );
            if ( flag ) {
                printf( " -- Found a message\n" );
                which_source = probe_status.MPI_SOURCE;
                which_tag = probe_status.MPI_TAG;
                break;
            }
        }
    }

    printf( "Receiving buffer from %d, buffer = ", which_source );
    MPI_Recv( buffer, BUFLEN, MPI_CHAR, which_source, which_tag, comp_comm, &status );
    printf( "%s\n", buffer );
    last = which_source;
}

Mark

On Fri, Jun 19, 2009 at 5:33 PM, Eugene Loh <eugene....@sun.com> wrote:
> George Bosilca wrote:
>
>> MPI does not impose any global order on the messages. The only
>> requirement is that, between two peers on the same communicator, the
>> messages (or at least the part required for the matching) are delivered
>> in order. This makes both execution traces you sent with your original
>> email (shared memory and TCP) valid from the MPI perspective.
>>
>> Moreover, MPI doesn't impose any order on the matching when ANY_SOURCE
>> is used. In Open MPI we do the matching _ALWAYS_ starting from rank 0 to
>> n in the specified communicator. BEWARE: The remainder of this paragraph
>> is deep black magic of an MPI implementation's internals. The main
>> difference between the behavior of SM and TCP here directly reflects
>> their eager sizes, 4K for SM and 64K for TCP. Therefore, for your
>> example, for TCP all your messages are eager messages (i.e. they are
>> completely transferred to the destination process in just one go), while
>> for SM they all require a rendez-vous. This directly impacts the
>> ordering of the messages on the receiver, and therefore the order of the
>> matching. However, I have to insist on this: this behavior is correct
>> based on the MPI standard specifications.
>>
>
> I'm going to try a technical explanation of what's going on inside OMPI
> and then words of advice to Mark.
>
> First, the technical explanation. As George says, what's going on is
> legal. The "servers" all queue up sends to the "compositor". These are
> long, rendezvous sends (at least when they're on-node). So, none of these
> sends completes. The compositor looks for an incoming message. It gets
> the header of the message and sends back an acknowledgement that the rest
> of the message can be sent. The "server" gets the acknowledgement and
> starts sending more of the message.
> The compositor, in order to get to the remainder of the message, keeps
> draining all the other stuff the servers are sending it. Once the first
> message is completely received, the compositor looks for the next message
> to process and happens to pick up the first server again. It won't go to
> anyone else until server 1 is exhausted. Legal, but from Mark's point of
> view not desirable. The compositor is busy all the time. Mark just wants
> it to employ a different order.
>
> The receives are "serialized". Of course they must be, since the receiver
> is a single process. But Mark's performance issue is that the servers
> aren't being serviced equally. So, they back up while server 1 unfairly
> gets all the attention.
>
> Mark, your test code has a set of buffers it cycles through on each
> server. Could you do something similar on the compositor side? Have a set
> of resources for each server. If you want the compositor to service all
> servers equally/fairly, you're going to have to prescribe this behavior
> in your MPI code. The MPI implementation can't be relied on to do this
> for you.
>
> If this doesn't make sense, let me know and I'll try to sketch it out
> more explicitly.
>
> _______________________________________________
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users
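One way to read Eugene's "set of resources for each server" suggestion is
to pre-post a nonblocking receive per server on the compositor and service
whichever one completes, re-posting as you go. A rough, untested sketch
(MAX_SERVERS and bufs are made-up names; comp_comm, BUFLEN, LOOPS, and size
are assumed to be the same as in the code above):

#define MAX_SERVERS 16                  /* made-up upper bound on size - 1 */

char        bufs[MAX_SERVERS][BUFLEN];  /* one receive buffer per server */
MPI_Request reqs[MAX_SERVERS];
MPI_Status  status;
int         i, idx, received = 0;

/* Pre-post one receive per server so every server always has somewhere to
   deliver its next message; reqs[i - 1] belongs to rank i. */
for (i = 1; i < size; i++)
    MPI_Irecv( bufs[i - 1], BUFLEN, MPI_CHAR, i, MPI_ANY_TAG, comp_comm,
               &reqs[i - 1] );

while (received < LOOPS * ( size - 1 )) {
    /* Take whichever pre-posted receive has completed... */
    MPI_Waitany( size - 1, reqs, &idx, &status );
    printf( "Receiving buffer from %d, buffer = %s\n",
            status.MPI_SOURCE, bufs[idx] );

    /* ...and immediately re-post a receive for that server. */
    MPI_Irecv( bufs[idx], BUFLEN, MPI_CHAR, idx + 1, MPI_ANY_TAG, comp_comm,
               &reqs[idx] );
    received++;
}

A real compositor would also have to deal with the receives still
outstanding when the loop ends, but the point is that each server now has
its own matching resources, so no single server can monopolize the
compositor.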