Thanks, that at least explains what is going on. Because I have an
unbalanced workload (at least for now), I assume I'll need to poll. If
I replace the compositor loop with the following, it appears to prevent
the serialization/starvation and to service the servers equally. I can think
of edge cases where it isn't very efficient, so I'll explore different
options (perhaps, instead of looping, I can probe one rank higher and
increment on each receive).

Thanks again.

Here's the new output:
...
Sending buffer 3 from 3
Sending buffer 3 from 2
Sending buffer 4 from 1
Receiving buffer from 1, buffer = hello from 1 for the 0 time
 -- Probing for 2
 -- Found a message
Sending buffer 4 from 3
Sending buffer 4 from 2
Receiving buffer from 2, buffer = hello from 2 for the 0 time
 -- Probing for 3
 -- Found a message
Receiving buffer from 3, buffer = hello from 3 for the 0 time
 -- Probing for 1
 -- Found a message
Sending buffer 5 from 1
Receiving buffer from 1, buffer = hello from 1 for the 1 time
 -- Probing for 2
 -- Found a message
Sending buffer 5 from 2
Sending buffer 5 from 3
Receiving buffer from 2, buffer = hello from 2 for the 1 time
 -- Probing for 3
 -- Found a message
Receiving buffer from 3, buffer = hello from 3 for the 1 time
...
and the replacement code:

     int last = 0;

     for (i = 0; i < LOOPS * ( size - 1 ); i++)
     {
        int which_source, which_tag, flag;

        /* Block until some message is available. */
        MPI_Probe( MPI_ANY_SOURCE, MPI_ANY_TAG, comp_comm, &status );
        which_source = status.MPI_SOURCE;
        which_tag = status.MPI_TAG;

        /* If the match would service the same rank (or an earlier one)
           again, poll the other servers round-robin, starting just past
           the rank serviced last, and take the first pending message. */
        if ( which_source <= last )
        {
           MPI_Status probe_status;

           for (j = 0; j < size - 1; j++)
           {
              int probe_id = ( ( last + j ) % ( size - 1 ) ) + 1;

              printf( " -- Probing for %d\n", probe_id );

              MPI_Iprobe( probe_id, MPI_ANY_TAG, comp_comm, &flag,
                          &probe_status );
              if ( flag )
              {
                 printf( " -- Found a message\n" );
                 which_source = probe_status.MPI_SOURCE;
                 which_tag = probe_status.MPI_TAG;
                 break;
              }
           }
        }

        printf( "Receiving buffer from %d, buffer = ", which_source );
        MPI_Recv( buffer, BUFLEN, MPI_CHAR, which_source, which_tag,
                  comp_comm, &status );
        printf( "%s\n", buffer );
        last = which_source;
     }
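
For completeness, the "probe one higher and increment on each receive"
variant I mentioned above would look something like the untested sketch
below. It assumes the same comp_comm, size, LOOPS, BUFLEN, buffer, and
status as above, and it only works if every server really does send LOOPS
messages, since the strict rotation blocks on each rank in turn:

     int next = 1;

     for (i = 0; i < LOOPS * ( size - 1 ); i++)
     {
        /* Wait specifically for the server whose turn it is, so no rank
           can be serviced twice before the others are serviced once. */
        MPI_Probe( next, MPI_ANY_TAG, comp_comm, &status );

        printf( "Receiving buffer from %d, buffer = ", next );
        MPI_Recv( buffer, BUFLEN, MPI_CHAR, next, status.MPI_TAG,
                  comp_comm, &status );
        printf( "%s\n", buffer );

        /* Advance to the next server, wrapping back around to rank 1. */
        next = ( next % ( size - 1 ) ) + 1;
     }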


Mark

On Fri, Jun 19, 2009 at 5:33 PM, Eugene Loh <eugene....@sun.com> wrote:

> George Bosilca wrote:
>
>> MPI does not impose any global order on the messages. The only
>> requirement is that between two peers on the same communicator the
>> messages (or at least the part required for the matching) are delivered
>> in order. This makes both execution traces you sent with your original
>> email (shared memory and TCP) valid from the MPI perspective.
>>
>> Moreover, MPI doesn't impose any order in the matching when ANY_SOURCE
>> is used. In Open MPI we do the matching _ALWAYS_ starting from rank 0
>> to n in the specified communicator. BEWARE: the remainder of this
>> paragraph is deep black magic of MPI implementation internals. The main
>> difference between the behavior of SM and TCP here directly reflects
>> their eager sizes, 4K for SM and 64K for TCP. Therefore, for your
>> example, for TCP all your messages are eager messages (i.e. they are
>> completely transferred to the destination process in just one go),
>> while for SM they all require a rendezvous. This directly impacts the
>> ordering of the messages on the receiver, and therefore the order of
>> the matching. However, I have to insist on this: this behavior is
>> correct based on the MPI standard specifications.
>>
>
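
(Side note: if I'm reading this right, those eager sizes should correspond
to the btl_sm_eager_limit and btl_tcp_eager_limit MCA parameters, so
presumably they can be inspected with ompi_info and adjusted with --mca on
the mpirun command line. I haven't tried that, though, and the exact
parameter names may differ between Open MPI versions.)
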
> I'm going to try a technical explanation of what's going on inside OMPI and
> then words of advice to Mark.
>
> First, the technical explanation.  As George says, what's going on is
> legal.  The "servers" all queue up sends to the "compositor".  These are
> long, rendezvous sends (at least when they're on-node).  So, none of these
> sends completes.  The compositor looks for an incoming message.  It gets
> the header of the message and sends back an acknowledgement that the rest of
> the message can be sent.  The "server" gets the acknowledgement and starts
> sending more of the message.  The compositor, in order to get to the
> remainder of the message, keeps draining all the other stuff servers are
> sending it.  Once the first message is completely received, the compositor
> looks for the next message to process and happens to pick up the first
> server again.  It won't go to anyone else until server 1 is exhausted.
>  Legal, but from Mark's point of view not desirable.  The compositor is busy
> all the time.  Mark just wants it to employ a different order.
>
> The receives are "serialized".  Of course they must be since the receiver
> is a single process.  But Mark's performance issue is that the servers
> aren't being serviced equally.  So, they back up while one server unfairly gets
> all the attention.
>
> Mark, your test code has a set of buffers it cycles through on each server.
>  Could you do something similar on the compositor side?  Have a set of
> resources for each server.  If you want the compositor to service all
> servers equally/fairly, you're going to have to prescribe this behavior in
> your MPI code.  The MPI implementation can't be relied on to do this for
> you.
>
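
If I follow the suggestion, something like the sketch below may be what you
have in mind: one pre-posted receive (and buffer) per server, serviced in a
fixed rotation so that no rank can monopolize the compositor. It's untested
and assumes the same comp_comm, size, LOOPS, BUFLEN, status, and i as in my
loop above, plus a MAX_SERVERS bound I made up just to size the arrays:

     /* Untested sketch: one outstanding receive per server.  MAX_SERVERS
        is a made-up compile-time bound; it just has to be >= size. */
     char server_buf[MAX_SERVERS][BUFLEN];
     MPI_Request req[MAX_SERVERS];
     int s;

     /* Pre-post one receive per server (ranks 1 .. size-1). */
     for (s = 1; s < size; s++)
        MPI_Irecv( server_buf[s], BUFLEN, MPI_CHAR, s, MPI_ANY_TAG,
                   comp_comm, &req[s] );

     for (i = 0; i < LOOPS; i++)
     {
        for (s = 1; s < size; s++)
        {
           /* Service the servers in a fixed rotation: wait for the one
              whose turn it is, then re-post its receive if more messages
              are still expected from it. */
           MPI_Wait( &req[s], &status );
           printf( "Receiving buffer from %d, buffer = %s\n", s,
                   server_buf[s] );
           if ( i + 1 < LOOPS )
              MPI_Irecv( server_buf[s], BUFLEN, MPI_CHAR, s, MPI_ANY_TAG,
                         comp_comm, &req[s] );
        }
     }
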
> If this doesn't make sense, let me know and I'll try to sketch it out more
> explicitly.
>
