Hello,

I am trying to confirm that I am using OpenMPI correctly. I seem to
be losing messages, but I don't like to assume there's a bug when I'm
still new to MPI in general.

I have multiple processes in a master/slave setup, and I am trying to
keep multiple persistent non-blocking message requests outstanding
between them to prevent starvation. (Tech detail: 4-core Intel running
Ubuntu 64-bit and OpenMPI 1.4. Everything is local. There are 5
processes in total: one master, four slaves. The problem only surfaces
on the slowest slave, the one with the most work.)

The setup is like this:

Master:

Create 3 persistent send requests, with three different buffers (in a 2D array)
Load data into each buffer
Start each send request
In a loop:
  TestSome on the 3 sends
  for each send that's completed
    load new data into the buffer
    restart that send
loop

Slave:

Create 3 persistent receive requests, with three different buffers (in a 2D array)
Start each receive request
In a loop:
  WaitAny on the 3 receives
  Consume data from the one receive buffer from WaitAny
  Start that receive again
loop

Basically what I'm seeing is that the master gets a "completed" send
request from TestSome and loads new data, restarts, etc. but the slave
never sees that particular message. I was under the impression that
WaitAny should return only one message but also should eventually
return every message sent in this situation.

I am operating under the assumption that even if the send request is
completed and the buffer overwritten in the master, the receive for
that message eventually occurs with the correct data in the slave. I
did not think I had to advise the master that the slave was finished
reading data out of the receive buffer before the master could reuse
the send buffer.

What it LOOKS like to me is that WaitAny is marking more than one
message completed, so the master sends the next message, but I can't
see it in the slave.

I hope this is making sense. Any input on whether I'm doing this wrong
or a way to see if the message is really being lost would be helpful.
If there's a good example code of multiple simultaneous asynchronous
messages to avoid starvation that is set up better than my approach,
I'd like to see it.
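
One way to check whether a message is really lost, sketched here under
assumptions: the master stamps each message with a sequence number, and
the slave records every number it consumes and reports the gaps at the
end. Because WaitAny can hand back completed receives in any order, the
sketch records what was seen rather than expecting strictly increasing
numbers. The names seq_log, seq_log_record, seq_log_missing and the
MAX_MSGS bound are hypothetical.

```c
#include <stdio.h>
#include <string.h>

#define MAX_MSGS 1024   /* hypothetical upper bound on sequence numbers */

typedef struct {
    unsigned char seen[MAX_MSGS];  /* 1 if that sequence number was consumed */
    int duplicates;                /* how many numbers arrived more than once */
} seq_log;

static void seq_log_init(seq_log *log)
{
    memset(log, 0, sizeof *log);
}

/* Record one consumed sequence number.
 * Returns 0 on first sighting, 1 for a duplicate, -1 if out of range. */
static int seq_log_record(seq_log *log, int seq)
{
    if (seq < 0 || seq >= MAX_MSGS)
        return -1;
    if (log->seen[seq]) {
        log->duplicates++;
        return 1;
    }
    log->seen[seq] = 1;
    return 0;
}

/* Report every number in [0, total) that was never recorded.
 * Returns how many are missing. */
static int seq_log_missing(const seq_log *log, int total)
{
    int missing = 0;
    for (int s = 0; s < total && s < MAX_MSGS; s++) {
        if (!log->seen[s]) {
            fprintf(stderr, "missing message %d\n", s);
            missing++;
        }
    }
    return missing;
}
```

The slave would call seq_log_record on the stamp of each buffer returned
by WaitAny, and seq_log_missing once the master says how many messages
it sent. A duplicate count above zero would suggest a buffer being read
twice rather than a message being dropped.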

Thanks!

Corey
