Hello, I am trying to confirm that I am using OpenMPI in a correct way. I seem to be losing messages but I don't like to assume there's a bug when I'm still new to MPI in general.
I have multiple processes in a master / slaves type setup, and I am trying to have multiple persistent non-blocking message requests between them to prevent starvation. (Tech detail: 4-core Intel running Ubuntu 64-bit and OpenMPI 1.4. Everything is local. Total processes is 5. One master, four slaves. The problem only surfaces on the slowest slave - the one with the most work.)

The setup is like this:

Master:
  Create 3 persistent send requests, with three different buffers (in a 2D array)
  Load data into each buffer
  Start each send request
  In a loop:
    TestSome on the 3 sends
    for each send that's completed:
      load new data into the buffer
      restart that send
  loop

Slave:
  Create 3 persistent receive requests, with three different buffers (in a 2D array)
  Start each receive request
  In a loop:
    WaitAny on the 3 receives
    Consume data from the one receive buffer from WaitAny
    Start that receive again
  loop

Basically what I'm seeing is that the master gets a "completed" send request from TestSome and loads new data, restarts, etc. but the slave never sees that particular message. I was under the impression that WaitAny should return only one message but also should eventually return every message sent in this situation.

I am operating under the assumption that even if the send request is completed and the buffer overwritten in the master, the receive for that message eventually occurs with the correct data in the slave. I did not think I had to advise the master that the slave was finished reading data out of the receive buffer before the master could reuse the send buffer.

What it LOOKS like to me is that WaitAny is marking more than one send completed, so the master sends the next message, but I can't see it in the slave.

I hope this is making sense. Any input on whether I'm doing this wrong or a way to see if the message is really being lost would be helpful.
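One check I could add to see whether a message is really lost is sequence-number bookkeeping. The master would stamp each message with a running sequence number, and the slave would record the number from whichever buffer WaitAny returns. With three receives outstanding, the slave may not consume buffers in exactly the order the messages were sent, so the sketch below tracks a set of seen numbers rather than expecting them in order. It is plain C with no MPI calls. The names SeqTracker, seq_tracker_init, seq_tracker_record, seq_tracker_missing and the MAX_MESSAGES bound are all hypothetical and not part of my actual program or of MPI.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

#define MAX_MESSAGES 1024  /* hypothetical upper bound on messages per slave */

/* Records which sequence numbers a receiver has consumed so far. */
typedef struct {
    bool   seen[MAX_MESSAGES];
    size_t received;    /* distinct sequence numbers recorded */
    size_t duplicates;  /* sequence numbers recorded more than once */
} SeqTracker;

void seq_tracker_init(SeqTracker *t) {
    memset(t, 0, sizeof *t);
}

/* Call once per completed receive, with the sequence number read from the
 * buffer that WaitAny reported. Returns false if the number is out of range
 * or was already recorded (a duplicate), true otherwise. */
bool seq_tracker_record(SeqTracker *t, long seq) {
    if (seq < 0 || seq >= MAX_MESSAGES) {
        return false;
    }
    if (t->seen[seq]) {
        t->duplicates++;
        return false;
    }
    t->seen[seq] = true;
    t->received++;
    return true;
}

/* Scans sequence numbers 0 .. expected-1 and writes up to max_out of the
 * missing ones into out. Returns the total number missing, which may be
 * larger than max_out. */
size_t seq_tracker_missing(const SeqTracker *t, size_t expected,
                           long *out, size_t max_out) {
    size_t missing = 0;
    if (expected > MAX_MESSAGES) {
        expected = MAX_MESSAGES;
    }
    for (size_t s = 0; s < expected; s++) {
        if (!t->seen[s]) {
            if (missing < max_out) {
                out[missing] = (long)s;
            }
            missing++;
        }
    }
    return missing;
}
```

In the slave loop this would mean calling seq_tracker_record right after WaitAny returns and before restarting that receive. Once the master reports how many messages it sent to that slave, seq_tracker_missing lists any that never arrived. A non-zero duplicates count would instead suggest that a receive buffer is being read twice.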
If there's good example code for multiple simultaneous asynchronous messages to avoid starvation that is set up better than my approach, I'd like to see it.

Thanks!
Corey