Dennis Luxen wrote:
In MPI, you must complete every MPI_Isend with an MPI_Wait on the request
handle (or a variant such as MPI_Waitall, or an MPI_Test whose completion
flag comes back true). An uncompleted MPI_Isend leaves resources tied up.
Good point, but that doesn't seem to help. I augmented each MPI_Isend
with an MPI_Wait.
What does that mean, exactly? Did you immediately follow each Isend with
a Wait? That is equivalent to replacing each Isend with a Send.
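To make the question concrete, here is the pattern I am asking about (a
minimal sketch, not your actual code; the single-int payload and tag 0
are assumptions):

    #include <mpi.h>

    /* A non-blocking send completed immediately by a wait ... */
    void send_then_wait(int *buf, int dest, MPI_Comm comm)
    {
        MPI_Request req;
        MPI_Isend(buf, 1, MPI_INT, dest, 0, comm, &req);
        MPI_Wait(&req, MPI_STATUS_IGNORE);  /* blocks until the send completes */
    }

    /* ... is semantically equivalent to a plain blocking send. */
    void blocking_send(int *buf, int dest, MPI_Comm comm)
    {
        MPI_Send(buf, 1, MPI_INT, dest, 0, comm);
    }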
In your original message, you said each process started by sending a
100K request. If that's the case, and you use blocking sends (or Isends
each immediately followed by a Wait), you are not guaranteed progress.
E.g., consider the last example in
http://www.mpi-forum.org/docs/mpi-11-html/node41.html#Node41 . But your
example code sends only single-int requests, so this shouldn't be an
issue for your sample code.
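Concretely, the pattern that is not guaranteed to make progress is the
head-to-head exchange (my sketch of the standard example; the 100K-int
payload is made up, and exactly two ranks are assumed):

    #include <mpi.h>

    #define N (100 * 1024)

    int main(int argc, char **argv)
    {
        static int sendbuf[N], recvbuf[N];
        int rank;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        int peer = 1 - rank;  /* assumes exactly two ranks */

        /* Both ranks send first.  If the messages are too large to be
           buffered eagerly, each MPI_Send blocks waiting for a receive
           the peer never posts: deadlock.  Single-int messages typically
           go out eagerly, which is why small payloads hide the problem. */
        MPI_Send(sendbuf, N, MPI_INT, peer, 0, MPI_COMM_WORLD);
        MPI_Recv(recvbuf, N, MPI_INT, peer, 0, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);

        MPI_Finalize();
        return 0;
    }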
Anyhow, I ran your sample code and it hung. Then I replaced Isends with
Sends and it ran. So, at that level, I am as yet unable to reproduce
your problem.
Now, after a number of messages, one process hangs in MPI_Wait while the
other keeps MPI_Iprobe'ing for new messages to receive.
I do not know what symptom to expect from Open MPI with this particular
application error, but the one you describe is plausible.
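For reference, I picture the receive side as a polling loop roughly like
this (my own sketch, not your actual code; the MPI_INT payload and the
wildcard source/tag are assumptions):

    #include <mpi.h>
    #include <stdlib.h>

    /* Poll once for an incoming message; receive it if one is pending. */
    static void poll_and_receive(MPI_Comm comm)
    {
        int flag = 0;
        MPI_Status status;

        MPI_Iprobe(MPI_ANY_SOURCE, MPI_ANY_TAG, comm, &flag, &status);
        if (flag) {
            int count;
            MPI_Get_count(&status, MPI_INT, &count);
            int *buf = malloc(count * sizeof *buf);
            MPI_Recv(buf, count, MPI_INT, status.MPI_SOURCE, status.MPI_TAG,
                     comm, MPI_STATUS_IGNORE);
            /* ... handle the message ... */
            free(buf);
        }
    }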
If, on the other hand, I start with the parameter "--mca btl tcp,self",
the processes finish communicating just fine. I am not exactly sure why
this flag helps.
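In case it helps anyone reproduce this: if I read the Open MPI docs
right, that parameter restricts the byte-transfer layers to tcp and self
(loopback), disabling e.g. shared memory. The launch line would be
something like the following (the program name is a placeholder):

    mpirun --mca btl tcp,self -np 2 ./your_program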