Waitall was not returning for the mundane reason that not all messages
sent were received. I'm not sure why the dump command seemed to say
there was nothing waiting. Ironically, the bug would never appear in
production, only in testing.
I fixed up my shutdown logic and all seems well now.
Ross
On 4/10/2014 1:06 PM, Ross Boylan wrote:
On 4/10/2014 11:48 AM, Ross Boylan wrote:
On 4/9/2014 5:26 PM, Ross Boylan wrote:
On Fri, 2014-04-04 at 22:40 -0400, George Bosilca wrote:
Ross,
I’m not familiar with the R implementation you are using, but bear
with me and I will explain how you can ask Open MPI for the list
of all pending requests on a process. Disclosure: this is Open MPI
deep voodoo, an extreme way to debug applications, but it might save
you quite some time.
The only thing you need is the communicator you posted your
requests into, or at least a pointer to it. Then you attach to your
process (or processes) with your preferred debugger and call
mca_pml_ob1_dump(struct ompi_communicator_t* comm, int verbose)
With gdb this should look like “call mca_pml_ob1_dump(my_comm, 1)”.
This will dump human readable information about all the requests
pending on a communicator (both sends and receives).
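(For reference, a minimal session following those steps might look like the sketch below; the PID and the communicator variable name `my_comm` are placeholders for whatever your process and program actually use:)

```
$ gdb -p <pid-of-mpi-process>            # attach to the running rank
(gdb) call mca_pml_ob1_dump(my_comm, 1)  # my_comm: a struct ompi_communicator_t*
(gdb) detach
(gdb) quit
```

Attaching pauses the process, so the dump is a consistent snapshot of the pending sends and receives at that moment.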
Thank you so much for the tip. After inserting a barrier failed to
help, I managed to reproduce the problem with all ranks on one node.
I see
BTL SM 0x7fe9970ae660 endpoint 0x1f13470 [smp_rank 5] [peer_rank 0]
....
BTL SM 0x7fe9970ae660 endpoint 0x20eebb0 [smp_rank 5] [peer_rank 12]
which, if my previous theory about mca_pml_ob1_dump is correct, means
there are no outstanding requests, since no items are listed under
the BTL lines.
This again has me wondering whether requests can be completed without
some kind of Wait or Test call.
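(For what it's worth, the one standard path I know of that retires a request handle without a Wait or Test is MPI_Request_free: the underlying operation still completes in the background, but the request no longer shows up as pending. A hedged C sketch, with illustrative identifiers, that needs an MPI installation to compile:)

```
/* Sketch only: after MPI_Request_free, no Wait/Test is ever issued on
 * req, yet the send still completes behind the scenes, so a later dump
 * of the communicator would show nothing outstanding for it. */
#include <mpi.h>

void fire_and_forget(const int *buf, int count, int dest, MPI_Comm comm)
{
    MPI_Request req;
    MPI_Isend(buf, count, MPI_INT, dest, /* tag = */ 0, comm, &req);
    MPI_Request_free(&req);  /* request handle released without Wait/Test */
}
```

Whether the R bindings ever do this on your behalf is another question, but it is one way a request can vanish without an explicit Wait.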
Sometimes the system runs to completion; the trigger seems to be
having some ranks finish quickly because there are more worker
processes than there is work for them to do.