On 4/10/2014 11:48 AM, Ross Boylan wrote:
On 4/9/2014 5:26 PM, Ross Boylan wrote:
On Fri, 2014-04-04 at 22:40 -0400, George Bosilca wrote:
Ross,

I’m not familiar with the R implementation you are using, but bear with me and I will explain how you can ask Open MPI about the list of all pending requests on a process. Disclosure: This is Open MPI deep voodoo, an extreme way to debug applications that might save you quite some time.

The only thing you need is the communicator you posted your requests into, or at least a pointer to it. Then you attach to your process (or processes) with your preferred debugger and call
   mca_pml_ob1_dump(struct ompi_communicator_t* comm, int verbose)

With gdb this should look like “call mca_pml_ob1_dump(my_comm, 1)”. This will dump human readable information about all the requests pending on a communicator (both sends and receives).
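
As a rough sketch of the session described above, the steps might look like the following. The PID 12345 is a placeholder, and my_comm stands for whatever communicator variable (or pointer value) is visible in the attached process; neither comes from a real run. If gdb complains about an unknown return type, casting the call (for example to int, which is an assumption about the function's declaration) may help. The dump is written by the process itself, so it may show up in the job's output rather than in the gdb window.

```
$ gdb -p 12345
(gdb) call mca_pml_ob1_dump(my_comm, 1)
(gdb) detach
(gdb) quit
```

Repeating this on each rank of interest gives a per-process view of the pending sends and receives.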

Thank you so much for the tip.  After inserting a barrier failed to help,
I managed to reproduce the problem with all ranks on one node.  I see
BTL SM 0x7fe9970ae660 endpoint 0x1f13470 [smp_rank 5] [peer_rank 0]
....
BTL SM 0x7fe9970ae660 endpoint 0x20eebb0 [smp_rank 5] [peer_rank 12]
which, if my previous theory about the output of mca_pml_ob1_dump is correct, means there are no outstanding requests, since there are no items listed under the BTL lines.

This again has me wondering if requests can be closed without some kind of Wait or Test command.

Sometimes the system runs to completion; the trigger for the problem seems to be having some ranks that finish rapidly because there are more such processes than work for them to do.

