Ross,

I’m not familiar with the R implementation you are using, but bear with me and I will explain how you can ask Open MPI about the list of all pending requests on a process. Disclosure: This is Open MPI deep voodoo, an extreme way to debug applications that might save you quite some time.

The only thing you need is the communicator you posted your requests into, or at least a pointer to it. Then you attach to your process (or processes) with your preferred debugger and call

  mca_pml_ob1_dump(struct ompi_communicator_t* comm, int verbose)

With gdb this should look like “call mca_pml_ob1_dump(my_comm, 1)”. This will dump human readable information about all the requests pending on a communicator (both sends and receives).

If you are right, all processes will report NONE, and the bug is somewhere in-between your application and the MPI library. Otherwise, you might have some not-yet-completed requests pending…

  George.

On Apr 4, 2014, at 22:20, Ross Boylan <r...@biostat.ucsf.edu> wrote:

> On 4/4/2014 6:01 PM, Ralph Castain wrote:
>> It sounds like you don't have a balance between sends and recvs somewhere -
>> i.e., some apps send messages, but the intended recipient isn't issuing a
>> recv and waiting until the message has been received before exiting. If the
>> recipient leaves before the isend completes, then the isend will never
>> complete and the waitall will not return.
> I'm pretty sure the sends complete because I wait on something that can only
> be computed after the sends complete, and I know I have that result.
>
> My current theory is that my modifications to Rmpi are not properly tracking
> all completed messages, resulting in it thinking there are outstanding
> messages (and passing a positive count to the C-level MPI_Waitall with
> associated garbagey arrays). But I haven't isolated the problem.
>
> Ross
>>
>>
>> On Apr 4, 2014, at 5:20 PM, Ross Boylan <r...@biostat.ucsf.edu> wrote:
>>
>>> During shutdown of my application the processes issue a waitall, since they
>>> have done some Isends. A couple of them never return from that call.
>>>
>>> Could this be the result of some of the processes already being shutdown
>>> (the processes with the problem were late in the shutdown sequence)? If
>>> so, what is the recommended solution? A barrier?
>>>
>>> The shutdown proceeds in stages, but the processes in question are not told
>>> to shutdown until all the messages they have sent have been received. So
>>> there shouldn't be any outstanding messages from them.
>>>
>>> My reading of the manual is that Waitall with a count of 0 should return
>>> immediately, not hang. Is that correct?
>>>
>>> Running under R with openmpi 1.7.4.
>>> _______________________________________________
>>> users mailing list
>>> us...@open-mpi.org
>>> http://www.open-mpi.org/mailman/listinfo.cgi/users
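
P.S. On the bookkeeping theory in the quoted message: if the dump reports nothing pending, the count and array handed to MPI_Waitall are the next thing to check. Below is a minimal sketch of one way to keep a request table consistent, written in plain C with no MPI calls so it can be tested on its own. Everything in it is an assumption for illustration: `tracked_req`, its fields, and `compact_pending` are hypothetical names, and I do not know how Rmpi actually stores its requests. In real code the `completed` flag would correspond to a slot whose request has already finished (for example one that a wait or test call has set to MPI_REQUEST_NULL).

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for one tracked nonblocking send. In a real
 * wrapper this slot would hold an MPI_Request; here it is a plain struct
 * so the bookkeeping can be exercised without an MPI installation. */
typedef struct {
    int  id;        /* illustrative handle value */
    bool completed; /* true once the request is known to have finished */
} tracked_req;

/* Move the still-pending entries to the front of reqs[0..n), preserving
 * their order, and return how many there are. The returned value is the
 * count that should accompany the array in a later wait-for-all call, so
 * the call never sees stale or uninitialized slots. */
size_t compact_pending(tracked_req *reqs, size_t n)
{
    assert(reqs != NULL || n == 0);

    size_t live = 0;
    for (size_t i = 0; i < n; i++) {
        if (!reqs[i].completed) {
            reqs[live++] = reqs[i];
        }
    }
    return live;
}
```

The point of the sketch is only that the count and the array contents are derived from the same pass over the table, so they cannot disagree. If the count is tracked separately from the array, a positive count paired with garbage entries becomes possible, which is the situation described in the quoted theory.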