On Jan 28, 2010, at 2:23 PM, Laurence Marks wrote:

> > If one process dies prematurely in Open MPI (i.e., before MPI_Finalize),
> > all the others should be automatically killed.
> 
> This does not seem to be happening. Part of the problem may be (and I
> am out of my depth here) that the fortran rtl library (ifort) does not
> appear to be dying prematurely, at least there is no signal that I can
> catch (I'm not a good c programmer).

Ahh.  That would be a problem.  If the process doesn't die, then Open MPI has 
no way to know that it is hung, and therefore any other MPI processes that are 
waiting for messages (or whatever) from the hung process will eventually block 
waiting for input that will never come.  End result: the entire MPI job hangs.
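To make the distinction concrete, here is a small non-MPI sketch (plain sockets standing in for MPI ranks, not Open MPI's actual transport): when a peer process *dies*, the reader finds out immediately because the connection closes; when the peer is merely *hung*, a blocking receive has no way to tell it apart from a slow sender. The timeout below exists only so the sketch terminates; a real blocking MPI_Recv would wait forever.

```python
import socket

# Case 1: the peer dies -> its end of the connection closes, and the
# blocked reader learns about it immediately (recv returns EOF).
a, b = socket.socketpair()
b.close()                      # stands in for a crashed MPI rank
assert a.recv(1) == b""        # reader sees the peer is gone at once
a.close()

# Case 2: the peer is alive but hung -> it never sends, and nothing
# closes, so the reader just blocks.  The timeout is only here so this
# sketch finishes; a blocking receive has no such escape hatch.
c, d = socket.socketpair()
c.settimeout(0.5)
try:
    c.recv(1)
    outcome = "message arrived"
except socket.timeout:
    outcome = "no way to tell a hung peer from a slow one"
print(outcome)
c.close()
d.close()
```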

Can you double check that this is actually what is happening?  I.e., that no 
process is actually exiting?  It would be good to confirm this (and reassure me 
that we don't have some corner case where an MPI process aborting early isn't 
terminating the entire job properly).  If you run your MPI job and see this 
error occur, run "ps" on all the nodes where the job is running and count the 
number of MPI processes you see.
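The "ps" check can be scripted if running it by hand on each node is tedious. A sketch of the per-node half (the process name "my_mpi_app" is a placeholder -- substitute your actual binary; to cover the whole cluster you would run this, or plain ps piped to grep, over ssh on every node the job was mapped to):

```python
import subprocess

# Hypothetical binary name -- replace with your MPI executable's name.
PATTERN = "my_mpi_app"

# List the command names of all processes on this node, then count the
# ones that belong to the MPI job.  Equivalent to eyeballing "ps" output.
out = subprocess.run(["ps", "-e", "-o", "comm="],
                     capture_output=True, text=True).stdout
count = sum(1 for line in out.splitlines() if PATTERN in line)
print(f"{count} matching MPI processes on this node")
```

If the counts across all nodes add up to the full size of the job, no rank has actually exited -- which is the hung-process scenario described above, not a crash Open MPI failed to clean up.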

> I posted to the Intel ifort site as well, and the response I got (see
> link below) is that "There is a feature request in to add this
> functionality, but it is not currently on the list for
> implementation."
> 
> http://software.intel.com/en-us/forums/showthread.php?t=71571&o=d&s=lr

Bummer!

I'm tangentially involved in Fortran/MPI stuff, but I'm not enough of a Fortran 
expert to know how to help here -- I understand that in your final production 
code, this problem likely won't occur.  But that doesn't help while you're 
writing / debugging the code itself (which is a huge amount of time and 
effort).  

-- 
Jeff Squyres
jsquy...@cisco.com

