On Fri, Jan 29, 2010 at 6:59 AM, Jeff Squyres <jsquy...@cisco.com> wrote:
> On Jan 28, 2010, at 2:23 PM, Laurence Marks wrote:
>
>> > If one process dies prematurely in Open MPI (i.e., before MPI_Finalize),
>> > all the others should be automatically killed.
>>
>> This does not seem to be happening. Part of the problem may be (and I
>> am out of my depth here) that the Fortran run-time library (ifort) does
>> not appear to be dying prematurely; at least there is no signal that I
>> can catch (I'm not a good C programmer).
>
> Ahh.  That would be a problem.  If the process doesn't die, then Open MPI has 
> no way to know that it is hung, and therefore any other MPI processes that 
> are waiting for messages (or whatever) from the hung process will eventually 
> block waiting for input that will never come.  End result: the entire MPI job 
> hangs.
>
> Can you double check that this is actually what is happening, i.e., that no
> process is actually exiting?  It would be good to confirm this (and it would
> make me feel better that we don't have some corner case where an MPI process
> aborting early isn't terminating the entire job properly).  If you run your
> MPI job and see this error occur, run "ps" on all the nodes where the job is
> running and count the number of MPI processes that you see.
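
The "ps" check suggested above could be scripted roughly as follows. This is
only a sketch under stated assumptions, not anything from Open MPI itself: the
helper names (count_processes, count_on_host) are hypothetical, and it assumes
passwordless ssh to each node and a POSIX "ps" that accepts "-e -o comm=".
Linux typically truncates the command name to 15 characters, so the match is
done on that prefix, which may overcount if two executables share it.

```python
import subprocess


def count_processes(ps_output, exe_name):
    """Count lines of `ps -e -o comm=` output that match exe_name.

    Both sides are cut to 15 characters because Linux commonly
    truncates the command name reported by ps (an assumption here).
    """
    target = exe_name[:15]
    return sum(
        1 for line in ps_output.splitlines() if line.strip()[:15] == target
    )


def count_on_host(host, exe_name):
    """Run ps on a remote node over ssh and count matching processes.

    Assumes passwordless ssh to `host`; raises CalledProcessError
    if the ssh or ps command fails.
    """
    result = subprocess.run(
        ["ssh", host, "ps", "-e", "-o", "comm="],
        capture_output=True,
        text=True,
        check=True,
    )
    return count_processes(result.stdout, exe_name)
```

For example, calling count_on_host("node01", "my_mpi_app") for each node in
the job (both names are placeholders) and summing the results gives a total
to compare against the number of ranks that were launched. If the total still
equals the launch count while the job is hung, then no process has exited.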

I'll try, but sometimes these things are hard to reproduce and I have
to wait for free nodes to do the test. If I do manage to reproduce the
issue (I've added ERR= traps, so I would have to revert them), is there
anything else to look at?

>
>> I posted to the Intel ifort site as well, and the response I got (see
>> link below) is that "There is a feature request in to add this
>> functionality, but it is not currently on the list for
>> implementation."
>>
>> http://software.intel.com/en-us/forums/showthread.php?t=71571&o=d&s=lr
>
> Bummer!
>
> I'm tangentially involved in Fortran/MPI stuff, but I'm not enough of a 
> Fortran expert to know how to help here -- I understand that in your final 
> production code, this problem likely won't occur.  But that doesn't help 
> while you're writing / debugging the code itself (which is a huge amount of 
> time and effort).
>
> --
> Jeff Squyres
> jsquy...@cisco.com

-- 
Laurence Marks
Department of Materials Science and Engineering
MSE Rm 2036 Cook Hall
2220 N Campus Drive
Northwestern University
Evanston, IL 60208, USA
Tel: (847) 491-3996 Fax: (847) 491-7820
email: L-marks at northwestern dot edu
Web: www.numis.northwestern.edu
Chair, Commission on Electron Crystallography of IUCR
www.numis.northwestern.edu/
Electron crystallography is the branch of science that uses electron
scattering and imaging to study the structure of matter.
