On Fri, Jan 29, 2010 at 6:59 AM, Jeff Squyres <jsquy...@cisco.com> wrote:
> On Jan 28, 2010, at 2:23 PM, Laurence Marks wrote:
>
>> > If one process dies prematurely in Open MPI (i.e., before MPI_Finalize),
>> > all the others should be automatically killed.
>>
>> This does not seem to be happening. Part of the problem may be (and I
>> am out of my depth here) that the fortran rtl library (ifort) does not
>> appear to be dying prematurely, at least there is no signal that I can
>> catch (I'm not a good c programmer).
>
> Ahh. That would be a problem. If the process doesn't die, then Open MPI has
> no way to know that it is hung, and therefore any other MPI processes that
> are waiting for messages (or whatever) from the hung process will eventually
> block waiting for input that will never come. End result: the entire MPI job
> hangs.
>
> Can you double check that this is actually what is happening? I.e., that no
> process is actually exiting? It would just be good to confirm that that is
> actually what is happening (and make me feel better that we don't have some
> corner case where an MPI process aborting early isn't terminating the entire
> job properly). If you run your MPI job and you see this error occurs, go run
> "ps" on all the nodes where the job is running and count the number of MPI
> processes that you see.

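The "ps" count suggested above could be scripted rather than done by hand on each node. What follows is only a minimal sketch, not something from the thread: it assumes passwordless ssh to every node and a POSIX-style "ps", and the names `count_processes`, `ps_on_host`, `survey`, the host names, and the executable name `my_mpi_app` are all hypothetical placeholders.

```python
import subprocess


def count_processes(ps_output, exe_name):
    """Count lines of `ps -e -o comm=` output whose command name equals exe_name."""
    names = (line.strip() for line in ps_output.splitlines())
    return sum(1 for name in names if name == exe_name)


def ps_on_host(host):
    """Return the command names of all processes on `host`.

    Assumption: passwordless ssh works for this host. Note that on Linux
    the `comm` field is truncated to 15 characters, so a longer executable
    name has to be given in its truncated form.
    """
    result = subprocess.run(
        ["ssh", host, "ps", "-e", "-o", "comm="],
        capture_output=True, text=True, check=True,
    )
    return result.stdout


def survey(hosts, exe_name, fetch=ps_on_host):
    """Return {host: number of running processes named exe_name}."""
    return {host: count_processes(fetch(host), exe_name) for host in hosts}


# Example use (hypothetical host and executable names):
#   for host, n in survey(["node01", "node02"], "my_mpi_app").items():
#       print(f"{host}: {n}")
```

If the totals add up to the number of ranks that were launched while the job is stuck, no process has exited and the job is hung rather than partially dead. A lower total would point to the corner case Jeff mentions, where a process exited without taking the rest of the job down.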
I'll try, but sometimes these things are hard to reproduce, and I have to
wait for free nodes to do the test. If I do manage to reproduce the issue
(I've added ERR= traps, so I would have to revert those first), is there
anything else I should look at?

>
>> I posted to the Intel ifort site as well, and the response I got (see
>> link below) is that "There is a feature request in to add this
>> functionality, but it is not currently on the list for
>> implementation."
>>
>> http://software.intel.com/en-us/forums/showthread.php?t=71571&o=d&s=lr
>
> Bummer!
>
> I'm tangentially involved in Fortran/MPI stuff, but I'm not enough of a
> Fortran expert to know how to help here -- I understand that in your final
> production code, this problem likely won't occur. But that doesn't help
> while you're writing / debugging the code itself (which is a huge amount of
> time and effort).
>
> --
> Jeff Squyres
> jsquy...@cisco.com

-- 
Laurence Marks
Department of Materials Science and Engineering
MSE Rm 2036 Cook Hall
2220 N Campus Drive
Northwestern University
Evanston, IL 60208, USA
Tel: (847) 491-3996 Fax: (847) 491-7820
email: L-marks at northwestern dot edu
Web: www.numis.northwestern.edu
Chair, Commission on Electron Crystallography of IUCR
www.numis.northwestern.edu/
Electron crystallography is the branch of science that uses electron
scattering and imaging to study the structure of matter.