I managed to find time to reproduce the issue, although the results are not very consistent from run to run, and I suspect it may not be easy to reproduce with a simple code. I have also never actually written an MPI code myself. (I am cc'ing Michael Sternberg, who compiled Open MPI, in case there are flags to add to the compilation.)
I have 8 processes on a single dual quad-core machine reading from the same file using formatted Fortran I/O. I deliberately created an error in the read. If the error is a format error, all the processes terminate. If the error is because there is not enough data (EOF), I get somewhere from 1 to 7 zombies. They don't seem to be doing anything (top -ulmarks shows no CPU activity), but I have no idea whether they hold locks on the file or anything else (I think they might, but have no idea how to tell).

On Fri, Jan 29, 2010 at 6:18 PM, Jeff Squyres <jsquy...@cisco.com> wrote:
> On Jan 29, 2010, at 9:13 AM, Laurence Marks wrote:
>
>> OK, but trivial codes don't always reproduce problems.
>
> Yes, but if the problem is a file reading beyond the end, that should be
> fairly isolated behavior.
>
>> Is strace useful?
>
> Sure. But let's check to see if the apps are actually dying or hanging first.
>
> --
> Jeff Squyres
> jsquy...@cisco.com
>
>
> _______________________________________________
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users
>

--
Laurence Marks
Department of Materials Science and Engineering
MSE Rm 2036 Cook Hall
2220 N Campus Drive
Northwestern University
Evanston, IL 60208, USA
Tel: (847) 491-3996 Fax: (847) 491-7820
email: L-marks at northwestern dot edu
Web: www.numis.northwestern.edu
Chair, Commission on Electron Crystallography of IUCR
www.numis.northwestern.edu/
Electron crystallography is the branch of science that uses electron scattering and imaging to study the structure of matter.
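
One way to tell whether the leftover processes are true zombies or merely sleeping or blocked is to read their state from /proc. The script below is only a sketch, assuming a Linux system; the names `parse_state`, `process_state` and `STATE_NAMES` are made up for illustration and are not part of Open MPI or any other tool mentioned in this thread.

```python
import sys

# Meaning of the one-letter state codes in /proc/<pid>/stat (Linux only).
STATE_NAMES = {
    "R": "running",
    "S": "sleeping (interruptible)",
    "D": "uninterruptible wait (usually I/O)",
    "Z": "zombie (exited, not yet reaped by parent)",
    "T": "stopped",
}


def parse_state(stat_line):
    """Return the state letter from the contents of /proc/<pid>/stat."""
    # The command name is wrapped in parentheses and may itself contain
    # spaces or ')', so take everything after the LAST ')'.
    after_comm = stat_line[stat_line.rindex(")") + 1:]
    return after_comm.split()[0]


def process_state(pid):
    """Read /proc/<pid>/stat and return the state letter (hypothetical helper)."""
    with open("/proc/%d/stat" % pid) as f:
        return parse_state(f.read())


if __name__ == "__main__":
    # Usage: python check_state.py PID [PID ...]
    for arg in sys.argv[1:]:
        state = process_state(int(arg))
        print(arg, state, STATE_NAMES.get(state, "other"))
```

A process reported as Z has already exited and released its open files, so it should not be holding a lock on the input file. A process in state S or D is still alive and could be.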