Hi Lydia,

I would like to say we clean up perfectly, but.... :-(
The system does try its best. I'm a little surprised here, since we usually clean up when an application process fails. Our only known problems occur when one or more of the orteds fail, usually because a node reboots or fails. We hope to plug that "hole" in the spring.

You might try updating to a later version (we are at r12890+ now). I don't think that will totally solve the problem, but it might help.

We are working on an "orteclean" program that people can use when the system doesn't clean up properly: it will go through and kill any remaining orteds and clean up their session directories. Hopefully, we will have something out in that regard in January.

In the meantime, I'll take another look at the scenario you described and see what I can do.

Thanks for your patience!
Ralph

On 12/19/06 2:53 AM, "Lydia Heck" <lydia.h...@durham.ac.uk> wrote:

> A job which crashes with a floating point underflow (or any IEEE floating
> point exception) fails to clean up after itself using
>
> openmpi-1.3a1r12695 ..
>
> Nodes with copies of slaves are sitting there ...
>
> I also noticed that orted are left behind on other crashed jobs ..
>
> Should I have to expect this?
>
> Lydia
>
> ------------------------------------------
> Dr E L Heck
>
> University of Durham
> Institute for Computational Cosmology
> Ogden Centre
> Department of Physics
> South Road
>
> DURHAM, DH1 3LE
> United Kingdom
>
> e-mail: lydia.h...@durham.ac.uk
>
> Tel.: + 44 191 - 334 3628
> Fax.: + 44 191 - 334 3645
> ___________________________________________
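
P.S. Until something like orteclean exists, the manual cleanup described above (kill any remaining orteds, then remove their session directories) could be roughly sketched as below. This is only an illustration, not the planned tool. It assumes the session directories sit in the system temp directory and are named `openmpi-sessions-<user>@<host>_...`; please check what your installation actually creates before deleting anything. The function names are made up for this example. Leftover daemons would have to be killed first on each node, for instance with `pkill -u <user> orted`.

```python
import getpass
import shutil
import tempfile
from pathlib import Path


def find_session_dirs(base, user, prefix="openmpi-sessions-"):
    """Return directories under `base` named '<prefix><user>@...'.

    The naming scheme is an assumption; adjust `prefix` to match
    what your Open MPI build really leaves behind.
    """
    base = Path(base)
    if not base.is_dir():
        return []
    wanted = f"{prefix}{user}@"
    return sorted(
        p for p in base.iterdir()
        if p.is_dir() and p.name.startswith(wanted)
    )


def clean_session_dirs(base, user, dry_run=True):
    """List the user's session directories and, unless dry_run, delete them."""
    stale = find_session_dirs(base, user)
    if not dry_run:
        for path in stale:
            shutil.rmtree(path, ignore_errors=True)
    return stale


if __name__ == "__main__":
    # Dry run only: show what would be removed, delete nothing.
    for path in clean_session_dirs(tempfile.gettempdir(), getpass.getuser()):
        print("would remove", path)
```

Run it as a dry run first on each affected node, and only call `clean_session_dirs(..., dry_run=False)` once no job of yours is still running there, since a live job's session directory would match as well.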