Hi Lydia

I would like to say we clean up perfectly, but.... :-(

The system does try its best. I'm a little surprised here since we usually
clean up when an application process fails. Our only known problems are when
one or more of the orteds fail, usually due to a node rebooting or failing.
We hope to plug that "hole" in the spring.

You might try updating to a later version (we are at r12890+ now). I don't
think that will totally solve the problem, but it might help.

We are working on an "orteclean" program that people can use when the system
doesn't clean up properly - it will go through and kill any remaining orteds
and clean up those session directories. Hopefully, we will have something out
in that regard in Jan.

Meantime, I'll take a look again at the scenario you described and see what
I can do.

Thanks for your patience!
Ralph


On 12/19/06 2:53 AM, "Lydia Heck" <lydia.h...@durham.ac.uk> wrote:

> 
> A job which crashes with a floating point underflow (or any IEEE floating
> point exception) fails to clean up after itself using
> 
> openmpi-1.3a1r12695 ..
> 
> Nodes with copies of slaves are sitting there ...
> 
> I also noticed that orteds are left behind by other crashed jobs ..
> 
> Should I expect this?
> 
> Lydia
> 
> ------------------------------------------
> Dr E L  Heck
> 
> University of Durham
> Institute for Computational Cosmology
> Ogden Centre
> Department of Physics
> South Road
> 
> DURHAM, DH1 3LE
> United Kingdom
> 
> e-mail: lydia.h...@durham.ac.uk
> 
> Tel.: + 44 191 - 334 3628
> Fax.: + 44 191 - 334 3645
> ___________________________________________
> _______________________________________________
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users

