Re: [OMPI users] ompi-checkpoint is hanging

2008-12-16 Thread Josh Hursey
Matthias, I think that the patch attached to the ticket below should address your issue: https://svn.open-mpi.org/trac/ompi/ticket/1619 I was able to reproduce this problem fairly reliably with a particular benchmark, on a particular configuration, and with very frequent checkpoints. With this

Re: [OMPI users] ompi-checkpoint is hanging

2008-10-31 Thread Matthias Hovestadt
Hi! I'll work on a patch, and let you know when it is ready. Unfortunately it probably won't be for a couple weeks. :( Ok, thanks a lot for letting me know. In three weeks we'll have a booth at ICT (http://ec.europa.eu/information_society/events/ict/2008) where we plan to showcase fault tolera

Re: [OMPI users] ompi-checkpoint is hanging

2008-10-31 Thread Josh Hursey
After some additional testing I believe that I have been able to reproduce the problem. I suspect that there is a bug in the coordination protocol that is causing an occasional hang in the system. Since it only happens occasionally (though slightly more often on a fully loaded machine) that

Re: [OMPI users] ompi-checkpoint is hanging

2008-10-31 Thread Matthias Hovestadt
Hi Tim! First of all: thanks a lot for answering! :-) Could you try running your two MPI jobs with fewer procs each, say 2 or 3 each instead of 4, so that there are a few extra cores available. This problem occurs with any number of procs. Also, what happens to the checkpointing of one MP

Re: [OMPI users] ompi-checkpoint is hanging

2008-10-31 Thread Tim Mattox
Hello Matthias, Hopefully Josh will chime in shortly. But I have one suggestion to help diagnose this. Could you try running your two MPI jobs with fewer procs each, say 2 or 3 each instead of 4, so that there are a few extra cores available. I know that isn't a solution, but it may help us diagn

[OMPI users] ompi-checkpoint is hanging

2008-10-31 Thread Matthias Hovestadt
Hi! I'm using the development version of OMPI from SVN (rev. 19857) for executing MPI jobs on my cluster system. In particular, I'm using the checkpoint and restart feature, based on the most recent version of BLCR. The checkpointing is working fine as long as I only execute a single job o