Re: [OMPI users] ompi-checkpoint is hanging

2008-12-16 Thread Josh Hursey
Matthias, I think that the patch attached to the ticket below should address your issue: https://svn.open-mpi.org/trac/ompi/ticket/1619 I was able to reproduce this problem fairly reliably with a particular benchmark, on a particular configuration and very frequent checkpoints. With this

Re: [OMPI users] ompi-checkpoint is hanging

2008-10-31 Thread Matthias Hovestadt
Hi! I'll work on a patch, and let you know when it is ready. Unfortunately it probably won't be for a couple weeks. :( Ok, thanks a lot for letting me know. In three weeks we'll have a booth at ICT (http://ec.europa.eu/information_society/events/ict/2008) where we plan to showcase fault tolera

Re: [OMPI users] ompi-checkpoint is hanging

2008-10-31 Thread Josh Hursey
After some additional testing I believe that I have been able to reproduce the problem. I suspect that there is a bug in the coordination protocol that is causing an occasional hang in the system. Since it only happens occasionally (though slightly more often on a fully loaded machine) that

Re: [OMPI users] ompi-checkpoint is hanging

2008-10-31 Thread Matthias Hovestadt
Hi Tim! First of all: thanks a lot for answering! :-) Could you try running your two MPI jobs with fewer procs each, say 2 or 3 each instead of 4, so that there are a few extra cores available. This problem occurs with any number of procs. Also, what happens to the checkpointing of one MP

Re: [OMPI users] ompi-checkpoint is hanging

2008-10-31 Thread Tim Mattox
Hello Matthias, Hopefully Josh will chime in shortly. But I have one suggestion to help diagnose this. Could you try running your two MPI jobs with fewer procs each, say 2 or 3 each instead of 4, so that there are a few extra cores available. I know that isn't a solution, but it may help us diagn