Matthias,
I think that the patch attached to the ticket below should address
your issue:
https://svn.open-mpi.org/trac/ompi/ticket/1619
I was able to reproduce this problem fairly reliably with a particular
benchmark, on a particular configuration, and with very frequent
checkpoints. With this
Hi!
I'll work on a patch, and let you know when it is ready. Unfortunately
it probably won't be for a couple weeks. :(
Ok, thanks a lot for letting me know. In three weeks we'll have a booth
at ICT (http://ec.europa.eu/information_society/events/ict/2008)
where we plan to showcase fault tolera
After some additional testing I believe that I have been able to
reproduce the problem. I suspect that there is a bug in the
coordination protocol that is causing an occasional hang in the
system. Since it only happens occasionally (though slightly more often
on a fully loaded machine) that
Hi Tim!
First of all: thanks a lot for answering! :-)
Could you try running your two MPI jobs with fewer procs each,
say 2 or 3 each instead of 4, so that there are a few extra cores available.
This problem occurs with any number of procs.
Also, what happens to the checkpointing of one MP
Hello Matthias,
Hopefully Josh will chime in shortly. But I have one suggestion to help
diagnose this. Could you try running your two MPI jobs with fewer procs
each, say 2 or 3 each instead of 4, so that there are a few extra cores
available. I know that isn't a solution, but it may help us diagn