Matthias,
I think that the patch attached to the ticket below should address
your issue:
https://svn.open-mpi.org/trac/ompi/ticket/1619
I was able to reproduce this problem fairly reliably with a particular
benchmark, a particular configuration, and very frequent checkpoints.
With this
Hi!
I'll work on a patch, and let you know when it is ready. Unfortunately
it probably won't be for a couple weeks. :(
Ok, thanks a lot for letting me know. In three weeks we'll
have a booth at ICT
(http://ec.europa.eu/information_society/events/ict/2008)
where we plan to showcase fault tolera
After some additional testing I believe that I have been able to
reproduce the problem. I suspect that there is a bug in the
coordination protocol that is causing an occasional hang in the
system. Since it only happens occasionally (though slightly more often
on a fully loaded machine) that
Hi Tim!
First of all: thanks a lot for answering! :-)
Could you try running your two MPI jobs with fewer procs each,
say 2 or 3 each instead of 4, so that there are a few extra cores available.
This problem occurs with any number of procs.
Also, what happens to the checkpointing of one MP
Hello Matthias,
Hopefully Josh will chime in shortly. But I have one suggestion to
help diagnose this. Could you try running your two MPI jobs with fewer procs each,
say 2 or 3 each instead of 4, so that there are a few extra cores available.
I know that isn't a solution, but it may help us diagn
Hi!
I'm using the development version of OMPI from SVN (rev. 19857)
to execute MPI jobs on my cluster system. In particular, I'm using
the checkpoint and restart feature, based on the most recent version
of BLCR.
Checkpointing works fine as long as I only execute
a single job o