Thank you very much for your comments. I worked around the problem so I
don't need MPI_Cancel anymore.
Hi slimtimmy,
I have been involved in several of the MPI Forum's discussions of how
MPI_Cancel should work and I agree with your interpretation of the
standard. By my reading of the standard
Hi Josh:
I am running on Linux, Fedora Core 7, kernel 2.6.23.15-80.fc7.
The machine is dual-core with shared memory, so it's not even a cluster.
I downloaded r18208 and built it with the following options:
./configure --prefix=/usr/local/openmpi-with-checkpointing-r18208 --with-ft=cr --with-blc
This problem has come up in the past and may have been fixed since
r14519. Can you update to r18208 and see if the error still occurs?
A few other questions that will help me try to reproduce the problem.
Can you tell me more about the configuration of the system you are
running on (number
Thanks Josh, I tried what you suggested with my existing r14519, and I
was able to checkpoint the restarted job but was never able to restart
it. I looked up the PID for 'orterun' and checkpointed the restarted
job, but when I try to restart from that point I get the following error:
ompi-re
When you use 'ompi-restart' to restart a job it fork/execs the
completely new job using the restarted processes for the ranks.
However, instead of calling the 'mpirun' process, ompi-restart currently
calls 'orterun'. These two programs are exactly the same (mpirun is a
symbolic link to orteru
Dear all, I installed the developer's version r14519 and was able to
get it running. I successfully checkpointed a parallel job and
restarted it. My question is how can I checkpoint the restarted job?
The problem is that once the original job is terminated and restarted later
on, the mpirun does