Re: [OMPI users] MPI_CANCEL

2008-04-18 Thread slimtimmy
Thank you very much for your comments. I worked around the problem, so I don't need MPI_Cancel anymore. Hi slimtimmy, I have been involved in several of the MPI Forum's discussions of how MPI_Cancel should work, and I agree with your interpretation of the standard. By my reading of the standard

Re: [OMPI users] How to restart a job twice

2008-04-18 Thread Tamer
Hi Josh: I am running on Linux Fedora Core 7, kernel 2.6.23.15-80.fc7. The machine is dual-core with shared memory, so it's not even a cluster. I downloaded r18208 and built it with the following options: ./configure --prefix=/usr/local/openmpi-with-checkpointing-r18208 --with-ft=cr --with-blc

Re: [OMPI users] How to restart a job twice

2008-04-18 Thread Josh Hursey
This problem has come up in the past and may have been fixed since r14519. Can you update to r18208 and see if the error still occurs? A few other questions that will help me try to reproduce the problem. Can you tell me more about the configuration of the system you are running on (number

Re: [OMPI users] How to restart a job twice

2008-04-18 Thread Tamer
Thanks Josh, I tried what you suggested with my existing r14519, and I was able to checkpoint the restarted job but was never able to restart it. I looked up the PID for 'orterun' and checkpointed the restarted job but when I try to restart from that point I get the following error: ompi-re

Re: [OMPI users] How to restart a job twice

2008-04-18 Thread Josh Hursey
When you use 'ompi-restart' to restart a job, it fork/execs a completely new job using the restarted processes for the ranks. However, instead of calling the 'mpirun' process, ompi-restart currently calls 'orterun'. These two programs are exactly the same (mpirun is a symbolic link to orteru

[OMPI users] How to restart a job twice

2008-04-18 Thread Tamer
Dear all, I installed the developer's version r14519 and was able to get it running. I successfully checkpointed a parallel job and restarted it. My question is how can I checkpoint the restarted job? The problem is once the original job is terminated and restarted later on, the mpirun does