Dear all, I installed the developer version (r14519) and was able to get it running. I successfully checkpointed a parallel job and restarted it. My question is: how can I checkpoint the restarted job? The problem is that once the original job has been terminated and later restarted, the mpirun process no longer appears in the process list (ps -efa|grep mpirun), so I do not know which PID to use when running ompi-checkpoint on the restarted job. Any help would be greatly appreciated.
