When you use 'ompi-restart' to restart a job, it fork/execs a
completely new job, using the restarted processes for the ranks.
However, instead of launching 'mpirun', ompi-restart currently
launches 'orterun'. The two programs are exactly the same (mpirun is a
symbolic link to orterun). So if you look up the PID of the 'orterun'
process, you can pass that PID to ompi-checkpoint to checkpoint the
restarted job.
I agree that it is confusing for Open MPI to make this switch, so I
committed a fix in r18208 that uses the 'mpirun' binary name instead
of the 'orterun' binary name. This fits the typical use case of
checkpoint/restart in Open MPI, in which users expect to find the
'mpirun' process on restart rather than the lesser-known 'orterun'
process.
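If you are on a build without that fix, the PID lookup described above
could be scripted along these lines. This is only a minimal sketch in
Python: the helper names (find_launcher_pids, list_processes) are made
up for illustration and are not part of Open MPI, and it assumes a
'ps' that accepts the POSIX '-o pid= -o comm=' options.

```python
import subprocess

# Names the launcher may appear under: 'orterun' after an ompi-restart
# without the fix, 'mpirun' otherwise.
LAUNCHER_NAMES = ("mpirun", "orterun")


def find_launcher_pids(ps_output, names=LAUNCHER_NAMES):
    """Return the PIDs of processes whose command name is in `names`.

    `ps_output` is expected to hold one "PID COMMAND" pair per line,
    with no header line.
    """
    pids = []
    for line in ps_output.splitlines():
        parts = line.split(None, 1)
        if len(parts) != 2 or not parts[0].isdigit():
            continue  # skip blank or malformed lines
        pid, command = parts
        # Some systems report a full path, so compare the basename only.
        if command.strip().rsplit("/", 1)[-1] in names:
            pids.append(int(pid))
    return pids


def list_processes():
    """Hypothetical helper: run ps and return its output as text."""
    result = subprocess.run(
        ["ps", "-e", "-o", "pid=", "-o", "comm="],
        capture_output=True, text=True, check=True,
    )
    return result.stdout


# Usage (not run here):
#   for pid in find_launcher_pids(list_processes()):
#       print(pid)   # then: ompi-checkpoint <pid>
```

If more than one PID comes back, you have several jobs running, and
you will need to pick the one that belongs to the restarted job before
handing it to ompi-checkpoint.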
Sorry for the confusion.
Josh
On Apr 18, 2008, at 1:14 AM, Tamer wrote:
Dear all, I installed the developer's version r14519 and was able to
get it running. I successfully checkpointed a parallel job and
restarted it. My question is how can I checkpoint the restarted job?
The problem is that once the original job is terminated and restarted
later on, the mpirun process no longer exists (ps -efa|grep mpirun),
and hence I do not know which PID to use when I run ompi-checkpoint on
the restarted job. Any help would be greatly appreciated.