I recently hit this same problem while doing some scalability
testing. I experimented with adding the --no-restore-pid option, but
ran into the same behavior you describe. Unfortunately, the problem is
with BLCR, not Open MPI.
BLCR restarts the process with a new PID, but the value returned
from getpid() is still the old PID, not the new one. So when we connect
the daemon and the newly restarted process, they exchange an invalid
PID. This eventually leads to ompi-checkpoint waiting for a response
from a PID that may not exist on the machine.
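For what it's worth, you can see the mismatch directly by comparing
getpid() with the PID the kernel reports through /proc. This is just a
Linux-specific diagnostic sketch, not part of Open MPI or BLCR:

    /* Diagnostic sketch: compare the PID reported by getpid() with the PID
     * the kernel assigned, as seen in /proc/self/stat. After a BLCR restart
     * with --no-restore-pid the two can disagree, which is the mismatch
     * described above. Linux-specific; not part of Open MPI or BLCR. */
    #include <stdio.h>
    #include <unistd.h>

    int main(void)
    {
        pid_t reported = getpid();
        int actual = -1;

        /* /proc/self is resolved by the kernel, so the first field of
         * /proc/self/stat is the PID the kernel actually assigned. */
        FILE *fp = fopen("/proc/self/stat", "r");
        if (fp != NULL) {
            if (fscanf(fp, "%d", &actual) != 1) {
                actual = -1;
            }
            fclose(fp);
        }

        printf("getpid() = %d, /proc/self/stat = %d%s\n",
               (int)reported, actual,
               ((int)reported == actual) ? "" : "  <-- mismatch");
        return 0;
    }

If you run something like this inside a restarted image and the two
values disagree, that stale PID is what the daemon ends up waiting on.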
I am working on a bug report for BLCR at the moment. Once it is fixed
on that side, I would be happy to add a --no-restore-pid-like
option to the Open MPI C/R system.
-- Josh
On May 14, 2010, at 11:34 AM, <ananda.mu...@wipro.com> wrote:
Hi
I am using Open MPI v1.3.4 with BLCR 0.8.2. I have been testing my
Open MPI-based program on a 3-node cluster (each node an Intel
Nehalem-based dual quad-core) and have successfully checkpointed and
restarted the program multiple times.
Recently I moved to a 15-node cluster with the same configuration,
and I started seeing a problem with ompi-restart.
ompi-checkpoint completes successfully, and I terminate the program
after that. I made sure there were no MPI processes left running
before restarting. When I restart using ompi-restart, a few of the
MPI processes fail to restart with the error message “found pid 4185
in use; Restart failed: Device or Resource busy” (with different PID
numbers, of course). What I found is that each failing MPI process
gets restarted on a different node than the one it was running on
before termination, and it cannot reuse its original PID there.
Unlike BLCR’s cr_restart, ompi-restart doesn’t have an option such as
“--no-restore-pid” to tell it not to reuse the original PIDs.
Since ompi-restart in turn calls cr_restart, I tried aliasing
cr_restart to “cr_restart --no-restore-pid” (the wrapper idea is
sketched below). This did make the “pid in use” problem go away, and
the processes restart successfully.
However, if I then call ompi-checkpoint on the restarted Open MPI job,
both the job (all MPI processes) and the checkpoint command hang
forever. I guess this is because ompi-restart is tracking a different
set of PIDs than the ones that are actually running.
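To be concrete about the workaround: the “alias” is effectively a
wrapper placed in front of cr_restart that prepends --no-restore-pid
before forwarding the original arguments. Here is a minimal sketch of
such a wrapper; the path to the real cr_restart binary
(/usr/bin/cr_restart.real below) is just an assumption for
illustration:

    /* Wrapper sketch: stand in for cr_restart, but forward to the real
     * binary with --no-restore-pid prepended. The install path of the
     * real binary below is an assumption for illustration only. */
    #include <stdlib.h>
    #include <unistd.h>

    int main(int argc, char *argv[])
    {
        /* Slots: real binary, --no-restore-pid, argv[1..argc-1], NULL. */
        char **args = calloc(argc + 2, sizeof(char *));
        if (args == NULL) {
            return 1;
        }
        args[0] = "/usr/bin/cr_restart.real";  /* assumed path to real cr_restart */
        args[1] = "--no-restore-pid";
        for (int i = 1; i < argc; i++) {
            args[i + 1] = argv[i];
        }
        args[argc + 1] = NULL;

        execv(args[0], args);
        return 1;  /* reached only if execv() fails */
    }

With this in place the restart itself succeeds, but as described above
the PIDs Open MPI recorded no longer match the PIDs actually running,
which is presumably why the subsequent ompi-checkpoint hangs.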
Long story short, I am stuck with this problem because I cannot get
the original PIDs back during restart.
I would really appreciate any other options you can share that would
help me work around this problem.
Thanks
Ananda