So I recently hit this same problem while doing some scalability testing. I experimented with adding the --no-restore-pid option, but ran into the same issue you describe. Unfortunately, the problem is with BLCR, not Open MPI.

BLCR will restart the process with a new PID, but the value returned from getpid() is the old PID, not the new one. So when we connect the daemon and the newly restarted process, they exchange an invalid PID. This eventually leads to ompi-checkpoint waiting on a PID that may not exist on the machine.
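For what it is worth, the mismatch is easy to see outside of Open MPI. Below is a minimal sketch (mine, not something shipped with BLCR or Open MPI; it assumes Linux with /proc mounted) that prints both getpid() and the PID the kernel reports through /proc/self. Start it under cr_run, checkpoint it, kill it, and restart the context file with "cr_restart --no-restore-pid"; if the two values disagree after the restart, that is the stale-PID behavior described above.

/* pid_check.c -- rough illustrative sketch, not from BLCR or Open MPI.
 * Prints getpid() and the kernel's idea of our PID (via /proc/self)
 * every few seconds, so the two can be compared after a
 * "cr_restart --no-restore-pid" restart.  Assumes Linux with /proc. */
#include <stdio.h>
#include <unistd.h>

int main(void)
{
    char buf[64];

    for (;;) {
        ssize_t n = readlink("/proc/self", buf, sizeof(buf) - 1);
        if (n < 0) {
            perror("readlink /proc/self");
            return 1;
        }
        buf[n] = '\0';

        /* getpid() may still report the pre-checkpoint PID, while
         * /proc/self reflects the PID assigned by the kernel on restart. */
        printf("getpid() = %ld, /proc/self -> %s\n", (long)getpid(), buf);
        fflush(stdout);
        sleep(5);
    }
    return 0;
}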

I am working on a bug report for BLCR at the moment. Once it is fixed on that side, I would be happy to add a --no-restore-pid-style option to the Open MPI C/R system.

-- Josh

On May 14, 2010, at 11:34 AM, <ananda.mu...@wipro.com> wrote:

Hi

I am using Open MPI v1.3.4 with BLCR 0.8.2. I have been testing my Open MPI-based program on a 3-node cluster (each node is an Intel Nehalem-based dual quad-core) and I have been able to checkpoint and restart the program successfully multiple times.

Recently I moved to a 15-node cluster with the same configuration, and I started seeing a problem with ompi-restart.

Ompi-checkpoint completes successfully, and I terminate the program after that. I have ensured that there are no MPI processes left before I restart. When I restart using ompi-restart, a few of the MPI processes fail to restart with the error “found pid 4185 in use; Restart failed: Device or Resource busy” (with different PID numbers each time, of course). What I found is that the failing MPI process was restarted on a different node than the one it was running on before termination, and it cannot reuse its old PID there.

Unlike cr_restart (BLCR), ompi-restart does not have an option such as “--no-restore-pid” to tell it not to restore the same PID. Since ompi-restart in turn calls cr_restart, I tried aliasing cr_restart to “cr_restart --no-restore-pid”. This did make the “pid in use” problem go away, and the process completes successfully. However, if I then call ompi-checkpoint on the restarted Open MPI job, both the Open MPI job (all MPI processes) and the checkpoint command hang forever. I guess this is because ompi-restart is tracking a different set of PIDs than the ones the processes are actually running with.
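[If the shell alias is not picked up when ompi-restart launches cr_restart non-interactively, the same interposition can be done with a small wrapper placed ahead of the real binary in PATH. The sketch below is only illustrative: the path /usr/bin/cr_restart for the real binary is an assumption and should be adjusted for your install, and as noted above it only addresses the “pid in use” error, not the subsequent ompi-checkpoint hang.

/* cr_restart_wrap.c -- illustrative wrapper, not part of BLCR or Open MPI.
 * Install the compiled binary as "cr_restart" in a directory that comes
 * before the real one in PATH; it re-execs the real cr_restart with
 * --no-restore-pid prepended to the original arguments. */
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

int main(int argc, char *argv[])
{
    /* NOTE: the location of the real cr_restart is an assumption. */
    const char *real = "/usr/bin/cr_restart";

    char **newargv = calloc(argc + 2, sizeof(char *));
    if (newargv == NULL) {
        perror("calloc");
        return 1;
    }

    newargv[0] = (char *)real;
    newargv[1] = "--no-restore-pid";
    for (int i = 1; i < argc; i++)
        newargv[i + 1] = argv[i];
    newargv[argc + 1] = NULL;

    execv(real, newargv);
    perror("execv");        /* reached only if the exec failed */
    return 1;
}]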

Long story short, I am stuck with this problem, as I cannot get the original PIDs back during restart.

I would really appreciate it if you could share any other options I can use to overcome this problem.

Thanks
Ananda
