I recently hit this same problem while doing some scalability
testing. I experimented with adding the --no-restore-pid option, but
ran into the same behavior you describe. Unfortunately, the problem is
with BLCR, not Open MPI.
BLCR restarts the process with a new PID, but the value returned
from getpid() is still the old PID, not the new one. So when we connect
the daemon and the newly restarted process, they exchange an invalid
PID. This eventually leads to ompi-checkpoint waiting for a response
from a PID that may not exist on the machine.
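For what it's worth, you can see the mismatch directly by comparing
getpid() with the PID the kernel reports through /proc. This is just a
Linux-specific diagnostic sketch, not part of Open MPI or BLCR:

    /* Diagnostic sketch: compare the PID reported by getpid() with the PID
     * the kernel assigned, as seen in /proc/self/stat. After a BLCR restart
     * with --no-restore-pid the two can disagree, which is the mismatch
     * described above. Linux-specific; not part of Open MPI or BLCR. */
    #include <stdio.h>
    #include <unistd.h>

    int main(void)
    {
        pid_t reported = getpid();
        int actual = -1;

        /* /proc/self is resolved by the kernel, so the first field of
         * /proc/self/stat is the PID the kernel actually assigned. */
        FILE *fp = fopen("/proc/self/stat", "r");
        if (fp != NULL) {
            if (fscanf(fp, "%d", &actual) != 1) {
                actual = -1;
            }
            fclose(fp);
        }

        printf("getpid() = %d, /proc/self/stat = %d%s\n",
               (int)reported, actual,
               ((int)reported == actual) ? "" : "  <-- mismatch");
        return 0;
    }

If you run something like this inside a restarted image and the two
values disagree, that stale PID is what the daemon ends up waiting on.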
I am working on a bug report for BLCR at the moment. Once it is fixed
on that side, I would be happy to add a --no-restore-pid-like
option to the Open MPI C/R system.
-- Josh
On May 14, 2010, at 11:34 AM, <ananda.mu...@wipro.com> wrote:
Hi
I am using Open MPI v1.3.4 with BLCR 0.8.2. I have been testing my
Open MPI-based program on a 3-node cluster (each node an Intel
Nehalem-based dual quad-core) and have successfully checkpointed and
restarted the program multiple times.
Recently I moved to a 15-node cluster with the same configuration,
and I started seeing a problem with ompi-restart.
ompi-checkpoint completes successfully, and I terminate the program
after that. I made sure there were no MPI processes left running
before restarting. When I restart using ompi-restart, a few of the
MPI processes fail to restart with the error message “found pid 4185
in use; Restart failed: Device or Resource busy” (with different PID
numbers, of course). What I found is that each failing MPI process
gets restarted on a different node than the one it was running on
before termination, and it cannot reuse its original PID there.
Unlike BLCR’s cr_restart, ompi-restart doesn’t have an option such as
“--no-restore-pid” to tell it not to reuse the original PIDs.
Since ompi-restart in turn calls cr_restart, I tried aliasing
cr_restart to “cr_restart --no-restore-pid” (the wrapper idea is
sketched below). This did make the “pid in use” problem go away, and
the processes restart successfully.
However, if I then call ompi-checkpoint on the restarted Open MPI job,
both the job (all MPI processes) and the checkpoint command hang
forever. I guess this is because ompi-restart is tracking a different
set of PIDs than the ones that are actually running.
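To be concrete about the workaround: the “alias” is effectively a
wrapper placed in front of cr_restart that prepends --no-restore-pid
before forwarding the original arguments. Here is a minimal sketch of
such a wrapper; the path to the real cr_restart binary
(/usr/bin/cr_restart.real below) is just an assumption for
illustration:

    /* Wrapper sketch: stand in for cr_restart, but forward to the real
     * binary with --no-restore-pid prepended. The install path of the
     * real binary below is an assumption for illustration only. */
    #include <stdlib.h>
    #include <unistd.h>

    int main(int argc, char *argv[])
    {
        /* Slots: real binary, --no-restore-pid, argv[1..argc-1], NULL. */
        char **args = calloc(argc + 2, sizeof(char *));
        if (args == NULL) {
            return 1;
        }
        args[0] = "/usr/bin/cr_restart.real";  /* assumed path to real cr_restart */
        args[1] = "--no-restore-pid";
        for (int i = 1; i < argc; i++) {
            args[i + 1] = argv[i];
        }
        args[argc + 1] = NULL;

        execv(args[0], args);
        return 1;  /* reached only if execv() fails */
    }

With this in place the restart itself succeeds, but as described above
the PIDs Open MPI recorded no longer match the PIDs actually running,
which is presumably why the subsequent ompi-checkpoint hangs.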
Long story short, I am stuck with this problem because I cannot get
the original PIDs back during restart.
I would really appreciate any other options you can share that would
help me work around this problem.
Thanks
Ananda