I just fixed the --stop bug that you highlighted (the fix is in r23627). As for the mpi4py program, I don't really know what to suggest. I don't have a setup to test this locally and am completely unfamiliar with mpi4py. Can you reproduce this with just a C program?
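
If a C version does hang the same way, backtraces from every rank (like the ones attached earlier in this thread for the mpi4py run) would probably be the most useful thing to send along. A rough sketch for gathering them in one go, assuming gdb is on the PATH and you are allowed to attach to the processes; the function names are made up for illustration and are not part of Open MPI or mpi4py:

```python
import subprocess
import sys


def gdb_backtrace_command(pid):
    """Build a gdb command line that dumps every thread's stack for one PID."""
    # -batch makes gdb exit (and detach) once the -ex command has run.
    return ["gdb", "-p", str(pid), "-batch", "-ex", "thread apply all bt"]


def collect_backtraces(pids, run=subprocess.run):
    """Return {pid: gdb output}; `run` can be swapped out for testing."""
    traces = {}
    for pid in pids:
        result = run(gdb_backtrace_command(pid), capture_output=True, text=True)
        traces[pid] = result.stdout
    return traces


if __name__ == "__main__":
    # PIDs of the stuck MPI processes are passed on the command line.
    for pid, trace in collect_backtraces(int(arg) for arg in sys.argv[1:]).items():
        print(f"==== PID {pid} ====")
        print(trace)
```

Saved under a name of your choosing and run with the PIDs of the stuck ranks as arguments, this prints one labelled trace per process, which should make it easier to see whether all ranks are stuck in the same place.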

-- Josh

On Aug 16, 2010, at 12:25 PM, <ananda.mu...@wipro.com> <ananda.mu...@wipro.com> wrote:

> Josh
>
> I have one more update on my observation while analyzing this issue.
>
> Just to refresh, I am using openmpi-trunk release 23596 with mpi4py-1.2.1 and
> BLCR 0.8.2. When I checkpoint the python script written using mpi4py, the
> program doesn’t progress after the checkpoint is taken successfully. I tried
> it with openmpi 1.4.2 and then tried it with the latest trunk version as
> suggested. I see the similar behavior in both the releases.
>
> I have one more interesting observation which I thought may be useful. I
> tried the "--stop" option of ompi-checkpoint (trunk version) and the mpirun
> prints the following error messages when I run the command
> "ompi-checkpoint --stop -v <pid of mpirun>":
>
> ==== Error messages in the window where mpirun command was running START ======================================
> [hpdcnln001:15148] Error: ( app) Passed an invalid handle (0) [5 ="/tmp/openmpi-sessions-amudar@hpdcnln001_0/37739/1"]
> [hpdcnln001:15148] [[37739,1],2] ORTE_ERROR_LOG: Error in file ../../../../../orte/mca/sstore/central/sstore_central_module.c at line 253
> [hpdcnln001:15149] Error: ( app) Passed an invalid handle (0) [5 ="/tmp/openmpi-sessions-amudar@hpdcnln001_0/37739/1"]
> [hpdcnln001:15149] [[37739,1],3] ORTE_ERROR_LOG: Error in file ../../../../../orte/mca/sstore/central/sstore_central_module.c at line 253
> [hpdcnln001:15146] Error: ( app) Passed an invalid handle (0) [5 ="/tmp/openmpi-sessions-amudar@hpdcnln001_0/37739/1"]
> [hpdcnln001:15146] [[37739,1],0] ORTE_ERROR_LOG: Error in file ../../../../../orte/mca/sstore/central/sstore_central_module.c at line 253
> [hpdcnln001:15147] Error: ( app) Passed an invalid handle (0) [5 ="/tmp/openmpi-sessions-amudar@hpdcnln001_0/37739/1"]
> [hpdcnln001:15147] [[37739,1],1] ORTE_ERROR_LOG: Error in file ../../../../../orte/mca/sstore/central/sstore_central_module.c at line 253
> ==== Error messages in the window where mpirun command was running END ======================================
>
> Please note that the checkpoint image was created at the end of it. However
> when I run the command "kill -CONT <pid of mpirun>", it fails to move forward
> which is same as the original problem I have reported.
>
> Let me know if you need any additional information.
>
> Thanks for your time in advance
>
> - Ananda
>
> Ananda B Mudar, PMP
> Senior Technical Architect
> Wipro Technologies
> Ph: 972 765 8093
> ananda.mu...@wipro.com
>
> From: Ananda Babu Mudar (WT01 - Energy and Utilities)
> Sent: Sunday, August 15, 2010 11:25 PM
> To: us...@open-mpi.org
> Subject: Re: [OMPI users] Checkpointing mpi4py program
> Importance: High
>
> Josh
>
> I tried running the mpi4py program with the latest trunk version of openmpi.
> I have compiled openmpi-1.7a1r23596 from trunk and recompiled mpi4py to use
> this library. Unfortunately I see the same behavior as I have seen with
> openmpi 1.4.2 ie; checkpoint will be successful but the program doesn’t
> proceed after that.
>
> I have attached the stack traces of all the MPI processes that are part of
> the mpirun. I really appreciate if you can take a look at the stack trace and
> let me know the potential problem. I am kind of stuck at this point and need
> your assistance to move forward. Please let me know if you need any
> additional information.
>
> Thanks for your time in advance
>
> Thanks
>
> Ananda
>
> -----Original Message-----
> Subject: Re: [OMPI users] Checkpointing mpi4py program
> From: Joshua Hursey (jjhursey_at_[hidden])
> Date: 2010-08-13 12:28:31
>
> Nope. I probably won't get to it for a while. I'll let you know if I do.
>
> On Aug 13, 2010, at 12:17 PM, <ananda.mudar_at_[hidden]> <ananda.mudar_at_[hidden]> wrote:
>
> > OK, I will do that.
> >
> > But did you try this program on a system where the latest trunk is
> > installed? Were you successful in checkpointing?
> >
> > - Ananda
> > -----Original Message-----
> > Message: 9
> > Date: Fri, 13 Aug 2010 10:21:29 -0400
> > From: Joshua Hursey <jjhursey_at_[hidden]>
> > Subject: Re: [OMPI users] users Digest, Vol 1658, Issue 2
> > To: Open MPI Users <users_at_[hidden]>
> > Message-ID: <7A43615B-A462-4C72-8112-496653D8F0A0_at_[hidden]>
> > Content-Type: text/plain; charset=us-ascii
> >
> > I probably won't have an opportunity to work on reproducing this on the
> > 1.4.2. The trunk has a bunch of bug fixes that probably will not be
> > backported to the 1.4 series (things have changed too much since that
> > branch). So I would suggest trying the 1.5 series.
> >
> > -- Josh
> >
> > On Aug 13, 2010, at 10:12 AM, <ananda.mudar_at_[hidden]> <ananda.mudar_at_[hidden]> wrote:
> >
> >> Josh
> >>
> >> I am having problems compiling the sources from the latest trunk. It
> >> complains of libgomp.spec missing even though that file exists on my
> >> system. I will see if I have to change any other environment variables
> >> to have a successful compilation. I will keep you posted.
> >>
> >> BTW, were you successful in reproducing the problem on a system with
> >> OpenMPI 1.4.2?
> >>
> >> Thanks
> >> Ananda
> >> -----Original Message-----
> >> Date: Thu, 12 Aug 2010 09:12:26 -0400
> >> From: Joshua Hursey <jjhursey_at_[hidden]>
> >> Subject: Re: [OMPI users] Checkpointing mpi4py program
> >> To: Open MPI Users <users_at_[hidden]>
> >> Message-ID: <1F1445AB-9208-4EF0-AF25-5926BD53C7E1_at_[hidden]>
> >> Content-Type: text/plain; charset=us-ascii
> >>
> >> Can you try this with the current trunk (r23587 or later)?
> >>
> >> I just added a number of new features and bug fixes, and I would be
> >> interested to see if it fixes the problem. In particular I suspect that
> >> this might be related to the Init/Finalize bounding of the checkpoint
> >> region.
> >>
> >> -- Josh
> >>
> >> On Aug 10, 2010, at 2:18 PM, <ananda.mudar_at_[hidden]> <ananda.mudar_at_[hidden]> wrote:
> >>
> >>> Josh
> >>>
> >>> Please find attached is the python program that reproduces the hang that
> >>> I described. Initial part of this file describes the prerequisite
> >>> modules and the steps to reproduce the problem. Please let me know if
> >>> you have any questions in reproducing the hang.
> >>>
> >>> Please note that, if I add the following lines at the end of the program
> >>> (in case sleep_time is True), the problem disappears ie; program resumes
> >>> successfully after successful completion of checkpoint.
> >>> # Add following lines at the end for sleep_time is True
> >>> else:
> >>>     time.sleep(0.1)
> >>> # End of added lines
> >>>
> >>>
> >>> Thanks a lot for your time in looking into this issue.
> >>>
> >>> Regards
> >>> Ananda
> >>>
> >>> Ananda B Mudar, PMP
> >>> Senior Technical Architect
> >>> Wipro Technologies
> >>> Ph: 972 765 8093
> >>> ananda.mudar_at_[hidden]
> >>>
> >>>
> >>> -----Original Message-----
> >>> Date: Mon, 9 Aug 2010 16:37:58 -0400
> >>> From: Joshua Hursey <jjhursey_at_[hidden]>
> >>> Subject: Re: [OMPI users] Checkpointing mpi4py program
> >>> To: Open MPI Users <users_at_[hidden]>
> >>> Message-ID: <270BD450-743A-4662-9568-1FEDFCC6F9C6_at_[hidden]>
> >>> Content-Type: text/plain; charset=windows-1252
> >>>
> >>> I have not tried to checkpoint an mpi4py application, so I cannot say
> >>> for sure if it works or not. You might be hitting something with the
> >>> Python runtime interacting in an odd way with either Open MPI or BLCR.
> >>>
> >>> Can you attach a debugger and get a backtrace on a stuck checkpoint?
> >>> That might show us where things are held up.
> >>>
> >>> -- Josh
> >>>
> >>>
> >>> On Aug 9, 2010, at 4:04 PM, <ananda.mudar_at_[hidden]> <ananda.mudar_at_[hidden]> wrote:
> >>>
> >>>> Hi
> >>>>
> >>>> I have integrated mpi4py with openmpi 1.4.2 that was built with BLCR
> >>>> 0.8.2. When I run ompi-checkpoint on the program written using mpi4py, I
> >>>> see that program doesn't resume sometimes after successful checkpoint
> >>>> creation. This doesn't occur always meaning the program resumes after
> >>>> successful checkpoint creation most of the time and completes
> >>>> successfully. Has anyone tested the checkpoint/restart functionality
> >>>> with mpi4py programs? Are there any best practices that I should keep in
> >>>> mind while checkpointing mpi4py programs?
> >>>>
> >>>> Thanks for your time
> >>>> - Ananda
> >>>> Please do not print this email unless it is absolutely necessary.
> >>>>
> >>>> The information contained in this electronic message and any
> >>>> attachments to this message are intended for the exclusive use of the
> >>>> addressee(s) and may contain proprietary, confidential or privileged
> >>>> information. If you are not the intended recipient, you should not
> >>>> disseminate, distribute or copy this e-mail. Please notify the sender
> >>>> immediately and destroy all copies of this message and any attachments.
> >>>>
> >>>> WARNING: Computer viruses can be transmitted via email. The recipient
> >>>> should check this email and any attachments for the presence of viruses.
> >>>> The company accepts no liability for any damage caused by any virus
> >>>> transmitted by this email.
> >>>>
> >>>> www.wipro.com
> >>>>
> >>>> _______________________________________________
> >>>> users mailing list
> >>>> users_at_[hidden]
> >>>> http://www.open-mpi.org/mailman/listinfo.cgi/users