I just fixed the --stop bug that you highlighted; the fix is in r23627.

As for the mpi4py program, I don't really know what to suggest. I don't have
a setup to test this locally, and I am completely unfamiliar with mpi4py. Can you
reproduce this with just a C program?
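
If it helps as a starting point for a port, this is roughly the shape of test I have in mind, kept in Python only so it lines up with your script. It is a sketch, not something I have run under checkpointing: `run_loop`, its arguments, and the choice of `allreduce` as the collective are my assumptions, not taken from your program. The `pause` argument stands in for the `time.sleep(0.1)` you mentioned, and the script falls back to a single process when mpi4py cannot be imported.

```python
import time

try:
    from mpi4py import MPI  # real MPI when mpi4py is installed
    comm = MPI.COMM_WORLD
except ImportError:
    comm = None  # single-process fallback so the loop can still be exercised


def run_loop(iterations, sleep_time=False, pause=0.1):
    """Hypothetical reproducer loop: one collective call per iteration.

    With sleep_time=True each rank pauses briefly after every iteration.
    Returns the sum computed in the last iteration.
    """
    rank = comm.Get_rank() if comm is not None else 0
    total = 0
    for step in range(iterations):
        local = rank + step
        # Collective: every rank contributes a value and receives the sum.
        total = comm.allreduce(local) if comm is not None else local
        if sleep_time:
            time.sleep(pause)  # brief pause between iterations
    return total


if __name__ == "__main__":
    # Raise iterations so the job lives long enough to be checkpointed.
    result = run_loop(iterations=10, sleep_time=True, pause=0.1)
    if comm is None or comm.Get_rank() == 0:
        print("final sum:", result)
```

Running it once with sleep_time=True and once with sleep_time=False under mpirun, and checkpointing each run, should show whether the pause alone changes the behavior.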

-- Josh

On Aug 16, 2010, at 12:25 PM, <ananda.mu...@wipro.com> <ananda.mu...@wipro.com> 
wrote:

> Josh
>  
> I have one more update on my observation while analyzing this issue.
>  
> Just to refresh, I am using openmpi-trunk release 23596 with mpi4py-1.2.1 and 
> BLCR 0.8.2. When I checkpoint the Python script written using mpi4py, the 
> program doesn’t progress after the checkpoint is taken successfully. I tried 
> it with openmpi 1.4.2 and then with the latest trunk version, as 
> suggested. I see similar behavior in both releases.
>  
> I have one more interesting observation which I thought may be useful. I 
> tried the “--stop” option of ompi-checkpoint (trunk version), and mpirun 
> prints the following error messages when I run the command “ompi-checkpoint 
> --stop -v <pid of mpirun>”:
>  
> ==== Error messages in the window where mpirun command was running START ====
> [hpdcnln001:15148] Error: (   app) Passed an invalid handle (0) [5 ="/tmp/openmpi-sessions-amudar@hpdcnln001_0/37739/1"]
> [hpdcnln001:15148] [[37739,1],2] ORTE_ERROR_LOG: Error in file ../../../../../orte/mca/sstore/central/sstore_central_module.c at line 253
> [hpdcnln001:15149] Error: (   app) Passed an invalid handle (0) [5 ="/tmp/openmpi-sessions-amudar@hpdcnln001_0/37739/1"]
> [hpdcnln001:15149] [[37739,1],3] ORTE_ERROR_LOG: Error in file ../../../../../orte/mca/sstore/central/sstore_central_module.c at line 253
> [hpdcnln001:15146] Error: (   app) Passed an invalid handle (0) [5 ="/tmp/openmpi-sessions-amudar@hpdcnln001_0/37739/1"]
> [hpdcnln001:15146] [[37739,1],0] ORTE_ERROR_LOG: Error in file ../../../../../orte/mca/sstore/central/sstore_central_module.c at line 253
> [hpdcnln001:15147] Error: (   app) Passed an invalid handle (0) [5 ="/tmp/openmpi-sessions-amudar@hpdcnln001_0/37739/1"]
> [hpdcnln001:15147] [[37739,1],1] ORTE_ERROR_LOG: Error in file ../../../../../orte/mca/sstore/central/sstore_central_module.c at line 253
> ==== Error messages in the window where mpirun command was running END ====
>  
> Please note that the checkpoint image was created at the end of it. However, 
> when I run the command “kill -CONT <pid of mpirun>”, the program fails to move 
> forward, which is the same as the original problem I reported.
>  
> Let me know if you need any additional information.
>  
> Thanks for your time in advance
>  
> -          Ananda
>  
> Ananda B Mudar, PMP
> Senior Technical Architect
> Wipro Technologies
> Ph: 972 765 8093
> ananda.mu...@wipro.com
>  
> From: Ananda Babu Mudar (WT01 - Energy and Utilities) 
> Sent: Sunday, August 15, 2010 11:25 PM
> To: us...@open-mpi.org
> Subject: Re: [OMPI users] Checkpointing mpi4py program
> Importance: High
>  
> Josh
> 
> I tried running the mpi4py program with the latest trunk version of openmpi. 
> I compiled openmpi-1.7a1r23596 from trunk and recompiled mpi4py to use 
> this library. Unfortunately, I see the same behavior as with 
> openmpi 1.4.2, i.e., the checkpoint succeeds but the program doesn’t 
> proceed after that.
> 
> I have attached the stack traces of all the MPI processes that are part of 
> the mpirun. I would really appreciate it if you could take a look at the stack 
> traces and let me know the potential problem. I am stuck at this point and need 
> your assistance to move forward. Please let me know if you need any 
> additional information.
> 
> Thanks for your time in advance
> 
> Thanks
> 
> Ananda
> 
> -----Original Message----- 
> Subject: Re: [OMPI users] Checkpointing mpi4py program
> From: Joshua Hursey (jjhursey_at_[hidden])
> Date: 2010-08-13 12:28:31
> 
> Nope. I probably won't get to it for a while. I'll let you know if I do.
> 
> On Aug 13, 2010, at 12:17 PM, <ananda.mudar_at_[hidden]> 
> <ananda.mudar_at_[hidden]> wrote:
> 
> > OK, I will do that. 
> > 
> > But did you try this program on a system where the latest trunk is 
> > installed? Were you successful in checkpointing? 
> > 
> > - Ananda 
> > -----Original Message----- 
> > Message: 9 
> > Date: Fri, 13 Aug 2010 10:21:29 -0400 
> > From: Joshua Hursey <jjhursey_at_[hidden]> 
> > Subject: Re: [OMPI users] users Digest, Vol 1658, Issue 2 
> > To: Open MPI Users <users_at_[hidden]> 
> > Message-ID: <7A43615B-A462-4C72-8112-496653D8F0A0_at_[hidden]> 
> > Content-Type: text/plain; charset=us-ascii 
> > 
> > I probably won't have an opportunity to work on reproducing this on the 
> > 1.4.2. The trunk has a bunch of bug fixes that probably will not be 
> > backported to the 1.4 series (things have changed too much since that 
> > branch). So I would suggest trying the 1.5 series. 
> > 
> > -- Josh 
> > 
> > On Aug 13, 2010, at 10:12 AM, <ananda.mudar_at_[hidden]> 
> > <ananda.mudar_at_[hidden]> wrote: 
> > 
> >> Josh 
> >> 
> >> I am having problems compiling the sources from the latest trunk. It 
> >> complains of libgomp.spec missing even though that file exists on my 
> >> system. I will see if I have to change any other environment variables 
> >> to have a successful compilation. I will keep you posted. 
> >> 
> >> BTW, were you successful in reproducing the problem on a system with 
> >> OpenMPI 1.4.2? 
> >> 
> >> Thanks 
> >> Ananda 
> >> -----Original Message----- 
> >> Date: Thu, 12 Aug 2010 09:12:26 -0400 
> >> From: Joshua Hursey <jjhursey_at_[hidden]> 
> >> Subject: Re: [OMPI users] Checkpointing mpi4py program 
> >> To: Open MPI Users <users_at_[hidden]> 
> >> Message-ID: <1F1445AB-9208-4EF0-AF25-5926BD53C7E1_at_[hidden]> 
> >> Content-Type: text/plain; charset=us-ascii 
> >> 
> >> Can you try this with the current trunk (r23587 or later)? 
> >> 
> >> I just added a number of new features and bug fixes, and I would be 
> >> interested to see if it fixes the problem. In particular I suspect that 
> >> this might be related to the Init/Finalize bounding of the checkpoint 
> >> region. 
> >> 
> >> -- Josh 
> >> 
> >> On Aug 10, 2010, at 2:18 PM, <ananda.mudar_at_[hidden]> 
> >> <ananda.mudar_at_[hidden]> wrote: 
> >> 
> >>> Josh 
> >>> 
> >>> Please find attached the Python program that reproduces the hang that 
> >>> I described. The initial part of this file describes the prerequisite 
> >>> modules and the steps to reproduce the problem. Please let me know if 
> >>> you have any questions in reproducing the hang. 
> >>> 
> >>> Please note that if I add the following lines at the end of the program 
> >>> (in case sleep_time is True), the problem disappears, i.e., the program 
> >>> resumes successfully after successful completion of the checkpoint. 
> >>> # Add following lines at the end for sleep_time is True 
> >>> else: 
> >>>     time.sleep(0.1) 
> >>> # End of added lines 
> >>> 
> >>> 
> >>> Thanks a lot for your time in looking into this issue. 
> >>> 
> >>> Regards 
> >>> Ananda 
> >>> 
> >>> Ananda B Mudar, PMP 
> >>> Senior Technical Architect 
> >>> Wipro Technologies 
> >>> Ph: 972 765 8093 
> >>> ananda.mudar_at_[hidden] 
> >>> 
> >>> 
> >>> -----Original Message----- 
> >>> Date: Mon, 9 Aug 2010 16:37:58 -0400 
> >>> From: Joshua Hursey <jjhursey_at_[hidden]> 
> >>> Subject: Re: [OMPI users] Checkpointing mpi4py program 
> >>> To: Open MPI Users <users_at_[hidden]> 
> >>> Message-ID: <270BD450-743A-4662-9568-1FEDFCC6F9C6_at_[hidden]> 
> >>> Content-Type: text/plain; charset=windows-1252 
> >>> 
> >>> I have not tried to checkpoint an mpi4py application, so I cannot say 
> >>> for sure if it works or not. You might be hitting something with the 
> >>> Python runtime interacting in an odd way with either Open MPI or BLCR. 
> >>> 
> >>> Can you attach a debugger and get a backtrace on a stuck checkpoint? 
> >>> That might show us where things are held up. 
> >>> 
> >>> -- Josh 
> >>> 
> >>> 
> >>> On Aug 9, 2010, at 4:04 PM, <ananda.mudar_at_[hidden]> 
> >>> <ananda.mudar_at_[hidden]> wrote: 
> >>> 
> >>>> Hi 
> >>>> 
> >>>> I have integrated mpi4py with openmpi 1.4.2 that was built with BLCR 
> >>>> 0.8.2. When I run ompi-checkpoint on a program written using mpi4py, I 
> >>>> see that the program sometimes doesn't resume after successful checkpoint 
> >>>> creation. This doesn't always occur: most of the time the program resumes 
> >>>> after successful checkpoint creation and completes successfully. Has 
> >>>> anyone tested the checkpoint/restart functionality with mpi4py programs? 
> >>>> Are there any best practices that I should keep in mind while 
> >>>> checkpointing mpi4py programs? 
> >>>> 
> >>>> Thanks for your time 
> >>>> - Ananda 
> >>>> Please do not print this email unless it is absolutely necessary. 
> >>>> 
> >>>> The information contained in this electronic message and any 
> >>> attachments to this message are intended for the exclusive use of the 
> >>> addressee(s) and may contain proprietary, confidential or privileged 
> >>> information. If you are not the intended recipient, you should not 
> >>> disseminate, distribute or copy this e-mail. Please notify the sender 
> >>> immediately and destroy all copies of this message and any 
> >> attachments. 
> >>>> 
> >>>> WARNING: Computer viruses can be transmitted via email. The 
> > recipient 
> >>> should check this email and any attachments for the presence of 
> >> viruses. 
> >>> The company accepts no liability for any damage caused by any virus 
> >>> transmitted by this email. 
> >>>> 
> >>>> www.wipro.com 
> >>>> 
> >>>> _______________________________________________ 
> >>>> users mailing list 
> >>>> users_at_[hidden] 
> >>>> http://www.open-mpi.org/mailman/listinfo.cgi/users 
> >
> 
> <ATT00001..txt>

