I probably won't have an opportunity to work on reproducing this on 1.4.2. The trunk has a number of bug fixes that probably will not be backported to the 1.4 series (things have changed too much since that branch was created). So I would suggest trying the 1.5 series.

-- Josh

On Aug 13, 2010, at 10:12 AM, <ananda.mu...@wipro.com> <ananda.mu...@wipro.com> wrote:

> Josh
>
> I am having problems compiling the sources from the latest trunk. It
> complains of libgomp.spec missing even though that file exists on my
> system. I will see if I have to change any other environment variables
> to have a successful compilation. I will keep you posted.
>
> BTW, were you successful in reproducing the problem on a system with
> OpenMPI 1.4.2?
>
> Thanks
> Ananda
>
> -----Original Message-----
> Date: Thu, 12 Aug 2010 09:12:26 -0400
> From: Joshua Hursey <jjhur...@open-mpi.org>
> Subject: Re: [OMPI users] Checkpointing mpi4py program
> To: Open MPI Users <us...@open-mpi.org>
> Message-ID: <1f1445ab-9208-4ef0-af25-5926bd53c...@open-mpi.org>
> Content-Type: text/plain; charset=us-ascii
>
> Can you try this with the current trunk (r23587 or later)?
>
> I just added a number of new features and bug fixes, and I would be
> interested to see if it fixes the problem. In particular I suspect that
> this might be related to the Init/Finalize bounding of the checkpoint
> region.
>
> -- Josh
>
> On Aug 10, 2010, at 2:18 PM, <ananda.mu...@wipro.com>
> <ananda.mu...@wipro.com> wrote:
>
>> Josh
>>
>> Please find attached is the python program that reproduces the hang that
>> I described. Initial part of this file describes the prerequisite
>> modules and the steps to reproduce the problem. Please let me know if
>> you have any questions in reproducing the hang.
>>
>> Please note that, if I add the following lines at the end of the program
>> (in case sleep_time is True), the problem disappears ie; program resumes
>> successfully after successful completion of checkpoint.
>> # Add following lines at the end for sleep_time is True
>> else:
>>     time.sleep(0.1)
>> # End of added lines
>>
>>
>> Thanks a lot for your time in looking into this issue.
>>
>> Regards
>> Ananda
>>
>> Ananda B Mudar, PMP
>> Senior Technical Architect
>> Wipro Technologies
>> Ph: 972 765 8093
>> ananda.mu...@wipro.com
>>
>>
>> -----Original Message-----
>> Date: Mon, 9 Aug 2010 16:37:58 -0400
>> From: Joshua Hursey <jjhur...@open-mpi.org>
>> Subject: Re: [OMPI users] Checkpointing mpi4py program
>> To: Open MPI Users <us...@open-mpi.org>
>> Message-ID: <270bd450-743a-4662-9568-1fedfcc6f...@open-mpi.org>
>> Content-Type: text/plain; charset=windows-1252
>>
>> I have not tried to checkpoint an mpi4py application, so I cannot say
>> for sure if it works or not. You might be hitting something with the
>> Python runtime interacting in an odd way with either Open MPI or BLCR.
>>
>> Can you attach a debugger and get a backtrace on a stuck checkpoint?
>> That might show us where things are held up.
>>
>> -- Josh
>>
>>
>> On Aug 9, 2010, at 4:04 PM, <ananda.mu...@wipro.com>
>> <ananda.mu...@wipro.com> wrote:
>>
>>> Hi
>>>
>>> I have integrated mpi4py with openmpi 1.4.2 that was built with BLCR
>>> 0.8.2. When I run ompi-checkpoint on the program written using mpi4py, I
>>> see that program doesn't resume sometimes after successful checkpoint
>>> creation. This doesn't occur always meaning the program resumes after
>>> successful checkpoint creation most of the time and completes
>>> successfully. Has anyone tested the checkpoint/restart functionality
>>> with mpi4py programs? Are there any best practices that I should keep in
>>> mind while checkpointing mpi4py programs?
>>>
>>> Thanks for your time
>>> - Ananda
>>> Please do not print this email unless it is absolutely necessary.
>>>
>>> The information contained in this electronic message and any
>>> attachments to this message are intended for the exclusive use of the
>>> addressee(s) and may contain proprietary, confidential or privileged
>>> information. If you are not the intended recipient, you should not
>>> disseminate, distribute or copy this e-mail. Please notify the sender
>>> immediately and destroy all copies of this message and any attachments.
>>>
>>> WARNING: Computer viruses can be transmitted via email. The recipient
>>> should check this email and any attachments for the presence of viruses.
>>> The company accepts no liability for any damage caused by any virus
>>> transmitted by this email.
>>>
>>> www.wipro.com
>>>
>>> _______________________________________________
>>> users mailing list
>>> us...@open-mpi.org
>>> http://www.open-mpi.org/mailman/listinfo.cgi/users
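
On the backtrace requested earlier in the thread: besides attaching a debugger, one possible way to see where the Python side of a stuck rank is held up is the standard-library faulthandler module. This is only a sketch and not something anyone in the thread used. It assumes a Python 3 interpreter, the helper name enable_hang_dump and the log file name are made up for illustration, and SIGUSR2 is an assumed choice of signal that may clash with signals used by the MPI or checkpointing layers in a given build.

```python
import faulthandler
import os
import signal
import tempfile
import time


def enable_hang_dump(log_path, signum=signal.SIGUSR2):
    """Write every thread's Python stack to log_path when signum arrives.

    signum defaults to SIGUSR2 purely as an assumption; choose a signal
    that your MPI and checkpointing layers do not already use.
    """
    log_file = open(log_path, "w")
    # faulthandler installs a C-level handler that writes straight to the
    # file descriptor, so it can report even while Python is blocked in C code.
    faulthandler.register(signum, file=log_file, all_threads=True)
    return log_file  # keep a reference so the file stays open


if __name__ == "__main__":
    # Stand-alone demonstration: signal ourselves and print the dump.
    path = os.path.join(tempfile.mkdtemp(), "stack-%d.log" % os.getpid())
    handle = enable_hang_dump(path)
    os.kill(os.getpid(), signal.SIGUSR2)
    time.sleep(0.1)  # give the handler a moment to finish writing
    faulthandler.unregister(signal.SIGUSR2)
    handle.close()
    with open(path) as dump:
        print(dump.read())
```

In an mpi4py script the call to enable_hang_dump would go near the top, before the first MPI call. When a rank fails to resume after a checkpoint, sending the chosen signal to that process (for example with kill -USR2 and its PID) should append the Python-level stacks to the log. It shows only Python frames, so a native backtrace from a debugger is still needed to see where Open MPI or BLCR is waiting.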