his. I've replied further below:
>
>
> - Original Message -
>> From: Joshua Hursey
> [...]
>> What other configure options are you passing to Open MPI? Specifically the
>> configure test will always fail if '--with-ft=cr' is not specified - by
>> default Open MPI will only build the BLCR component if C/R FT is requested by the user.
What version of BLCR are you using?
What other configure options are you passing to Open MPI? Specifically the
configure test will always fail if '--with-ft=cr' is not specified - by default
Open MPI will only build the BLCR component if C/R FT is requested by the user.
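For reference, a C/R-enabled build usually looks something like the following
(the BLCR path and install prefix below are only placeholders, and
--enable-ft-thread is optional, it just turns on the asynchronous checkpoint
thread):

  ./configure --prefix=/opt/openmpi-cr \
              --with-ft=cr \
              --with-blcr=/usr/local/blcr \
              --enable-ft-thread
  make all install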
Can you send a zip'ed up
There are also 2 sample result files (cpu.256^3.8N.*) which show the
> execution time difference between 2 cases.
> Hope you can take some time to find the problem.
> Thanks for your kindness.
>
> Best Regards,
> Nguyen Toan
>
> On Wed, Mar 2, 2011 at 3:00 AM, Joshua Hurs
parameter you mentioned but it did not help; the unknown
> overhead still exists.
> Here I attach the output of 'ompi_info', both version 1.5 and 1.5.1.
> Hope you can find out the problem.
> Thank you.
>
> Regards,
> Nguyen Toan
>
> On Wed, Feb 9, 2011 at 11:08 PM, Josh
Joshua Hursey
Postdoctoral Research Associate
Oak Ridge National Laboratory
http://users.nccs.gov/~jjhursey
and MPI_Wait. Also I want to make only one checkpoint
> per application execution for my purpose, but the unknown overhead exists
> even when no checkpoint was taken.
>
> Do you have any other idea?
>
> Regards,
> Nguyen Toan
>
>
> On Wed, Feb 9, 2011 at 12:41 AM, Josh
> overhead, and how to eliminate it?
> Thanks.
>
> Regards,
> Nguyen
Joshua Hursey
Postdoctoral Research Associate
Oak Ridge National Laboratory
http://users.nccs.gov/~jjhursey
On Jan 27, 2011, at 9:47 AM, Reuti wrote:
> Am 27.01.2011 um 15:23 schrieb Joshua Hursey:
>
>> The current version of Open MPI does not support continued operation of an
>> MPI application after process failure within a job. If a process dies, so
>> will the MPI job. N
this group into
> a working communicator?
>
> Thanks,
> Kirk
Joshua Hursey
Postdoctoral Research Associate
Oak Ridge National Laboratory
http://users.nccs.gov/~jjhursey
Joshua Hursey
Postdoctoral Research Associate
Oak Ridge National Laboratory
http://www.cs.indiana.edu/~jjhursey
Running - ...
> [blade02:27130] [221.25 / 221.71] Finished -
> ompi_global_snapshot_27115.ckpt
> Snapshot Ref.: 0 ompi_global_snapshot_27115.ckpt
>
> As you see, it takes 200+ seconds to checkpoint. By the way, what do the former and
> latter numbers represent?
Checkpointing directly
to the shared file system causes the application to remain suspended until its
file is completely written, which may take a considerable amount of time
depending on the speed of the file system. Staging considerably reduces the
impact of checkpointing on application runtime.
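If your build has the SStore framework (ompi_info will show it), staging can be
selected with MCA parameters along these lines; the exact parameter names should
be confirmed with ompi_info, this is only a sketch:

  # Checkpoint to node-local disk first, then drain to the shared directory
  mpirun -np 4 -am ft-enable-cr \
         -mca sstore stage \
         -mca sstore_stage_local_snapshot_dir /tmp/ckpt_local \
         ./my_app    # application name is an example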
I sugg
doc/html/FAQ.html#prelink
If that doesn't work then I would suggest trying the current Open MPI trunk.
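For reference, the workaround that FAQ entry describes amounts to disabling
prelinking before checkpointing; roughly (paths are Fedora/RHEL style, the exact
steps are in the FAQ):

  # Undo prelinking on existing binaries and libraries (run as root)
  prelink -ua
  # Then keep it off, e.g. set PRELINKING=no in /etc/sysconfig/prelink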
There should not be any problem with using NFS; since this is occurring in
MPI_Init, it is well before we ever try to use the file system. I also test
with NFS, and local staging on a f
I am pleased to announce that Open MPI now supports checkpoint/restart process
migration and automatic recovery. This is in addition to our current support
for more traditional checkpoint/restart fault tolerance. These new features
were introduced in the Open MPI development trunk in commit r235
attached the stack traces of all the MPI processes that are part of
> the mpirun. I would really appreciate it if you could take a look at the stack trace and
> let me know the potential problem. I am kind of stuck at this point and need
> your assistance to move forward. Please let me know if you
anda
> -Original Message-
> Message: 9
> Date: Fri, 13 Aug 2010 10:21:29 -0400
> From: Joshua Hursey
> Subject: Re: [OMPI users] users Digest, Vol 1658, Issue 2
> To: Open MPI Users
> Message-ID: <7a43615b-a462-4c72-8112-496653d8f...@open-mpi.org>
> Content-Typ
I will keep you posted.
>
> BTW, were you successful in reproducing the problem on a system with
> OpenMPI 1.4.2?
>
> Thanks
> Ananda
> -Original Message-
> Date: Thu, 12 Aug 2010 09:12:26 -0400
> From: Joshua Hursey
> Subject: Re: [OMPI users] Checkpoint
...@wipro.com
>
>
> -Original Message-
> Date: Mon, 9 Aug 2010 16:37:58 -0400
> From: Joshua Hursey
> Subject: Re: [OMPI users] Checkpointing mpi4py program
> To: Open MPI Users
> Message-ID: <270bd450-743a-4662-9568-1fedfcc6f...@open-mpi.org>
> Content-Ty
I have not tried to checkpoint an mpi4py application, so I cannot say for sure
if it works or not. You might be hitting something with the Python runtime
interacting in an odd way with either Open MPI or BLCR.
Can you attach a debugger and get a backtrace on a stuck checkpoint? That might
show
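Something like the following is usually enough, the PID being that of the stuck
process:

  # Attach non-interactively and dump a backtrace from every thread
  gdb -p 12345 -batch -ex 'thread apply all bt' > backtrace.txt 2>&1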
That is interesting. I cannot think of any reason why this might be causing a
problem just in Open MPI. popen() is similar to fork()/system() so you have to
be careful with interconnects that do not play nice with fork(), like openib.
But since it looks like you are excluding openib, this should
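For anyone following along, excluding openib is normally done through BTL
selection, for example:

  # Run with every BTL except openib; my_app is just a placeholder
  mpirun -np 4 -mca btl ^openib ./my_app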
cr_thread_sleep_wait=1000
This will throttle down the C/R thread when the application is in the MPI library.
You might want to play around with these MCA parameters to tune the
aggressiveness of the C/R thread to your performance needs. In the mean time I
will look into finding better default para
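As a sketch, that parameter can go on the command line or into
$HOME/.openmpi/mca-params.conf; I believe the full name carries the opal_
prefix, so verify it with ompi_info:

  # Throttle the C/R thread while the application is inside the MPI library
  mpirun -np 4 -am ft-enable-cr -mca opal_cr_thread_sleep_wait 1000 ./my_app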
There is some overhead involved when activating the current C/R functionality
in Open MPI due to the wrapping of the internal point-to-point stack. The
wrapper (CRCP framework) tracks the signature of each message (not the buffer,
so constant time for any size MPI message) so that when we need t
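For context, that wrapper only comes into play when C/R is requested at run
time, typically via the ft-enable-cr AMCA file, so comparing the two runs below
is a quick way to isolate the overhead (my_app is a placeholder):

  mpirun -np 4 -am ft-enable-cr ./my_app   # C/R (and CRCP tracking) enabled
  mpirun -np 4 ./my_app                    # plain run, no C/R wrapping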
On Mar 4, 2010, at 8:17 AM, Fernando Lemos wrote:
> On Wed, Mar 3, 2010 at 10:24 PM, Fernando Lemos wrote:
>
>> Is there anything I can do to provide more information about this bug?
>> E.g. try to compile the code in the SVN trunk? I also have kept the
>> snapshots intact, I can tar them up an
On Mar 3, 2010, at 3:42 PM, Fernando Lemos wrote:
> On Wed, Mar 3, 2010 at 5:31 PM, Joshua Hursey wrote:
>
>>
>> Yes, ompi-restart should be printing a helpful message and exiting normally.
>> Thanks for the bug report. I believe that I have seen and fixed this on
On Mar 2, 2010, at 9:17 AM, Fernando Lemos wrote:
> On Sun, Feb 28, 2010 at 11:11 PM, Fernando Lemos
> wrote:
>> Hello,
>>
>>
>> I'm trying to come up with a fault tolerant OpenMPI setup for research
>> purposes. I'm doing some tests now, but I'm stuck with a segfault when
>> I try to restart
You can use the 'checkpoint to local disk' example to checkpoint and restart
without access to a globally shared storage devices. There is an example on the
website that does not use a globally mounted file system:
http://www.osl.iu.edu/research/ft/ompi-cr/examples.php#uc-ckpt-local
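As a rough sketch of what that example does (the parameter names below are from
memory, the page above is the authoritative reference):

  # Each process checkpoints to its own local disk, and mpirun gathers the
  # pieces into a global snapshot directory on its node
  mpirun -np 4 -am ft-enable-cr \
         -mca crs_base_snapshot_dir /tmp/local_ckpt \
         -mca snapc_base_store_in_place 0 \
         -mca snapc_base_global_snapshot_dir $HOME/ckpt_global \
         ./my_app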
What versi
On Jan 14, 2010, at 8:20 AM, Andreea Costea wrote:
> Hi,
>
> I wanted to try the C/R feature in OpenMPI version 1.4.1 that I have
> downloaded today. When I want to checkpoint I am having the following error
> message:
> [[65192,0],0] ORTE_ERROR_LOG: Not found in file orte-checkpoint.c at line
On Jan 14, 2010, at 2:50 AM, Andreea Costea wrote:
> Hei there
>
> I have some questions regarding checkpoint/restart:
>
> 1. Until recently I thought that ompi-checkpoint and ompi-restart are used to
> checkpoint a process inside an MPI application. Now I reread this and I
> realized that actua
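In short, ompi-checkpoint and ompi-restart operate on the whole MPI job, not on
a single process; roughly (PID and snapshot name are placeholders):

  # Checkpoint the running job, identified by the PID of its mpirun
  ompi-checkpoint 4242
  # ... this prints a snapshot reference such as ompi_global_snapshot_4242.ckpt
  # Later, restart the entire job from that snapshot
  ompi-restart ompi_global_snapshot_4242.ckpt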
The --preload-* options to 'mpirun' currently use the ssh/scp commands (or
rsh/rcp via an MCA parameter) to move files from the machine local to the
'mpirun' command to the compute nodes during launch. This assumes that you have
Open MPI already installed on all of the machines. It was an option
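A hedged example of how they are typically used (option spellings as I recall
them, check 'mpirun --help' on your version):

  # Push the executable and an input file out to the compute nodes at launch
  mpirun -np 8 --preload-binary --preload-files input.dat ./my_app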
On Sep 25, 2009, at 7:10 AM, Mallikarjuna Shastry wrote:
Dear sir,
I am sending the details as follows:
1. I am using openmpi-1.3.3 and blcr 0.8.2
2. I have installed blcr 0.8.2 first under /root/MS
3. Then I installed openmpi 1.3.3 under /root/MS
4. I have configured and installed Open MPI as
On Sep 16, 2009, at 8:30 AM, Marcin Stolarek wrote:
Hi,
It seems I solved my problem. The root of the error was that I hadn't
loaded the BLCR module, so I couldn't checkpoint even a single-threaded
application.
I am glad to hear that you have things working now.
However, I still can't find MCA:blcr i
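For completeness, a quick way to check both pieces on a node (module and
component names are the usual BLCR ones):

  lsmod | grep blcr        # is the BLCR kernel module loaded?
  sudo modprobe blcr       # if not, load it (pulls in blcr_imports as well)
  ompi_info | grep -i crs  # does this build include the blcr CRS component?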