Hi Toan,

Thanks for your suggestion. It gives me the following result, which does not
tell me anything more.

hroman@cbl1 ~/checkpoints $ ompi-restart -v -machinefile ../semesterthesis/code/code2_self_example/my-hroman-cr-file.ckpt ompi_global_snapshot_28952.ckpt/
[cbl1:28974] Checking for the existence of (/home/hroman/checkpoints/ompi_global_snapshot_28952.ckpt)
[cbl1:28974] Restarting from file (ompi_global_snapshot_28952.ckpt/)
[cbl1:28974]      Exec in self
ssh: connect to host 15 port 22: Invalid argument
--------------------------------------------------------------------------
A daemon (pid 28975) died unexpectedly with status 255 while attempting
to launch so we are aborting.

There may be more information reported by the environment (see above).

This may be because the daemon was unable to find all the needed shared
libraries on the remote node. You may set your LD_LIBRARY_PATH to have the
location of the shared libraries on the remote nodes and this will
automatically be forwarded to the remote nodes.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpirun noticed that the job aborted, but has no info as to the process
that caused that situation.
--------------------------------------------------------------------------
hroman@cbl1 ~/checkpoints $ echo $LD_LIBRARY_PATH
/cluster/sw/blcr/0.8.2/x86_64/gcc//lib:/cluster/sw/openmpi/1.5.3_ft/x86_64/gcc/lib:/opt/intel/Compiler/11.1/056/lib/intel64

The library path seems to be OK, or should it look different? Do you have
another idea?
cheers
roman

________________________________
From: users-boun...@open-mpi.org [users-boun...@open-mpi.org] on behalf of
Nguyen Toan [nguyentoan1...@gmail.com]
Sent: Wednesday, 6 April 2011 13:20
To: Open MPI Users
Subject: Re: [OMPI users] openmpi self checkpointing - error while running
example

Hi Roman,

Did you try to checkpoint and restart with the "-machinefile" parameter? It may
work.
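
For reference, a minimal sketch of what this could look like, assuming the
option is given a plain-text list of hosts. The file name hosts.txt, the node
names, and the helper check_machinefile are hypothetical placeholders, not
taken from the cluster discussed here. The helper only flags lines whose first
field is purely numeric, which is a deliberately simple sanity check and not a
full validation.

```shell
#!/bin/sh
# Sketch only: hosts.txt and the node names below are hypothetical.
# A machinefile is plain text with one host per line (optionally slots=N).
cat > hosts.txt <<'EOF'
node01 slots=1
node02 slots=1
EOF

# Hypothetical helper: report non-comment, non-empty lines whose first
# field consists only of digits, and return non-zero if any are found.
check_machinefile() {
    awk '!/^[ \t]*(#|$)/ && $1 ~ /^[0-9]+$/ {
             print FILENAME ": line " NR ": suspicious host \"" $1 "\""
             bad = 1
         }
         END { exit bad }' "$1"
}

check_machinefile hosts.txt && echo "hosts.txt passed the basic check"

# The restart itself needs Open MPI built with checkpoint/restart support,
# so it is shown here only as a comment:
#   ompi-restart -v -machinefile hosts.txt ompi_global_snapshot_28952.ckpt
```

The same file could be passed to mpirun when the job is first started, so that
the checkpoint and the restart use the same set of nodes.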

Regards,
Nguyen Toan

On Wed, Apr 6, 2011 at 7:05 PM, Hellmüller Roman <hro...@student.ethz.ch> wrote:
Hi

I'm trying to get fault-tolerant Open MPI running on our cluster for my
semester thesis.

Build and compile were successful, and BLCR checkpointing works (Open MPI
1.5.3, BLCR 0.8.2).

Now I'm trying to set up SELF checkpointing. The example from
http://osl.iu.edu/research/ft/ompi-cr/examples.php does not work. I can run the
application and also take checkpoints, but restarting does not work. I got the
following error after doing as suggested:

mpicc my-app.c -export -export-dynamic -o my-app

mpirun -np 2 -am ft-enable-cr -mca crs_self_prefix my_personal my-app

hroman@cbl1 ~ $ ompi-restart ompi_global_snapshot_27167.ckpt/
--------------------------------------------------------------------------
Error: Unable to obtain the proper restart command to restart from the
      checkpoint file (opal_snapshot_0.ckpt). Returned -1.

--------------------------------------------------------------------------
--------------------------------------------------------------------------
Error: Unable to obtain the proper restart command to restart from the
      checkpoint file (opal_snapshot_1.ckpt). Returned -1.

--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpirun noticed that the job aborted, but has no info as to the process
that caused that situation.
--------------------------------------------------------------------------

I also tried setting the path in the example file (the restart_path variable),
changing the checkpoint directories, and running the application in different
directories.

Do you have an idea where the error could be?

Here
http://n.ethz.ch/~hroman/downloads/ompi_mailinglist.tar.gz
(40 MB) you'll find the library and the build of Open MPI and BLCR, as well as
the environment variables and the output of ompi_info. There is one set for the
login node and another for the compute nodes, because they run different
kernels. And here
http://n.ethz.ch/~hroman/downloads/ompi_global_snapshot_27167.ckpt.tar.gz
is the checkpoint that was produced. Please let me know if more output is
needed.

cheers
roman

_______________________________________________
users mailing list
us...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/users

