Hi Toan,

Thanks for your suggestion. It gives me the following result, which does not tell me anything more:

hroman@cbl1 ~/checkpoints $ ompi-restart -v -machinefile ../semesterthesis/code/code2_self_example/my-hroman-cr-file.ckpt ompi_global_snapshot_28952.ckpt/
[cbl1:28974] Checking for the existence of (/home/hroman/checkpoints/ompi_global_snapshot_28952.ckpt)
[cbl1:28974] Restarting from file (ompi_global_snapshot_28952.ckpt/)
[cbl1:28974] Exec in self
ssh: connect to host 15 port 22: Invalid argument
--------------------------------------------------------------------------
A daemon (pid 28975) died unexpectedly with status 255 while attempting
to launch so we are aborting.

There may be more information reported by the environment (see above).

This may be because the daemon was unable to find all the needed shared
libraries on the remote node. You may set your LD_LIBRARY_PATH to have the
location of the shared libraries on the remote nodes and this will
automatically be forwarded to the remote nodes.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpirun noticed that the job aborted, but has no info as to the process
that caused that situation.
--------------------------------------------------------------------------

hroman@cbl1 ~/checkpoints $ echo $LD_LIBRARY_PATH
/cluster/sw/blcr/0.8.2/x86_64/gcc//lib:/cluster/sw/openmpi/1.5.3_ft/x86_64/gcc/lib:/opt/intel/Compiler/11.1/056/lib/intel64

The library path seems to be OK, or should it look different? Do you have another idea?

Cheers,
Roman

________________________________
From: users-boun...@open-mpi.org [users-boun...@open-mpi.org] on behalf of Nguyen Toan [nguyentoan1...@gmail.com]
Sent: Wednesday, 6 April 2011 13:20
To: Open MPI Users
Subject: Re: [OMPI users] openmpi self checkpointing - error while running example

Hi Roman,

Did you try to checkpoint and restart with the parameter "-machinefile"? It may work.

Regards,
Nguyen Toan

On Wed, Apr 6, 2011 at 7:05 PM, Hellmüller Roman <hro...@student.ethz.ch> wrote:

Hi,

I'm trying to get fault-tolerant Open MPI running on our cluster for my semester thesis. Build and compile were successful, and BLCR checkpointing works (openmpi 1.5.3, blcr 0.8.2).

Now I'm trying to set up SELF checkpointing. The example from http://osl.iu.edu/research/ft/ompi-cr/examples.php does not work: I can run the application and also take checkpoints, but restarting fails. Following the steps suggested there, I got this error:

mpicc my-app.c -export -export-dynamic -o my-app
mpirun -np 2 -am ft-enable-cr -mca crs_self_prefix my_personal my-app

hroman@cbl1 ~ $ ompi-restart ompi_global_snapshot_27167.ckpt/
--------------------------------------------------------------------------
Error: Unable to obtain the proper restart command to restart from the
       checkpoint file (opal_snapshot_0.ckpt). Returned -1.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
Error: Unable to obtain the proper restart command to restart from the
       checkpoint file (opal_snapshot_1.ckpt). Returned -1.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpirun noticed that the job aborted, but has no info as to the process
that caused that situation.
--------------------------------------------------------------------------

I also tried setting the path in the example file (the restart_path variable), changing the checkpoint directories, and running the application in different directories...

Do you have an idea where the error could be?

Here http://n.ethz.ch/~hroman/downloads/ompi_mailinglist.tar.gz (40 MB) you'll find the library and the build of Open MPI and BLCR, as well as the environment variables and the output of ompi_info. There is one set for the login node and another for the compute nodes, because they run different kernels.

And here http://n.ethz.ch/~hroman/downloads/ompi_global_snapshot_27167.ckpt.tar.gz is the checkpoint that was produced.

Please let me know if more output is needed.

Cheers,
Roman

_______________________________________________
users mailing list
us...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/users