Hellmüller  Roman <hroman <at> student.ethz.ch> writes:

> 
> Hi
> 
> I'm trying to get fault-tolerant ompi running on our cluster for my semester thesis.
> 
> Build & compile were successful, blcr checkpointing works. openmpi 1.5.3, blcr 0.8.2
> 
> Now I'm trying to set up the SELF checkpointing. The example from
> http://osl.iu.edu/research/ft/ompi-cr/examples.php does not work. I can run the application and
> also take checkpoints, but restarting won't work. I got the following error by doing as suggested:
> 
> mpicc my-app.c -export -export-dynamic -o my-app
> 
> mpirun -np 2 -am ft-enable-cr -mca crs_self_prefix my_personal my-app
> 
> hroman <at> cbl1 ~ $ ompi-restart ompi_global_snapshot_27167.ckpt/
> --------------------------------------------------------------------------
> Error: Unable to obtain the proper restart command to restart from the
>        checkpoint file (opal_snapshot_0.ckpt). Returned -1.
> 
> --------------------------------------------------------------------------
> --------------------------------------------------------------------------
> Error: Unable to obtain the proper restart command to restart from the
>        checkpoint file (opal_snapshot_1.ckpt). Returned -1.
> 
> --------------------------------------------------------------------------
> --------------------------------------------------------------------------
> mpirun noticed that the job aborted, but has no info as to the process
> that caused that situation.
> --------------------------------------------------------------------------
> 
> I also tried setting the path in the example file (restart_path variable), changing the
> checkpoint directories, and running the application in different directories...
> 
> Do you have an idea where the error could be?
> 
> Here
> http://n.ethz.ch/~hroman/downloads/ompi_mailinglist.tar.gz
> (40MB) you'll find the library and the build of openmpi & blcr as well as the env variables and the output of
> ompi_info. There is one for the login and the other for the compute nodes due to different kernels. And here
> http://n.ethz.ch/~hroman/downloads/ompi_global_snapshot_27167.ckpt.tar.gz
> there is the produced checkpoint. Please let me know if more outputs are needed.
> 
> cheers
> roman
> 

Hi Roman,

Try putting the name of your executable at the end of the path:
char restart_path[128] = "/full/path/to/personal-cr";
Here 'personal-cr' is the executable, so restart_path names the program itself, not just the directory that contains it.
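
A minimal sketch of how this could look in the checkpoint callback. It assumes the callback is named my_personal_checkpoint (from the my_personal prefix passed on the mpirun line) and takes a char **restart_cmd, as in the example on the page linked above; the helper looks_like_executable_path and the 0 / -1 return convention are only illustrative assumptions, not part of Open MPI.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Placeholder path: it must end in the executable name, not in a directory. */
static char restart_path[128] = "/full/path/to/personal-cr";

/* Hypothetical helper: returns 1 if the path is absolute and does not end
 * in '/', i.e. it can name a file rather than a directory. */
static int looks_like_executable_path(const char *path)
{
    size_t len = strlen(path);
    return len > 1 && path[0] == '/' && path[len - 1] != '/';
}

/* Checkpoint callback (signature assumed from the SELF example).
 * Hands back a heap-allocated copy of restart_path as the restart command.
 * Returns 0 on success and -1 on failure (assumed convention). */
int my_personal_checkpoint(char **restart_cmd)
{
    size_t len;

    assert(restart_cmd != NULL);

    if (!looks_like_executable_path(restart_path)) {
        fprintf(stderr, "restart_path must be /full/path/to/<executable>\n");
        return -1;
    }

    len = strlen(restart_path) + 1;      /* include the terminating NUL */
    *restart_cmd = malloc(len);
    if (*restart_cmd == NULL) {
        return -1;
    }
    memcpy(*restart_cmd, restart_path, len);
    return 0;
}
```

With restart_path set to a directory such as "/full/path/to/", the sketch refuses to produce a restart command, which makes the mistake visible at checkpoint time rather than at ompi-restart time.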

I hope it helps.

Kind regards,
Faisal

