Hi Roman,

Did you try checkpointing and restarting with the "-machinefile" parameter?
It may work.

Regards,
Nguyen Toan

On Wed, Apr 6, 2011 at 7:05 PM, Hellmüller Roman <hro...@student.ethz.ch> wrote:

> Hi
>
> I'm trying to get fault-tolerant Open MPI running on our cluster for my
> semester thesis.
>
> Build and compile were successful, and BLCR checkpointing works (Open MPI
> 1.5.3, BLCR 0.8.2).
>
> Now I'm trying to set up SELF checkpointing. The example from
> http://osl.iu.edu/research/ft/ompi-cr/examples.php does not work. I can
> run the application and take checkpoints, but restarting fails. I got the
> following error after doing as suggested:
>
> mpicc my-app.c -export -export-dynamic -o my-app
>
> mpirun -np 2 -am ft-enable-cr -mca crs_self_prefix my_personal my-app
>
> hroman@cbl1 ~ $ ompi-restart ompi_global_snapshot_27167.ckpt/
> --------------------------------------------------------------------------
> Error: Unable to obtain the proper restart command to restart from the
>       checkpoint file (opal_snapshot_0.ckpt). Returned -1.
>
> --------------------------------------------------------------------------
> --------------------------------------------------------------------------
> Error: Unable to obtain the proper restart command to restart from the
>       checkpoint file (opal_snapshot_1.ckpt). Returned -1.
>
> --------------------------------------------------------------------------
> --------------------------------------------------------------------------
> mpirun noticed that the job aborted, but has no info as to the process
> that caused that situation.
> --------------------------------------------------------------------------
>
> I also tried setting the path in the example file (the restart_path
> variable), changing the checkpoint directories, and running the application
> from different directories.
>
> Do you have an idea where the error could be?
>
> At http://n.ethz.ch/~hroman/downloads/ompi_mailinglist.tar.gz (40MB)
> you'll find the library and the build of Open MPI and BLCR, as well as the
> environment variables and the output of ompi_info. There is one set for the
> login node and another for the compute nodes, because they run different
> kernels. The checkpoint that was produced is at
> http://n.ethz.ch/~hroman/downloads/ompi_global_snapshot_27167.ckpt.tar.gz
> Please let me know if more output is needed.
>
> cheers
> roman
>
> _______________________________________________
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users
>
