Hi Roman,

Did you try to checkpoint and restart with the "-machinefile" parameter? It may work.
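In case it helps, here is a rough sketch of what I mean for the restart step. The host names (node01, node02) and the file name "machinefile" are only placeholders I made up, so replace them with your compute nodes. The snapshot name is the one from your mail. I am assuming your ompi-restart build accepts "-machinefile"; please check "ompi-restart --help" on your installation first.

```shell
# Hypothetical host names: replace with the compute nodes of your cluster.
cat > machinefile <<'EOF'
node01
node02
EOF

# Snapshot directory name taken from the original post.
SNAPSHOT=ompi_global_snapshot_27167.ckpt

# Only attempt the restart if ompi-restart is actually in PATH.
if command -v ompi-restart >/dev/null 2>&1; then
    # Assumption: this ompi-restart build supports -machinefile.
    ompi-restart -machinefile machinefile "$SNAPSHOT"
else
    echo "ompi-restart not found in PATH" >&2
fi
```

If you try this, start the original job with the same machinefile as well (mpirun -machinefile machinefile ...), so that checkpoint and restart see the same set of nodes.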
Regards,
Nguyen Toan

On Wed, Apr 6, 2011 at 7:05 PM, Hellmüller Roman <hro...@student.ethz.ch> wrote:
> Hi
>
> I'm trying to get fault-tolerant ompi running on our cluster for my
> semester thesis.
>
> Build & compile were successful, blcr checkpointing works. openmpi 1.5.3,
> blcr 0.8.2
>
> Now I'm trying to set up the SELF checkpointing. The example from
> http://osl.iu.edu/research/ft/ompi-cr/examples.php does not work. I can
> run the application and also do checkpoints, but restarting won't work. I
> got the following error by doing as suggested:
>
> mpicc my-app.c -export -export-dynamic -o my-app
>
> mpirun -np 2 -am ft-enable-cr -mca crs_self_prefix my_personal my-app
>
> hroman@cbl1 ~ $ ompi-restart ompi_global_snapshot_27167.ckpt/
> --------------------------------------------------------------------------
> Error: Unable to obtain the proper restart command to restart from the
> checkpoint file (opal_snapshot_0.ckpt). Returned -1.
>
> --------------------------------------------------------------------------
> --------------------------------------------------------------------------
> Error: Unable to obtain the proper restart command to restart from the
> checkpoint file (opal_snapshot_1.ckpt). Returned -1.
>
> --------------------------------------------------------------------------
> --------------------------------------------------------------------------
> mpirun noticed that the job aborted, but has no info as to the process
> that caused that situation.
> --------------------------------------------------------------------------
>
> I also tried setting the path in the example file (restart_path
> variable), changing the checkpoint directories, and running the application
> in different directories...
>
> Do you have an idea where the error could be?
>
> Here
> http://n.ethz.ch/~hroman/downloads/ompi_mailinglist.tar.gz (40MB)
> you'll find the library and the build of openmpi & blcr as well as the env
> variables and the output of ompi_info. There is one for the login and the
> other for the compute nodes due to different kernels. And here
> http://n.ethz.ch/~hroman/downloads/ompi_global_snapshot_27167.ckpt.tar.gz
> there is the produced checkpoint. Please let me know if more outputs are
> needed.
>
> cheers
> roman
>
> _______________________________________________
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users
>