Hi,
I'm trying to get fault-tolerant Open MPI running on our cluster for my
semester thesis.
On the login node I was successful: checkpointing works.
Since the compute nodes have different kernels, I had to compile BLCR again on
the compute nodes. BLCR works on the compute nodes. After that I installed
Open MPI (1.5.3) on the compute nodes. Running a normal MPI program works, and
running it with -am ft-enable-cr also works, but as soon as I try to take a
checkpoint, it fails:
hroman@node15 ~/semesterthesis/code/code1_heat1d $ mpirun -np 4 -am ft-enable-cr ./heatft_mpi
hroman@node15 ~ $ ps -a
PID TTY TIME CMD
22488 pts/0 00:00:00 pbs_mom
22536 pts/0 00:00:00 bash
22631 pts/0 00:00:00 mpirun
22633 pts/0 00:00:03 heatft_mpi
22634 pts/0 00:00:03 heatft_mpi
22635 pts/0 00:00:03 heatft_mpi
22636 pts/0 00:00:03 heatft_mpi
22743 pts/1 00:00:00 ps
hroman@node15 ~ $ ompi-checkpoint 22631
--------------------------------------------------------------------------
Error: Unable to find a list of active MPIRUN processes on this machine.
This could be due to one of the following:
- The PID specified (22631) is not that of an active MPIRUN.
- The session directory location could not be found/parsed.
ompi-checkpoint attempted to find the session directory:
/tmp//openmpi-sessions-hroman@node15_0
Check to make sure that this directory exists while the MPIRUN
process is running.
Return Code: -13 (Not found)
--------------------------------------------------------------------------
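The error message asks to confirm that the session directory exists while mpirun is still running. A minimal sketch of that check follows; the function name check_session_dir is made up for illustration, and the path in the usage comment is copied from the error above with the double slash collapsed. It only tests for the directory and says nothing about whether its contents are valid.

```shell
#!/bin/sh
# Hypothetical helper (not part of Open MPI): report whether a directory exists.
# Prints "present: <dir>" and returns 0, or "missing: <dir>" and returns 1.
check_session_dir() {
    dir="$1"
    if [ -d "$dir" ]; then
        echo "present: $dir"
        return 0
    else
        echo "missing: $dir"
        return 1
    fi
}

# Example call, to be run on the same node while mpirun is active.
# The path is an assumption taken from the error message above:
#   check_session_dir "/tmp/openmpi-sessions-hroman@node15_0"
```

If the directory is reported as missing while the job is running, listing /tmp for other openmpi-sessions-* entries may show whether mpirun created its session directory under a different name or location.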
I've tried it with another application, but that doesn't change anything. I
also tried to set the checkpoint directories in
$prefix/etc/openmpi-mca-params.conf, but that didn't seem to have any effect.
However, if I write errors into this file (something that is not a parameter,
e.g. "hello world"), it complains, so the file does seem to be read.
I also checked the environment variables, and they seem to be OK as far as I
can tell.
Do you have an idea where the error could be?
Here:
http://n.ethz.ch/~hroman/downloads/ompi_mailinglist.tar.gz
(40MB) you'll find the library and the build of Open MPI and BLCR, as well as
the environment variables and the output of ompi_info. Please let me know if
more output is needed.
Cheers,
Roman