Hi

I noticed that the directory /tmp/openmpi-sessions-hroman@cbl1_0 is created on the login nodes but not on the compute nodes. By setting orte_tmpdir_base=/tmp in $prefix/etc/openmpi-mca-params.conf I could make sure that the session directory is created.
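For reference, a minimal sketch of that entry, assuming the usual "name = value" syntax of the MCA parameter file (the comment line is only illustrative):

```
# $prefix/etc/openmpi-mca-params.conf
# Base directory under which the openmpi-sessions-* directory is created
orte_tmpdir_base = /tmp
```

The same parameter can presumably also be passed for a single run with "mpirun -mca orte_tmpdir_base /tmp ...", which makes it easier to test without editing the file.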
But when I now try to checkpoint an application, I get:

shell1:
--------------------------------------------------------------------------
mpirun noticed that process rank 0 with PID 11175 on node node7 exited on signal 10 (User defined signal 1).
--------------------------------------------------------------------------

shell2:
hroman@node7 ~/semesterthesis/code/code1_heat1d $ ps -a
  PID TTY          TIME CMD
 9713 pts/0    00:00:02 pbs_mom
 9761 pts/0    00:00:00 bash
11170 pts/0    00:00:00 mpirun
11175 pts/0    00:00:06 heatft_mpi
11178 pts/1    00:00:00 ps
hroman@node7 ~/semesterthesis/code/code1_heat1d $ ompi-checkpoint -v 11170
[node7:11184] [ 0.00 / 0.01] Requested - ...
[node7:11184] [ 0.00 / 0.01] Pending - ...

which never returns and does not seem to do anything. Do you have an idea what I could try to make it work?

cheers
roman

________________________________________
From: users-boun...@open-mpi.org [users-boun...@open-mpi.org] on behalf of Hellmüller Roman [hro...@student.ethz.ch]
Sent: Wednesday, 30 March 2011 16:33
To: us...@open-mpi.org
Subject: [OMPI users] Fault tolerant ompi - Error: Unable to find a list of active MPIRUN processes on this machine.

Hi

I'm trying to get fault tolerant ompi running on our cluster for my semester thesis. On the login node I was successful; checkpointing works. Since the compute nodes have different kernels, I had to compile BLCR on the compute nodes again. BLCR on the compute nodes works. After that I installed Open MPI (1.5.3) on the compute nodes. Running a normal MPI program works.
Running it with -am ft-enable-cr also works, but as soon as I try to take a checkpoint it fails:

hroman@node15 ~/semesterthesis/code/code1_heat1d $ mpirun -np 4 -am ft-enable-cr ./heatft_mpi

hroman@node15 ~ $ ps -a
  PID TTY          TIME CMD
22488 pts/0    00:00:00 pbs_mom
22536 pts/0    00:00:00 bash
22631 pts/0    00:00:00 mpirun
22633 pts/0    00:00:03 heatft_mpi
22634 pts/0    00:00:03 heatft_mpi
22635 pts/0    00:00:03 heatft_mpi
22636 pts/0    00:00:03 heatft_mpi
22743 pts/1    00:00:00 ps
hroman@node15 ~ $ ompi-checkpoint 22631
--------------------------------------------------------------------------
Error: Unable to find a list of active MPIRUN processes on this machine.
       This could be due to one of the following:
        - The PID specified (22631) is not that of an active MPIRUN.
        - The session directory location could not be found/parsed.

       ompi-checkpoint attempted to find the session directory:
         /tmp//openmpi-sessions-hroman@node15_0
       Check to make sure that this directory exists while the MPIRUN
       process is running.

Return Code: -13 (Not found)
--------------------------------------------------------------------------

I've tried it with another application; that doesn't change anything. I also tried to set the checkpoint directories in $prefix/etc/openmpi-mca-params.conf, but that didn't seem to have any effect. However, if I write errors into this file (something that is not a parameter, e.g. "hello world"), it complains, so it seems to read the file. I also checked the environment variables, but they seem to be OK as far as I can tell. Do you have an idea where the error could be?

Here http://n.ethz.ch/~hroman/downloads/ompi_mailinglist.tar.gz (40MB) you'll find the library and the build of Open MPI and BLCR, as well as the environment variables and the output of ompi_info. Please let me know if more output is needed.
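The check suggested by the error message (does the session directory exist while mpirun is running?) can be sketched as a small shell snippet. This is only an illustration: the function names session_dir and check_session_dir are made up here, and the path pattern "openmpi-sessions-<user>@<host>_0" is simply copied from the paths shown above, so it may differ on other setups.

```shell
#!/bin/sh
# Hypothetical helpers, not part of Open MPI.

# Build the session directory path from the pattern seen in the error message.
# $1 = tmpdir base (e.g. /tmp), $2 = user name, $3 = node name
session_dir() {
    printf '%s/openmpi-sessions-%s@%s_0\n' "$1" "$2" "$3"
}

# Report whether that directory exists; returns non-zero if it is missing.
check_session_dir() {
    dir=$(session_dir "$1" "$2" "$3")
    if [ -d "$dir" ]; then
        echo "found: $dir"
    else
        echo "missing: $dir"
        return 1
    fi
}

# Run on the compute node while mpirun is active.
# Assumption: the node name Open MPI uses matches `uname -n`.
check_session_dir /tmp "$(id -un)" "$(uname -n)" || true
```

Running this on node15 in a second shell while the job is active would show whether /tmp/openmpi-sessions-hroman@node15_0 is actually there.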
cheers
roman