Hi

I noticed that the directory /tmp/openmpi-sessions-hroman@cbl1_0 is created on
the login nodes but not on the compute nodes. By setting orte_tmpdir_base=/tmp
in $prefix/etc/openmpi-mca-params.conf I could make sure that the session
directory is created.
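
A minimal sketch of how that setting could be written into the params file. The helper name set_mca_param is hypothetical (it is not part of Open MPI), and the demo writes to a scratch file rather than the real $prefix/etc/openmpi-mca-params.conf:

```shell
# Hypothetical helper: make sure exactly one "name = value" line for the
# given MCA parameter is present in a params file.
set_mca_param() {
    file="$1"; name="$2"; value="$3"
    touch "$file"
    # Drop any earlier assignment of the same parameter, then append the new one.
    grep -v "^[[:space:]]*${name}[[:space:]]*=" "$file" > "$file.tmp" || true
    printf '%s = %s\n' "$name" "$value" >> "$file.tmp"
    mv "$file.tmp" "$file"
}

# Demo against a scratch file; on a real install the target would be
# $prefix/etc/openmpi-mca-params.conf (assumption: it is writable by you).
conf="$(mktemp)"
set_mca_param "$conf" orte_tmpdir_base /tmp
cat "$conf"
```

The same parameter can also be given for a single run on the command line, e.g. mpirun -mca orte_tmpdir_base /tmp ..., which avoids touching the installed file.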

But when I now try to checkpoint an application, I get:

shell1:
--------------------------------------------------------------------------
mpirun noticed that process rank 0 with PID 11175 on node node7 exited on 
signal 10 (User defined signal 1).
--------------------------------------------------------------------------

shell2:
hroman@node7 ~/semesterthesis/code/code1_heat1d $ ps -a
  PID TTY          TIME CMD
 9713 pts/0    00:00:02 pbs_mom
 9761 pts/0    00:00:00 bash
11170 pts/0    00:00:00 mpirun
11175 pts/0    00:00:06 heatft_mpi
11178 pts/1    00:00:00 ps
hroman@node7 ~/semesterthesis/code/code1_heat1d $ ompi-checkpoint -v 11170
[node7:11184] [  0.00 /   0.01]                 Requested - ...
[node7:11184] [  0.00 /   0.01]                   Pending - ...

which never returns and does not seem to do anything.

Do you have any idea what I could try to make it work?

cheers
roman


________________________________________
From: users-boun...@open-mpi.org [users-boun...@open-mpi.org] on behalf of
Hellmüller Roman [hro...@student.ethz.ch]
Sent: Wednesday, 30 March 2011 16:33
To: us...@open-mpi.org
Subject: [OMPI users] Fault tolerant ompi - Error: Unable to find a list of
active MPIRUN processes on this machine.

Hi

I'm trying to get fault-tolerant Open MPI running on our cluster for my
semester thesis.

On the login node I was successful; checkpointing works.
Since the compute nodes have different kernels, I had to compile BLCR on the
compute nodes again. BLCR works on the compute nodes. After that I installed
Open MPI (1.5.3) on the compute nodes. Running a normal MPI program works,
and running it with -am ft-enable-cr also works, but as soon as I try
to take a checkpoint it fails:

hroman@node15 ~/semesterthesis/code/code1_heat1d $ mpirun -np 4 -am ft-enable-cr ./heatft_mpi

hroman@node15 ~ $ ps -a
  PID TTY          TIME CMD
22488 pts/0    00:00:00 pbs_mom
22536 pts/0    00:00:00 bash
22631 pts/0    00:00:00 mpirun
22633 pts/0    00:00:03 heatft_mpi
22634 pts/0    00:00:03 heatft_mpi
22635 pts/0    00:00:03 heatft_mpi
22636 pts/0    00:00:03 heatft_mpi
22743 pts/1    00:00:00 ps

hroman@node15 ~ $ ompi-checkpoint 22631
--------------------------------------------------------------------------
Error: Unable to find a list of active MPIRUN processes on this machine.
       This could be due to one of the following:
        - The PID specified (22631) is not that of an active MPIRUN.
        - The session directory location could not be found/parsed.

       ompi-checkpoint attempted to find the session directory:
         /tmp//openmpi-sessions-hroman@node15_0
       Check to make sure that this directory exists while the MPIRUN
       process is running.

       Return Code: -13 (Not found)

--------------------------------------------------------------------------
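
The check the error message asks for can be sketched as follows. The function name session_dir is hypothetical, and the path pattern (base directory, user, host, trailing _0) is simply copied from the message above; it is an assumption that other setups follow the same pattern and that hostname prints the short node name:

```shell
# Build the session directory path in the form shown in the error message.
# Arguments (all optional): base dir, user name, host name.
session_dir() {
    base="${1:-/tmp}"
    user="${2:-$(id -un)}"
    host="${3:-$(hostname)}"
    printf '%s/openmpi-sessions-%s@%s_0\n' "$base" "$user" "$host"
}

# Run this on the node where mpirun is running, while it is running.
dir="$(session_dir /tmp)"
if [ -d "$dir" ]; then
    echo "session directory present: $dir"
else
    echo "session directory missing: $dir"
fi
```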

I've tried it with another application, but that doesn't change anything. I also
tried to set the checkpoint directories in $prefix/etc/openmpi-mca-params.conf,
but that didn't seem to have any effect. However, if I put invalid content in
this file (something that is not a parameter, e.g. "hello world"), it complains,
so it seems to read the file.
I also checked the environment variables, but they seem to be OK as far as I
can tell.

Do you have any idea where the error could be?

Here
http://n.ethz.ch/~hroman/downloads/ompi_mailinglist.tar.gz
(40MB) you'll find the library and the build of Open MPI & BLCR as well as the
env variables and the output of ompi_info. Please let me know if more output
is needed.

cheers
roman

_______________________________________________
users mailing list
us...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/users
