Josh,
I'm responding to some outstanding questions about the environment I'm trying
to ompi-restart in.
My answers to your questions are sprinkled below, and include a few more 
questions based on attempts I've made to get a multi-node restart working.

thanks,
Sharon

Sharon Brunett wrote:
Josh Hursey wrote:
On Apr 23, 2008, at 4:04 PM, Sharon Brunett wrote:

Hello,
I'm using openmpi-1.3a1r18241 on a 2-node configuration and having trouble with ompi-restart. I can successfully ompi-checkpoint and ompi-restart a 1-way MPI code.
When I try a 2-way job running across 2 nodes, I get

bash-2.05b$ ompi-restart -verbose ompi_global_snapshot_926.ckpt
[shc005:01159] Checking for the existence of (/home/sharon/ompi_global_snapshot_926.ckpt)
[shc005:01159] Restarting from file (ompi_global_snapshot_926.ckpt)
[shc005:01159]   Exec in self
Restart failed: Permission denied
Restart failed: Permission denied

This error is coming from BLCR. A few things to check.

First take a look at /var/log/messages on the machine(s) you are trying to restart on. Per:
  http://upc-bugs.lbl.gov/blcr/doc/html/FAQ.html#eperm

Next check to make sure prelinking is turned off on the two machines you are using. Per:
  http://upc-bugs.lbl.gov/blcr/doc/html/FAQ.html#prelink

Those will rule out some common BLCR problems. (more below)

If I try running as root, using the same snapshot file, the code restarts ok, but both tasks end up on the same node, rather than one per node (like the original mpirun).
You should never have to run as root to restart a process (or to run Open MPI in any form). So I'm wondering if your user has permissions to access the checkpoint files that BLCR is generating. You can look at the permissions for the individual checkpoint files by looking into the checkpoint handler directory. They are a bit hidden, so something like the following should expose them:
-------------------
shell$ ls -la /home/sharon/ompi_global_snapshot_926.ckpt/0/opal_snapshot_0.ckpt/
total 1756
drwx------  2 sharon users    4096 Apr 23 16:29 .
drwx------  4 sharon users    4096 Apr 23 16:29 ..
-rw-------  1 sharon users 1780180 Apr 23 16:29 ompi_blcr_context.31849
-rw-r--r--  1 sharon users      35 Apr 23 16:29 snapshot_meta.data
shell$
shell$ ls -la /home/sharon/ompi_global_snapshot_926.ckpt/0/opal_snapshot_1.ckpt/
total 1756
drwx------  2 sharon users    4096 Apr 23 16:29 .
drwx------  4 sharon users    4096 Apr 23 16:29 ..
-rw-------  1 sharon users 1780180 Apr 23 16:29 ompi_blcr_context.31850
-rw-r--r--  1 sharon users      35 Apr 23 16:29 snapshot_meta.data
-------------------

The BLCR-generated context files are named "ompi_blcr_context.PID", and you need to make sure that you have sufficient permissions to access those files (something like the listing above).
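
The permission check described above could be scripted roughly as follows. This is only a sketch: check_ckpt_perms is a hypothetical helper name, not something Open MPI or BLCR provides, and it only assumes the ompi_blcr_context.PID naming described above.

```shell
# Sketch only: check_ckpt_perms is a hypothetical helper, not part of
# Open MPI or BLCR. It lists every BLCR context file under a global
# snapshot directory that the current user cannot read.
check_ckpt_perms() {
    snapshot_dir="$1"
    if [ ! -d "$snapshot_dir" ]; then
        echo "not a directory: $snapshot_dir" >&2
        return 2
    fi
    # Context files are named ompi_blcr_context.PID; print the unreadable ones.
    unreadable=$(find "$snapshot_dir" -type f -name 'ompi_blcr_context.*' \
        -exec sh -c '[ -r "$1" ] || printf "%s\n" "$1"' _ {} \;)
    if [ $? -ne 0 ]; then
        # find itself failed, e.g. a subdirectory could not be entered.
        echo "could not traverse $snapshot_dir" >&2
        return 3
    fi
    if [ -n "$unreadable" ]; then
        printf 'cannot read:\n%s\n' "$unreadable"
        return 1
    fi
    echo "all context files readable"
    return 0
}
```

It would be run as the user doing the restart, for example: check_ckpt_perms /home/sharon/ompi_global_snapshot_926.ckpt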

I'm using BLCR version 0.6.5.
I generate checkpoints via 'ompi-checkpoint pid'
where pid is the pid of the mpirun task below

mpirun -np 2 -am ft-enable-cr ./xhpl

Are you running in a managed environment (e.g., using Torque or Slurm)? Odds are that once you switched to root you lost the environment variables for your allocation (which is how Open MPI detects when to use an allocation). This would explain why the processes were restarted on one node instead of two.
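
One way to see whether the allocation is still visible in the current shell is sketched below. It assumes Torque, which exports PBS_NODEFILE inside a job; show_allocation is a hypothetical helper name used only for illustration.

```shell
# Sketch only: show_allocation is a hypothetical helper.
# Under Torque, PBS_NODEFILE names a file listing the allocated hosts,
# one line per slot. If the variable is missing (for example after
# switching to a fresh root login shell), no allocation is visible.
show_allocation() {
    if [ -z "${PBS_NODEFILE:-}" ] || [ ! -r "$PBS_NODEFILE" ]; then
        echo "no Torque allocation visible"
        return 1
    fi
    # Print each allocated host once.
    sort -u "$PBS_NODEFILE"
}
```

Running show_allocation in the shell used for the original mpirun and again in the shell used for ompi-restart should show whether the two see the same set of nodes.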

Maui/torque is the scheduler/resource manager combo being used. I have been 
trying, to no avail, to push a machinefile (listing the hostnames of the nodes 
given to me by maui/torque) at ompi-restart which can in turn pass this on to 
mpirun. Any suggestions on how to do this? --verbose passed to ompi-restart 
isn't very verbose about what's going on.


ompi-restart uses mpirun underneath to do the process launch in exactly the same way as a normal mpirun, so the mapping of processes should be the same. That being said, there is a bug that I'm tracking in which the mapping differs. This bug has nothing to do with restarting processes, and more to do with a bookkeeping error when using app files.


Right, I doubt the bug has anything to do with my basic problem: the MPI tasks
are not launched across 2 nodes, but only on the node mpirun is sitting on.

Thanks,
Sharon
