Hi,

First, I'm hoping that setting the subject of this e-mail will get it
attached to the thread that starts with this message:

http://www.open-mpi.org/community/lists/users/2009/12/11608.php

The reason I'm not replying to that thread is that I wasn't subscribed
to the list at the time.


My environment is detailed in another thread, not related at all to this issue:

http://www.open-mpi.org/community/lists/users/2010/03/12199.php


I'm running into the same problem Jean described (though I'm running
1.4.1). Note that taking and restarting from checkpoints works fine
now when I'm using only a single node.

This is what I get when running the job on two nodes; the transcript
also shows the output that appears once the checkpoint is taken:

root@debian1# mpirun -am ft-enable-cr -mca btl_tcp_if_include eth1 -np 2 --host debian1,debian2 ring
<snip>
>>> Process 1 sending 2460 to 0
>>> Process 1 received 2459
>>> Process 1 sending 2459 to 0
[debian1:01817] Error: expected_component: PID information unavailable!
[debian1:01817] Error: expected_component: Component Name information unavailable!
--------------------------------------------------------------------------
mpirun noticed that process rank 0 with PID 1819 on node debian1
exited on signal 0 (Unknown signal 0).
--------------------------------------------------------------------------

Now taking the checkpoint:

root@debian1# ompi-checkpoint --term `ps ax | grep mpirun | grep -v grep | awk '{print $1}'`
Snapshot Ref.:   0 ompi_global_snapshot_1817.ckpt

Restarting from the checkpoint:

root@debian1:~# ompi-restart ompi_global_snapshot_1817.ckpt
[debian1:01832] Error: Unable to access the path
[/root/ompi_global_snapshot_1817.ckpt/0/opal_snapshot_1.ckpt]!
--------------------------------------------------------------------------
Error: The filename (opal_snapshot_1.ckpt) is invalid because either
you have not provided a filename
       or provided an invalid filename.
       Please see --help for usage.

--------------------------------------------------------------------------

After printing that error message, ompi-restart just hangs forever.


Here's something that may or may not matter. debian1 and debian2 are
two virtual machines. They have two network interfaces each:

- eth0: Connected through NAT so that each machine can access the
internet. It gets an address by DHCP (VirtualBox magic), which is
always 10.0.2.15/24 (for both machines). They have no connection to
each other through this interface; they can only reach the outside.

- eth1: Connected to an internal VirtualBox interface. Only debian1
and debian2 are members of that internal network (more VirtualBox
magic). The IPs are statically configured, 192.168.200.1/24 for
debian1, 192.168.200.2/24 for debian2.

Since both machines have an IP in the same subnet on eth0 (in fact the
same IP), Open MPI assumes they can also reach each other through eth0.
That's why I need to specify btl_tcp_if_include eth1; otherwise,
running jobs across the two nodes does not work properly (sends and
receives time out).


Is there anything I can do to provide more information about this bug?
E.g. try building the code from the SVN trunk? I have also kept the
snapshots intact; I can tar them up and upload them somewhere in case
you need them. I can also provide the source code to the ring program,
but it's really the canonical ring MPI example.
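For completeness, here is a minimal sketch of that kind of ring program
(not necessarily byte-for-byte what I'm running): rank 0 injects a
counter, it circulates around the ring, and rank 0 decrements it on
every lap until it reaches zero.

#include <stdio.h>
#include <mpi.h>

int main(int argc, char *argv[])
{
    int rank, size, next, prev, message;
    const int tag = 201;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    next = (rank + 1) % size;           /* neighbour we send to      */
    prev = (rank + size - 1) % size;    /* neighbour we receive from */

    if (rank == 0) {
        /* Rank 0 injects the counter that circulates around the ring. */
        message = 2500;
        printf(">>> Process 0 sending %d to %d\n", message, next);
        MPI_Send(&message, 1, MPI_INT, next, tag, MPI_COMM_WORLD);
    }

    while (1) {
        MPI_Recv(&message, 1, MPI_INT, prev, tag, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);
        printf(">>> Process %d received %d\n", rank, message);

        if (rank == 0) {
            message--;                  /* one full lap completed */
        }

        printf(">>> Process %d sending %d to %d\n", rank, message, next);
        MPI_Send(&message, 1, MPI_INT, next, tag, MPI_COMM_WORLD);

        if (message == 0) {
            break;                      /* everyone stops once the zero goes by */
        }
    }

    /* Rank 0 drains the final zero coming back around the ring. */
    if (rank == 0) {
        MPI_Recv(&message, 1, MPI_INT, prev, tag, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);
    }

    MPI_Finalize();
    return 0;
}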

As usual, any info you might need, just ask and I'll provide.


Thanks in advance,
