Hi,

First, I'm hoping that setting the subject of this e-mail will get it attached to the thread that starts with this e-mail:

http://www.open-mpi.org/community/lists/users/2009/12/11608.php

The reason I'm not replying to that thread directly is that I wasn't subscribed to the list at the time.

My environment is detailed in another thread, which is not related to this issue at all:

http://www.open-mpi.org/community/lists/users/2010/03/12199.php

I'm running into the same problem Jean described (though I'm running 1.4.1). Note that taking and restarting from checkpoints now works fine when I'm using only a single node.

This is what I get when running the job on two nodes, including the output that appears after the checkpoint is taken:

root@debian1# mpirun -am ft-enable-cr -mca btl_tcp_if_include eth1 -np 2 --host debian1,debian2 ring
<snip>
>>> Process 1 sending 2460 to 0
>>> Process 1 received 2459
>>> Process 1 sending 2459 to 0
[debian1:01817] Error: expected_component: PID information unavailable!
[debian1:01817] Error: expected_component: Component Name information unavailable!
--------------------------------------------------------------------------
mpirun noticed that process rank 0 with PID 1819 on node debian1 exited on signal 0 (Unknown signal 0).
--------------------------------------------------------------------------

Now taking the checkpoint:

root@debian1# ompi-checkpoint --term `ps ax | grep mpirun | grep -v grep | awk '{print $1}'`
Snapshot Ref.: 0 ompi_global_snapshot_1817.ckpt

Restarting from the checkpoint:

root@debian1:~# ompi-restart ompi_global_snapshot_1817.ckpt
[debian1:01832] Error: Unable to access the path [/root/ompi_global_snapshot_1817.ckpt/0/opal_snapshot_1.ckpt]!
--------------------------------------------------------------------------
Error: The filename (opal_snapshot_1.ckpt) is invalid because either you have not provided a filename or provided an invalid filename. Please see --help for usage.
--------------------------------------------------------------------------

After spitting out that error message, ompi-restart just hangs forever.

Here's something that may or may not matter. debian1 and debian2 are two virtual machines. They have two network interfaces each:

- eth0: Connected through NAT so that the machine can access the internet. It gets an address by DHCP (VirtualBox magic), which is always 10.0.2.15/24 (for both machines). The machines have no connection to each other through this interface; they can only reach the outside.

- eth1: Connected to an internal VirtualBox interface. Only debian1 and debian2 are members of that internal network (more VirtualBox magic). The IPs are statically configured: 192.168.200.1/24 for debian1, 192.168.200.2/24 for debian2.

Since both machines have an IP in the same subnet on eth0 (actually the same IP), OpenMPI thinks they're also connected to each other through eth0. That's why I need to specify btl_tcp_if_include eth1; otherwise, running jobs across the two nodes does not work properly (sends and recvs time out).

Is there anything I can do to provide more information about this bug? E.g. try to compile the code in the SVN trunk? I have also kept the snapshots intact, so I can tar them up and upload them somewhere in case you guys need them. I can also provide the source code of the ring program, but it's really just the canonical ring MPI example.

As usual, if there's any info you need, just ask and I'll provide it.

Thanks in advance,