Re: [OMPI users] checkpointing multi node and multi process applications

Fernando Lemos Thu, 4 Mar 2010 08:17:50 -0500

On Wed, Mar 3, 2010 at 10:24 PM, Fernando Lemos <fernando...@gmail.com> wrote:
<snip>
> Is there anything I can do to provide more information about this bug?
> E.g. try to compile the code in the SVN trunk? I also have kept the
> snapshots intact, I can tar them up and upload them somewhere in case
> you guys need it. I can also provide the source code to the ring
> program, but it's really the canonical ring MPI example.
>


I tried 1.5 (1.5a1r22754 nightly snapshot, same compilation flags).
This time taking the checkpoint didn't generate any error message:

root@debian1:~# mpirun -am ft-enable-cr -mca btl_tcp_if_include eth1 -np 2 --host debian1,debian2 ring
<snip>
>>> Process 1 sending 2761 to 0
>>> Process 1 received 2760
>>> Process 1 sending 2760 to 0
root@debian1:~#

But restoring it did:

root@debian1:~# ompi-restart ompi_global_snapshot_23071.ckpt
[debian1:23129] Error: Unable to access the path
[/root/ompi_global_snapshot_23071.ckpt/0/opal_snapshot_1.ckpt]!
--------------------------------------------------------------------------
Error: The filename (opal_snapshot_1.ckpt) is invalid because either
you have not provided a filename
       or provided an invalid filename.
       Please see --help for usage.

--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpirun has exited due to process rank 1 with PID 23129 on
node debian1 exiting improperly. There are two reasons this could occur:

1. this process did not call "init" before exiting, but others in
the job did. This can cause a job to hang indefinitely while it waits
for all processes to call "init". By rule, if one process calls "init",
then ALL processes must call "init" prior to termination.

2. this process called "init", but exited without calling "finalize".
By rule, all processes that call "init" MUST call "finalize" prior to
exiting or it will be considered an "abnormal termination"

This may have caused other processes in the application to be
terminated by signals sent by mpirun (as reported here).
--------------------------------------------------------------------------
root@debian1:~#

Indeed, opal_snapshot_1.ckpt does not exist on debian1:

root@debian1:~# find ompi_global_snapshot_23071.ckpt/
ompi_global_snapshot_23071.ckpt/
ompi_global_snapshot_23071.ckpt/global_snapshot_meta.data
ompi_global_snapshot_23071.ckpt/restart-appfile
ompi_global_snapshot_23071.ckpt/0
ompi_global_snapshot_23071.ckpt/0/opal_snapshot_0.ckpt
ompi_global_snapshot_23071.ckpt/0/opal_snapshot_0.ckpt/ompi_blcr_context.23073
ompi_global_snapshot_23071.ckpt/0/opal_snapshot_0.ckpt/snapshot_meta.data
root@debian1:~#

It can be found on debian2 instead:

root@debian2:~# find ompi_global_snapshot_23071.ckpt/
ompi_global_snapshot_23071.ckpt/
ompi_global_snapshot_23071.ckpt/0
ompi_global_snapshot_23071.ckpt/0/opal_snapshot_1.ckpt
ompi_global_snapshot_23071.ckpt/0/opal_snapshot_1.ckpt/snapshot_meta.data
ompi_global_snapshot_23071.ckpt/0/opal_snapshot_1.ckpt/ompi_blcr_context.6501
root@debian2:~#
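
To make the same check repeatable on each node, here is a minimal Python sketch. It is not part of Open MPI: the function name missing_local_snapshots is made up for illustration, and the layout <global snapshot>/<sequence>/opal_snapshot_<rank>.ckpt is only inferred from the two listings above, with the sequence number assumed to be 0.

```python
import os
import sys


def missing_local_snapshots(global_dir, nprocs, seq=0):
    """Return the ranks whose local snapshot directory is absent on this node.

    Assumed layout (inferred from the `find` output in this thread):
        <global_dir>/<seq>/opal_snapshot_<rank>.ckpt
    """
    seq_dir = os.path.join(global_dir, str(seq))
    missing = []
    for rank in range(nprocs):
        local = os.path.join(seq_dir, "opal_snapshot_%d.ckpt" % rank)
        # A rank counts as present only if its snapshot directory exists here.
        if not os.path.isdir(local):
            missing.append(rank)
    return missing


if __name__ == "__main__":
    # Usage: python check_snapshots.py <global snapshot dir> <number of ranks>
    ranks = missing_local_snapshots(sys.argv[1], int(sys.argv[2]))
    print("missing ranks on this node:", ranks)
```

Run against the snapshot above with two ranks, this should report rank 1 as missing on debian1 and rank 0 as missing on debian2, which matches the listings.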

Then I tried supplying a hostfile to ompi-restart and it worked just
fine! I thought the checkpoint included the host information?

So I think it's fixed in 1.5. Should I try the 1.4 branch in SVN?


Thanks a bunch,
