On Wed, Mar 3, 2010 at 10:24 PM, Fernando Lemos <fernando...@gmail.com> wrote:
<snip>
> Is there anything I can do to provide more information about this bug?
> E.g. try to compile the code in the SVN trunk? I also have kept the
> snapshots intact, I can tar them up and upload them somewhere in case
> you guys need it. I can also provide the source code to the ring
> program, but it's really the canonical ring MPI example.
>

I tried 1.5 (the 1.5a1r22754 nightly snapshot, built with the same
compilation flags). This time, taking the checkpoint didn't generate any
error message:

root@debian1:~# mpirun -am ft-enable-cr -mca btl_tcp_if_include eth1 -np 2 --host debian1,debian2 ring
<snip>
>>> Process 1 sending 2761 to 0
>>> Process 1 received 2760
>>> Process 1 sending 2760 to 0
root@debian1:~#

But restoring it did:

root@debian1:~# ompi-restart ompi_global_snapshot_23071.ckpt
[debian1:23129] Error: Unable to access the path [/root/ompi_global_snapshot_23071.ckpt/0/opal_snapshot_1.ckpt]!
--------------------------------------------------------------------------
Error: The filename (opal_snapshot_1.ckpt) is invalid because either you have not provided a filename
       or provided an invalid filename.
       Please see --help for usage.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpirun has exited due to process rank 1 with PID 23129 on
node debian1 exiting improperly. There are two reasons this could occur:

1. this process did not call "init" before exiting, but others in
the job did. This can cause a job to hang indefinitely while it waits
for all processes to call "init". By rule, if one process calls "init",
then ALL processes must call "init" prior to termination.

2. this process called "init", but exited without calling "finalize".
By rule, all processes that call "init" MUST call "finalize" prior to
exiting or it will be considered an "abnormal termination"

This may have caused other processes in the application to be
terminated by signals sent by mpirun (as reported here).
--------------------------------------------------------------------------
root@debian1:~#

Indeed, opal_snapshot_1.ckpt does not exist on debian1:

root@debian1:~# find ompi_global_snapshot_23071.ckpt/
ompi_global_snapshot_23071.ckpt/
ompi_global_snapshot_23071.ckpt/global_snapshot_meta.data
ompi_global_snapshot_23071.ckpt/restart-appfile
ompi_global_snapshot_23071.ckpt/0
ompi_global_snapshot_23071.ckpt/0/opal_snapshot_0.ckpt
ompi_global_snapshot_23071.ckpt/0/opal_snapshot_0.ckpt/ompi_blcr_context.23073
ompi_global_snapshot_23071.ckpt/0/opal_snapshot_0.ckpt/snapshot_meta.data
root@debian1:~#

It can be found on debian2:

root@debian2:~# find ompi_global_snapshot_23071.ckpt/
ompi_global_snapshot_23071.ckpt/
ompi_global_snapshot_23071.ckpt/0
ompi_global_snapshot_23071.ckpt/0/opal_snapshot_1.ckpt
ompi_global_snapshot_23071.ckpt/0/opal_snapshot_1.ckpt/snapshot_meta.data
ompi_global_snapshot_23071.ckpt/0/opal_snapshot_1.ckpt/ompi_blcr_context.6501
root@debian2:~#

Then I tried supplying a hostfile to ompi-restart, and it worked just
fine! I thought the checkpoint included the host information, though?

So I think the original bug is fixed in 1.5. Should I try the 1.4 branch
in SVN?

Thanks a bunch,