Hi all,

I ran into the same problem as Jitsumoto: OpenMPI 1.4.2 failed to restart from a checkpoint, and the patch Fernando gave didn't work. I also tried the 1.5 nightly snapshots, but they didn't seem to work either. For my purposes I would rather not use --enable-ft-thread in configure, but the same error occurs even when --enable-ft-thread is used. Here is my configure line for OMPI 1.5a1r23135:
> ./configure \
>   --with-ft=cr \
>   --enable-mpi-threads \
>   --with-blcr=/home/nguyen/opt/blcr --with-blcr-libdir=/home/nguyen/opt/blcr/lib \
>   --prefix=/home/nguyen/opt/openmpi_1.5 --enable-mpirun-prefix-by-default

and the errors:

> $ mpirun -am ft-enable-cr -machinefile ./host ./a.out
> 0
> 0
> 1
> 1
> 2
> 2
> 3
> 3
> --------------------------------------------------------------------------
> mpirun has exited due to process rank 1 with PID 6582 on
> node rc014 exiting improperly. There are two reasons this could occur:
>
> 1. this process did not call "init" before exiting, but others in
> the job did. This can cause a job to hang indefinitely while it waits
> for all processes to call "init". By rule, if one process calls "init",
> then ALL processes must call "init" prior to termination.
>
> 2. this process called "init", but exited without calling "finalize".
> By rule, all processes that call "init" MUST call "finalize" prior to
> exiting or it will be considered an "abnormal termination"
>
> This may have caused other processes in the application to be
> terminated by signals sent by mpirun (as reported here).
> --------------------------------------------------------------------------

And here is the checkpoint command:

> $ ompi-checkpoint -s -v --term 10982
> [rc013.local:11001] [ 0.00 / 0.14] Requested - ...
> [rc013.local:11001] [ 0.00 / 0.14] Pending - ...
> [rc013.local:11001] [ 0.01 / 0.15] Running - ...
> [rc013.local:11001] [ 7.79 / 7.94] Finished - ompi_global_snapshot_10982.ckpt
> Snapshot Ref.: 0 ompi_global_snapshot_10982.ckpt

I also looked inside the checkpoint files and found that the snapshot was indeed taken:

~/tmp/ckpt/ompi_global_snapshot_10982.ckpt/0/opal_snapshot_1.ckpt/ompi_blcr_context.6582

But restarting failed as follows:

> $ ompi-restart ompi_global_snapshot_10982.ckpt
> --------------------------------------------------------------------------
> mpirun noticed that process rank 1 with PID 11346 on node rc013.local exited
> on signal 11 (Segmentation fault).
> --------------------------------------------------------------------------

Does anyone have an idea what is going wrong here? Thank you!
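In case it helps to reproduce this, ./a.out is essentially just each rank printing a loop counter and then sleeping, so there is time to run ompi-checkpoint from another terminal. A minimal sketch along those lines is below; the loop count and sleep interval are only illustrative, and the actual program may differ in details:

#include <mpi.h>
#include <stdio.h>
#include <unistd.h>

/* Minimal checkpoint/restart test: every rank prints an iteration
 * counter and sleeps, leaving time to run ompi-checkpoint --term
 * against the mpirun PID from another shell. Loop count and sleep
 * interval are arbitrary. */
int main(int argc, char **argv)
{
    int rank, i;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    for (i = 0; i < 100; i++) {
        printf("%d\n", i);   /* plain counter, like the 0/1/2/3 lines above */
        fflush(stdout);
        sleep(5);            /* pause between iterations */
    }

    MPI_Finalize();
    return 0;
}

Compiled with mpicc and launched under "mpirun -am ft-enable-cr" as above, this should exercise the same checkpoint/restart path.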
Regards,
Nguyen Toan

On Mon, May 24, 2010 at 4:08 PM, Hideyuki Jitsumoto <jitum...@gsic.titech.ac.jp> wrote:
> ---------- Forwarded message ----------
> From: Fernando Lemos <fernando...@gmail.com>
> Date: Thu, Apr 15, 2010 at 2:18 AM
> Subject: Re: [OMPI users] OpenMPI Checkpoint/Restart is failed
> To: Open MPI Users <us...@open-mpi.org>
>
> On Wed, Apr 14, 2010 at 5:25 AM, Hideyuki Jitsumoto <hjitsum...@gmail.com> wrote:
> > Fernando,
> >
> > Thank you for your reply.
> > I tried to patch the file you mentioned, but the output did not change.
>
> I didn't test the patch, tbh. I'm using 1.5 nightly snapshots, and it
> works great.
>
> >> Are you using a shared file system? You need to use a shared file
> >> system for checkpointing with 1.4.1:
> >
> > What is a shared file system? Do you mean NFS, Lustre and so on?
> > (I'm sorry about my ignorance...)
>
> Something like NFS, yea.
>
> > If I use only one node for the application, do I need such a shared file system?
>
> No, for a single node, checkpointing with 1.4.1 should work (it works
> for me, at least). If you're using a single node, then your problem is
> probably not related to the bug report I posted.
>
> > Regards,
>
> _______________________________________________
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users
>
> --
> Sincerely Yours,
> Hideyuki Jitsumoto (jitum...@gsic.titech.ac.jp)
> Tokyo Institute of Technology
> Global Scientific Information and Computing center (Matsuoka Lab.)