Hi all,

I ran into the same problem as Jitsumoto: OpenMPI 1.4.2 failed to restart from a checkpoint, and the patch Fernando gave didn't work. I also tried the 1.5 nightly snapshots, but they didn't seem to work either. For my purposes I would rather not use --enable-ft-thread in configure, but the same error occurs even when --enable-ft-thread is used. Here is my configure line for OMPI 1.5a1r23135:
> ./configure \
>   --with-ft=cr \
>   --enable-mpi-threads \
>   --with-blcr=/home/nguyen/opt/blcr --with-blcr-libdir=/home/nguyen/opt/blcr/lib \
>   --prefix=/home/nguyen/opt/openmpi_1.5 --enable-mpirun-prefix-by-default

and the errors:

> $ mpirun -am ft-enable-cr -machinefile ./host ./a.out
> 0
> 0
> 1
> 1
> 2
> 2
> 3
> 3
> --------------------------------------------------------------------------
> mpirun has exited due to process rank 1 with PID 6582 on
> node rc014 exiting improperly. There are two reasons this could occur:
>
> 1. this process did not call "init" before exiting, but others in
> the job did. This can cause a job to hang indefinitely while it waits
> for all processes to call "init". By rule, if one process calls "init",
> then ALL processes must call "init" prior to termination.
>
> 2. this process called "init", but exited without calling "finalize".
> By rule, all processes that call "init" MUST call "finalize" prior to
> exiting or it will be considered an "abnormal termination"
>
> This may have caused other processes in the application to be
> terminated by signals sent by mpirun (as reported here).
> --------------------------------------------------------------------------

And here is the checkpoint command:

> $ ompi-checkpoint -s -v --term 10982
> [rc013.local:11001] [ 0.00 / 0.14] Requested - ...
> [rc013.local:11001] [ 0.00 / 0.14] Pending - ...
> [rc013.local:11001] [ 0.01 / 0.15] Running - ...
> [rc013.local:11001] [ 7.79 / 7.94] Finished - ompi_global_snapshot_10982.ckpt
> Snapshot Ref.: 0 ompi_global_snapshot_10982.ckpt

I also looked inside the checkpoint files and found that the snapshot was indeed taken:

~/tmp/ckpt/ompi_global_snapshot_10982.ckpt/0/opal_snapshot_1.ckpt/ompi_blcr_context.6582

But restarting failed as follows:

> $ ompi-restart ompi_global_snapshot_10982.ckpt
> --------------------------------------------------------------------------
> mpirun noticed that process rank 1 with PID 11346 on node rc013.local exited
> on signal 11 (Segmentation fault).
> --------------------------------------------------------------------------

Does anyone have an idea what is going wrong here? Thank you!
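In case it helps to reproduce this, ./a.out is essentially just each rank printing a loop counter and then sleeping, so there is time to run ompi-checkpoint from another terminal. A minimal sketch along those lines is below; the loop count and sleep interval are only illustrative, and the actual program may differ in details:

#include <mpi.h>
#include <stdio.h>
#include <unistd.h>

/* Minimal checkpoint/restart test: every rank prints an iteration
 * counter and sleeps, leaving time to run ompi-checkpoint --term
 * against the mpirun PID from another shell. Loop count and sleep
 * interval are arbitrary. */
int main(int argc, char **argv)
{
    int rank, i;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    for (i = 0; i < 100; i++) {
        printf("%d\n", i);   /* plain counter, like the 0/1/2/3 lines above */
        fflush(stdout);
        sleep(5);            /* pause between iterations */
    }

    MPI_Finalize();
    return 0;
}

Compiled with mpicc and launched under "mpirun -am ft-enable-cr" as above, this should exercise the same checkpoint/restart path.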
Regards,
Nguyen Toan

On Mon, May 24, 2010 at 4:08 PM, Hideyuki Jitsumoto <jitum...@gsic.titech.ac.jp> wrote:
> ---------- Forwarded message ----------
> From: Fernando Lemos <fernando...@gmail.com>
> Date: Thu, Apr 15, 2010 at 2:18 AM
> Subject: Re: [OMPI users] OpenMPI Checkpoint/Restart is failed
> To: Open MPI Users <us...@open-mpi.org>
>
> On Wed, Apr 14, 2010 at 5:25 AM, Hideyuki Jitsumoto <hjitsum...@gmail.com> wrote:
> > Fernando,
> >
> > Thank you for your reply.
> > I tried to patch the file you mentioned, but the output did not change.
>
> I didn't test the patch, tbh. I'm using 1.5 nightly snapshots, and it
> works great.
>
> >> Are you using a shared file system? You need to use a shared file
> >> system for checkpointing with 1.4.1:
> >
> > What is a shared file system? Do you mean NFS, Lustre and so on?
> > (I'm sorry about my ignorance...)
>
> Something like NFS, yea.
>
> > If I use only one node for the application, do I need such a shared file system?
>
> No, for a single node, checkpointing with 1.4.1 should work (it works
> for me, at least). If you're using a single node, then your problem is
> probably not related to the bug report I posted.
>
> > Regards,
>
> _______________________________________________
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users
>
> --
> Sincerely Yours,
> Hideyuki Jitsumoto (jitum...@gsic.titech.ac.jp)
> Tokyo Institute of Technology
> Global Scientific Information and Computing center (Matsuoka Lab.)