(Sorry for the delay. I have been traveling and am just now getting caught up on email.)

It looks like the checkpoint is corrupted. This can be caused by a number of things. Usually the cause is memory corruption in the application, which then corrupts the generated checkpoint as well. Are you able to get a stack trace from the core dump resulting from the segfault on restart?
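
In case it helps, here is a rough sketch of pulling a backtrace out of a core file with gdb in batch mode. The helper name core_backtrace is made up for illustration, and the opal-restart path and core file name in the usage comment are assumptions about your setup (the core may be written elsewhere or under a different name, depending on how the system is configured).

```shell
# Sketch only: print a backtrace from a core file using gdb.
# Assumes gdb is installed. core_backtrace is a hypothetical helper name.
core_backtrace() {
    # $1 = path to the binary that crashed, $2 = path to the core file
    if [ ! -r "$2" ]; then
        echo "no readable core file at '$2' (was 'ulimit -c unlimited' set?)" >&2
        return 1
    fi
    # -batch: exit once the commands have run; -ex bt: print the backtrace
    gdb -batch -ex bt "$1" "$2"
}

# Usage (paths below are assumptions, adjust to your install):
#   ulimit -c unlimited          # allow core dumps in this shell
#   ompi-restart ompi_global_snapshot_22390.ckpt
#   core_backtrace /usr/local/openmpi/bin/opal-restart core
```

If Open MPI and BLCR were built with debugging symbols, the backtrace should show more than the bare addresses in the error output.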

What do you mean by the checkpoint "hangs forever just before ending"? Do you have to CTRL-C the application, or is the checkpoint just taking a long time to finish?

-- Josh

On Jun 15, 2009, at 11:30 AM, Kritiraj Sajadah wrote:


Dear All,
I have installed BLCR 0.8.1 and Open MPI 1.3 on a Linux platform. However, when I try checkpointing an application, it hangs forever just before ending.

A checkpoint file is generated. However, when I try restarting it, I get the following error:

raj@sun06:~$ ompi-restart ompi_global_snapshot_22390.ckpt
[sun06:22423] *** Process received signal ***
[sun06:22423] Signal: Segmentation fault (11)
[sun06:22423] Signal code: Address not mapped (1)
[sun06:22423] Failing at address: (nil)
[sun06:22423] [ 0] [0xb7fb640c]
[sun06:22423] [ 1] /usr/local/openmpi/lib/libopen-pal.so.0(opal_crs_blcr_restart+0x103) [0xb7f76925]
[sun06:22423] [ 2] opal-restart [0x8049435]
[sun06:22423] [ 3] /lib/libc.so.6(__libc_start_main+0xe5) [0xb7d9a455]
[sun06:22423] [ 4] opal-restart [0x8049001]
[sun06:22423] *** End of error message ***
--------------------------------------------------------------------------
mpirun noticed that process rank 0 with PID 22423 on node sun06 exited on signal 11 (Segmentation fault).
--------------------------------------------------------------------------

Any help would be much appreciated.

Kind regards,

Raj



_______________________________________________
users mailing list
us...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/users
