(Sorry for the delay. I have been traveling and am only now getting
caught up on email.)
It looks like the checkpoint is corrupted. This can have a number of
causes. The most common is memory corruption in the application,
which then carries over into the generated checkpoint. Are you able
to get a stack trace from the core dump produced by the segfault on
restart?
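If it helps, here is a rough sketch of one way to pull a backtrace out
of a core file with gdb in batch mode. It assumes gdb is installed and
that core dumps are enabled in the shell where the restart is run. The
helper name core_backtrace is made up for illustration, and the
opal-restart path and core file name in the usage comment are guesses
that will likely differ on your system.

```shell
# Hypothetical helper: print a backtrace from a core file using gdb.
# Usage: core_backtrace <executable> <core-file>
core_backtrace() {
    if [ "$#" -ne 2 ]; then
        echo "usage: core_backtrace <executable> <core-file>" >&2
        return 2
    fi
    # -batch exits after running the commands; -ex "bt" prints the stack.
    gdb -batch -ex "bt" "$1" "$2"
}

# Example session (path and core file name are assumptions):
#   ulimit -c unlimited      # allow core files to be written
#   ompi-restart ompi_global_snapshot_22390.ckpt
#   core_backtrace /usr/local/openmpi/bin/opal-restart core
```

The backtrace will be more readable if Open MPI and the application
were built with debugging symbols.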
What do you mean by the checkpoint "hangs forever just before ending"?
Do you have to CTRL-C the application, or is the checkpoint just
taking a long time to finish?
-- Josh
On Jun 15, 2009, at 11:30 AM, Kritiraj Sajadah wrote:
Dear All,
I have installed BLCR 0.8.1 and Open MPI 1.3 on a Linux
platform. However, when I try checkpointing an application, it
hangs forever just before ending.
A checkpoint file is generated. However, when I try restarting it, I
get the following error:
raj@sun06:~$ ompi-restart ompi_global_snapshot_22390.ckpt
[sun06:22423] *** Process received signal ***
[sun06:22423] Signal: Segmentation fault (11)
[sun06:22423] Signal code: Address not mapped (1)
[sun06:22423] Failing at address: (nil)
[sun06:22423] [ 0] [0xb7fb640c]
[sun06:22423] [ 1] /usr/local/openmpi/lib/libopen-pal.so.0(opal_crs_blcr_restart+0x103) [0xb7f76925]
[sun06:22423] [ 2] opal-restart [0x8049435]
[sun06:22423] [ 3] /lib/libc.so.6(__libc_start_main+0xe5) [0xb7d9a455]
[sun06:22423] [ 4] opal-restart [0x8049001]
[sun06:22423] *** End of error message ***
--------------------------------------------------------------------------
mpirun noticed that process rank 0 with PID 22423 on node sun06
exited on signal 11 (Segmentation fault).
--------------------------------------------------------------------------
Any help would be much appreciated.
Kind regards,
Raj
_______________________________________________
users mailing list
us...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/users