Hi, all.

I'm in the middle of testing some of the checkpoint/restart capabilities
of OpenMPI with BLCR on our cluster.  I've been able to checkpoint and
restart successfully when I restart on the same nodes the job was
running on previously.  But when I try to restart on a different host,
I always get an error like this:

> $ ompi-restart ompi_global_snapshot_15935.ckpt
> --------------------------------------------------------------------------
> mpirun noticed that process rank 1 with PID 15201 on node m5stage-1-2.local 
> exited on signal 11 (Segmentation fault).
> --------------------------------------------------------------------------


Now, it's very possible that I've missed something during the setup, or
that this is already answered somewhere and I failed to find it while
searching the mailing list, but none of the threads I could find seemed
to apply (e.g., cr_restart *is* installed, etc.).

I'm attaching a tarball that contains the source code of the very simple
test application, as well as some example output of "ompi_info --all"
and "ompi_info -v ompi full --parsable".  I don't know if this will be
useful or not.

This is being tested on CentOS v5.4 with BLCR v0.8.4.  I've seen this
problem with OpenMPI v1.4.2, v1.4.4, and v1.5.4.

If anyone has any ideas on what's going on, or how to best debug this,
I'd love to hear about it.

I don't mind doing the legwork too, but I'm just stumped about where to
go from here.  I have some core files, but I'm having trouble getting
the symbols from the backtrace in gdb.  Maybe I'm doing it wrong.


TIA,

-- 
Lloyd Brown
Systems Administrator
Fulton Supercomputing Lab
Brigham Young University
http://marylou.byu.edu

Attachment: byufsl_debugging_segfault_on_resume.tar.gz
Description: application/gzip
