Hi, all. I'm in the middle of testing some of the checkpoint/restart capabilities of OpenMPI with BLCR on our cluster. I've been able to checkpoint and restart successfully when I restart on the same nodes the job was running on previously. But when I try to restart on a different host, I always get an error like this:
> $ ompi-restart ompi_global_snapshot_15935.ckpt
> --------------------------------------------------------------------------
> mpirun noticed that process rank 1 with PID 15201 on node m5stage-1-2.local
> exited on signal 11 (Segmentation fault).
> --------------------------------------------------------------------------

Now, it's very possible that I've missed something during the setup, or that this is already answered somewhere and I simply failed to find it while searching the mailing list, but none of the threads I could find seemed to apply (e.g. cr_restart *is* installed, etc.).

I'm attaching a tarball that contains the source code of the very simple test application, as well as some example output of "ompi_info --all" and "ompi_info -v ompi full --parsable". I don't know if this will be useful or not.

This is being tested on CentOS v5.4 with BLCR v0.8.4. I've seen this problem with OpenMPI v1.4.2, v1.4.4, and v1.5.4.

If anyone has any ideas on what's going on, or how best to debug this, I'd love to hear about it. I don't mind doing the legwork too, but I'm stumped about where to go from here. I have some core files, but I'm having trouble getting the symbols from the backtrace in gdb. Maybe I'm doing it wrong.

TIA,

--
Lloyd Brown
Systems Administrator
Fulton Supercomputing Lab
Brigham Young University
http://marylou.byu.edu
byufsl_debugging_segfault_on_resume.tar.gz
Description: application/gzip