Often this type of problem is due to the 'prelink' option in Linux. BLCR has a FAQ item that discusses this issue and how to resolve it: https://upc-bugs.lbl.gov/blcr/doc/html/FAQ.html#prelink
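As a quick first check, something like the sketch below reports whether prelinking is turned on for a node. It is only an illustration under some assumptions: the `prelink_enabled` helper and the `PRELINK_CONF` variable are names made up for this example, and `/etc/sysconfig/prelink` with a `PRELINKING=yes|no` setting is the layout I would expect on CentOS, so adjust the path for other distributions.

```shell
#!/bin/sh
# Hypothetical helper: succeeds if the given prelink config file enables prelinking.
# Assumption: the file uses the CentOS/RHEL-style "PRELINKING=yes|no" setting.
prelink_enabled() {
    grep -Eq '^[[:space:]]*PRELINKING=yes' "$1" 2>/dev/null
}

# Assumed default location on CentOS; override with PRELINK_CONF if yours differs.
conf=${PRELINK_CONF:-/etc/sysconfig/prelink}

if prelink_enabled "$conf"; then
    echo "prelink is enabled in $conf"
else
    echo "prelink is not enabled in $conf (or the file is missing)"
fi
```

This would need to be run on every node involved in the checkpoint and the restart. If prelinking turns out to be enabled, the FAQ item above is the authoritative reference for turning it off; on CentOS that typically means setting `PRELINKING=no` in that file and then running `/etc/cron.daily/prelink` as root so the existing prelinking is undone.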
I would give that a try. If that does not help, then you might want to try checkpointing a single (non-MPI) process on one node with BLCR and restarting it on the other node. If that fails, then the cause is likely a BLCR/system configuration issue. If it does work, then we can dig more into the Open MPI causes.

Let me know if disabling prelink works for you.

-- Josh

On Thu, Dec 29, 2011 at 1:19 PM, Lloyd Brown <lloyd_br...@byu.edu> wrote:
> Hi, all.
>
> I'm in the middle of testing some of the checkpoint/restart capabilities
> of OpenMPI with BLCR on our cluster. I've been able to checkpoint and
> restart successfully when I restart on the same nodes as it was running
> previously. But when I try to restart on a different host, I always get
> an error like this:
>
>> $ ompi-restart ompi_global_snapshot_15935.ckpt
>> --------------------------------------------------------------------------
>> mpirun noticed that process rank 1 with PID 15201 on node m5stage-1-2.local
>> exited on signal 11 (Segmentation fault).
>> --------------------------------------------------------------------------
>
>
> Now, it's very possible that I've missed something during the setup, or
> that despite my failure to find it while searching the mailing list,
> that this is already answered somewhere, but none of the threads I could
> find seemed to apply (eg. cr_restart *is* installed, etc.).
>
> I'm attaching a tarball that contains the source code of the very-simple
> test application, as well as some example output of "ompi_info --all"
> and "ompi_info -v ompi full --parsable". I don't know if this will be
> useful or not.
>
> This is being tested on CentOS v5.4 with BLCR v0.8.4. I've seen this
> problem with OpenMPI v1.4.2, v1.4.4, and v1.5.4.
>
> If anyone has any ideas on what's going on, or how to best debug this,
> I'd love to hear about it.
>
> I don't mind doing the legwork too, but I'm just stumped where to go
> from here. I have some core files, but I'm having trouble getting the
> symbols from the backtrace in gdb. Maybe I'm doing it wrong.
>
>
> TIA,
>
> --
> Lloyd Brown
> Systems Administrator
> Fulton Supercomputing Lab
> Brigham Young University
> http://marylou.byu.edu
>
> _______________________________________________
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users

--
Joshua Hursey
Postdoctoral Research Associate
Oak Ridge National Laboratory
http://users.nccs.gov/~jjhursey