Josh,

When I use cr_{run,checkpoint,restart} to start and checkpoint a single-threaded, single-process app and then restart it on a different host, it works, even with prelinking enabled. That's kinda why I assumed the problem was with the Open MPI code, and didn't look at the BLCR FAQ that closely, to be honest.

Having said that, I did temporarily disable prelink on my two hosts, and tried my MPI test again, and it seemed to work. I'll have to do more tests with something more intense (xhpl, maybe), and so on, but preliminary results look good.

Thanks for pointing me in the right direction.

Lloyd Brown
Systems Administrator
Fulton Supercomputing Lab
Brigham Young University
http://marylou.byu.edu

On 12/29/2011 02:31 PM, Josh Hursey wrote:
> Often this type of problem is due to the 'prelink' option in Linux.
> BLCR has a FAQ item that discusses this issue and how to resolve it:
> https://upc-bugs.lbl.gov/blcr/doc/html/FAQ.html#prelink
>
> I would give that a try. If that does not help then you might want to
> try checkpointing a single (non-MPI) process on one node with BLCR and
> restart it on the other node. If that fails, then it is likely a
> BLCR/system configuration issue that is the cause. If it does work,
> then we can dig more into the Open MPI causes.
>
> Let me know if disabling prelink works for you.
>
> -- Josh
>