Checkpointing and restarting Open MPI applications does not work for me. I have a Red Hat 5U6 system with BLCR version 0.8.4 and Open MPI version 1.6.3.
I have a simple parallel application that I want to checkpoint and restart. I can see that the BLCR modules are loaded (checked with lsmod).

I run:

    mpirun -np 1 -hostfile hostfile -am ft-enable-cr EXECUTABLE
    ompi-checkpoint -v -s <PID of mpirun>

Then I kill mpirun and run:

    ompi-restart -v ompi_global_snapshot_<PID>.ckpt

Here are my results:

    Error: Unable to obtain the proper restart command to restart from the
           checkpoint file (opal_snapshot_0.ckpt). Returned -1.
           Check the installation of the none checkpoint/restart service
           on all of the machines in your system.

If I use the BLCR utilities directly (cr_run, cr_checkpoint, cr_restart), it works on the local machine, but not on more than one machine.

Please help me with this.

Thank you.

With Blessings, always,

Jerry Mersel
System Administrator
IT Infrastructure Branch | Division of Information Systems
Weizmann Institute of Science
Rehovot 76100, Israel
Tel: +972-8-9342363