Hello!

I have run into some problems. My environment:

  BLCR:     0.8.4
  Open MPI: 1.5.5
  OS:       Ubuntu 11.04

I have 2 nodes: cuda05 (master, which hosts the NFS file system) and cuda07 (slave, which mounts the master's file system).

I have also set the following in ~/.openmpi/mca-params.conf:

  crs_base_snapshot_dir=/root/kidd_openMPI/Tmp
  snapc_base_global_snapshot_dir=/root/kidd_openMPI/checkpoints

My configure line:

  ./configure --prefix=/root/kidd_openMPI --with-ft=cr --enable-ft-thread --with-blcr=/usr/local/BLCR --with-blcr-libdir=/usr/local/BLCR/lib --enable-mpirun-prefix-by-default --enable-static --enable-shared --enable-opal-multi-threads

Problem 1: ompi-restart on multiple nodes

Command 01:
  mpirun -hostfile Hosts -am ft-enable-cr -x LD_LIBRARY_PATH -np 2 ./TEST

Command 02:
  ompi-restart ompi_global_snapshot_2892.ckpt

  -> I can checkpoint 2 processes running on multiple nodes, but when I restart, they restart only on the master node.

Command 03:
  ompi-restart -hostfile Hosts ompi_global_snapshot_2892.ckpt

  -> This fails with the error message below. I have made sure that BLCR itself works.

################################################################################################
--------------------------------------------------------------------------
root@cuda05:~/kidd_openMPI/checkpoints# ompi-restart -hostfile Hosts ompi_global_snapshot_2892.ckpt/
--------------------------------------------------------------------------
Error: BLCR was not able to restart the process because exec failed.
 Check the installation of BLCR on all of the machines in your
 system. The following information may be of help:
 Return Code : -1
 BLCR Restart Command : cr_restart
 Restart Command Line : cr_restart /root/kidd_openMPI/checkpoints/ompi_global_snapshot_2892.ckpt/0/opal_snapshot_1.ckpt/ompi_blcr_context.2704
--------------------------------------------------------------------------
--------------------------------------------------------------------------
Error: Unable to obtain the proper restart command to restart from the
 checkpoint file (opal_snapshot_1.ckpt). Returned -1.
 Check the installation of the blcr checkpoint/restart service
 on all of the machines in your system.
####################################################################################################

Problem 2: I can't find ompi-migrate. How do I use ompi-migrate?

Please help me, thanks.
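
P.S. Since the error above says that exec of cr_restart failed, one thing that might be worth checking is whether cr_restart is on the PATH of a non-interactive shell on every node. Here is a minimal sketch of such a check. It assumes passwordless ssh to each host and a hostfile with one host per line (host name in the first field). The function names hosts_from_file and check_cr_restart are made up for illustration, and I am not certain this is the actual cause.

```shell
# Print the host names from an Open MPI style hostfile:
# the first field of every line that is not empty and not a comment.
hosts_from_file() {
    awk '$1 != "" && $1 !~ /^#/ { print $1 }' "$1"
}

# For each host in the hostfile, ask a non-interactive remote shell
# whether it can find cr_restart (assumes passwordless ssh works).
check_cr_restart() {
    for host in $(hosts_from_file "$1"); do
        if ssh -o BatchMode=yes "$host" 'command -v cr_restart' </dev/null; then
            echo "$host: cr_restart found"
        else
            echo "$host: cr_restart NOT found in the non-interactive PATH"
        fi
    done
}
```

With these functions loaded in the shell, running `check_cr_restart Hosts` from the directory that contains the hostfile would print one result per node. A "NOT found" line for cuda07 would suggest that /usr/local/BLCR/bin is missing from the PATH there.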