2010-03-05
马少杰
Dear Sir:
I want to use Open MPI and BLCR for checkpointing. However, I want to restart
from the checkpoint on different hosts. For example, I run an MPI program with
Open MPI on host1 and host2, and I save the checkpoint files to an NFS-shared
path. Then I want to restart the job
(ompi-restart -machinefile ma ompi_global_snapshot_15865.ckpt) on host3 and
host4. All four hosts have the same hardware and software. If I change the
host names in the machinefile to host3 and host4, the following error always
occurs:
[node182:27278] *** Process received signal ***
[node182:27278] Signal: Segmentation fault (11)
[node182:27278] Signal code: Address not mapped (1)
[node182:27278] Failing at address: 0x3b81009530
[node182:27275] *** Process received signal ***
[node182:27275] Signal: Segmentation fault (11)
[node182:27275] Signal code: Address not mapped (1)
[node182:27275] Failing at address: 0x3b81009530
[node182:27274] *** Process received signal ***
[node182:27274] Signal: Segmentation fault (11)
[node182:27274] Signal code: Address not mapped (1)
[node182:27274] Failing at address: 0x3b81009530
[node182:27276] *** Process received signal ***
[node182:27276] Signal: Segmentation fault (11)
[node182:27276] Signal code: Address not mapped (1)
[node182:27276] Failing at address: 0x3b81009530
--------------------------------------------------------------------------
mpirun noticed that process rank 9 with PID 27973 on node node183 exited on
signal 11 (Segmentation fault).
If I change the host names back to host1 and host2, the restart succeeds.
My Open MPI version is 1.3.4, configured with:
./configure --with-ft=cr --enable-mpi-threads --enable-ft-thread
--with-blcr=$dir --with-blcr-libdir=/$dir/lib --prefix=$dir_ompi
--enable-mpirun-prefix-by-default
The command I use to run the MPI program is:
mpirun -np 8 --am ft-enable-cr --mca opal_cr_use_thread 0 -machinefile ma ./cpi
My $HOME/.openmpi/mca-params.conf contains:
crs_base_snapshot_dir=/tmp/cr
snapc_base_global_snapshot_dir=/disk/cr
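
To make the failing step easier to reproduce, here is a rough sketch of the restart sequence described above. The host names and snapshot name are the ones from this message. The machinefile name ma.restart is only a placeholder, and the PID passed to ompi-checkpoint is an assumption read off the snapshot name. The script only writes the machinefile and prints the commands, it does not run them.

```shell
#!/bin/sh
# Sketch only: host3/host4 and the snapshot name come from the message above.
# "ma.restart" is a hypothetical machinefile name, and MPIRUN_PID is assumed
# to match the number embedded in the snapshot name.
set -eu

SNAPSHOT=ompi_global_snapshot_15865.ckpt
MACHINEFILE=ma.restart
MPIRUN_PID=15865

# Machinefile for the restart: one target host per line.
printf '%s\n' host3 host4 > "$MACHINEFILE"

# Checkpoint is requested against the PID of the running mpirun.
CHECKPOINT_CMD="ompi-checkpoint $MPIRUN_PID"

# Restart points at the new machinefile and the global snapshot.
RESTART_CMD="ompi-restart -machinefile $MACHINEFILE $SNAPSHOT"

# Print instead of executing, so this is safe to run without Open MPI.
echo "$CHECKPOINT_CMD"
echo "$RESTART_CMD"
```

Swapping host3 and host4 back to host1 and host2 in the printf line gives the machinefile for which the restart works.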