2010-03-05 



马少杰 



Dear Sir:
   I want to use Open MPI with BLCR for checkpointing. However, I want to
restart the checkpoint on other hosts. For example, I run an MPI program
using Open MPI on host1 and host2, and I save the checkpoint files on an
NFS-shared path. Then I want to restart the job (ompi-restart -machinefile ma
ompi_global_snapshot_15865.ckpt) on host3 and host4. All four hosts have the
same hardware and software. If I change the hostnames in the machinefile to
host3 and host4, the following error always occurs:
 [node182:27278] *** Process received signal ***
[node182:27278] Signal: Segmentation fault (11)
[node182:27278] Signal code: Address not mapped (1)
[node182:27278] Failing at address: 0x3b81009530
[node182:27275] *** Process received signal ***
[node182:27275] Signal: Segmentation fault (11)
[node182:27275] Signal code: Address not mapped (1)
[node182:27275] Failing at address: 0x3b81009530
[node182:27274] *** Process received signal ***
[node182:27274] Signal: Segmentation fault (11)
[node182:27274] Signal code: Address not mapped (1)
[node182:27274] Failing at address: 0x3b81009530
[node182:27276] *** Process received signal ***
[node182:27276] Signal: Segmentation fault (11)
[node182:27276] Signal code: Address not mapped (1)
[node182:27276] Failing at address: 0x3b81009530
--------------------------------------------------------------------------
mpirun noticed that process rank 9 with PID 27973 on node node183 exited on 
signal 11 (Segmentation fault).

  If I change the hostnames back to host1 and host2, the restart succeeds.

 My Open MPI version is 1.3.4, configured with:
 ./configure  --with-ft=cr --enable-mpi-threads --enable-ft-thread 
--with-blcr=$dir --with-blcr-libdir=/$dir/lib --prefix=$dir_ompi 
--enable-mpirun-prefix-by-default

 The command used to run the MPI program is:
mpirun -np 8 --am ft-enable-cr --mca opal_cr_use_thread 0  -machinefile ma ./cpi

My $HOME/.openmpi/mca-params.conf contains:
crs_base_snapshot_dir=/tmp/cr
snapc_base_global_snapshot_dir=/disk/cr
