Hi,

On 08.03.2012 at 19:02, Linton, Tom wrote:

> We have a legacy application that runs fine on our cluster using Intel MPI. 
> We ported it to OpenMPI so that we could use BLCR and it runs fine but 
> checkpointing is not working properly:
>  
> 1. when we checkpoint with more than 1 core, it fails with the error:
>    mpirun noticed that process rank 1 with PID 8260 on node 
>    tscco28017 exited on signal 4 (Illegal instruction).

Were the application and Open MPI compiled on one and the same machine, and is 
the CPU type the same across all involved nodes?
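
Signal 4 (SIGILL) often shows up when a binary built with CPU-specific 
instructions runs on a processor that lacks them, so comparing the CPU feature 
flags of the nodes is a cheap first check. The following is only a rough 
sketch, assuming x86 Linux nodes whose /proc/cpuinfo has a "flags" line; the 
function names cpu_flags and cpu_flags_diff are made up for illustration and 
are not part of Open MPI or BLCR.

```shell
#!/bin/sh
# Sketch only: assumes a Linux /proc/cpuinfo-style file with a "flags" line
# (x86). cpu_flags and cpu_flags_diff are hypothetical helper names.

# Print the feature flags of the first processor in the given cpuinfo dump,
# one flag per line, sorted and de-duplicated.
cpu_flags() {
    grep -m1 '^flags' "$1" | cut -d: -f2 | tr -s ' ' '\n' | sed '/^$/d' \
        | LC_ALL=C sort -u
}

# Print the flags that appear in only one of two cpuinfo dumps.
# "< flag" means only in the first file, "> flag" only in the second.
# No output means both dumps list the same flags.
cpu_flags_diff() {
    a=$(mktemp) || return 2
    b=$(mktemp) || { rm -f "$a"; return 2; }
    cpu_flags "$1" > "$a"
    cpu_flags "$2" > "$b"
    LC_ALL=C comm -23 "$a" "$b" | sed 's/^/< /'
    LC_ALL=C comm -13 "$a" "$b" | sed 's/^/> /'
    rm -f "$a" "$b"
}
```

To use it, save a dump from each host in the machinefile first, for example 
with "ssh tscco28017 cat /proc/cpuinfo > tscco28017.cpuinfo", and then run 
cpu_flags_diff on two of the saved files. Any flag listed for only one node 
(sse4_2, avx, ...) is a candidate for the illegal instruction.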

-- Reuti


> 2. checkpointing with 1 core works
> 3. we have a simple test program that exercises MPI with multiple cores and 
> it checkpoints fine on multiple cores
>  
> Can you suggest how to debug this problem?
>  
> Some additional information:
>  
> 1. I execute the program like this: mpirun -am ft-enable-cr -n 2 -machinefile 
> machines program inputfile
> 2. when I checkpoint it, I see that the checkpoint directories are created 
> but the file “global_snapshot_meta.data” is not complete, there is no 
> restart-appfile, the “snapshot_meta.data” files are not complete, and there 
> are no dump files for the individual processes.
> 3. the command “ompi-checkpoint” doesn’t return; I have to press Ctrl-C to 
> kill it after checkpointing.
> 4. We are using Open MPI 1.4.4 with BLCR 0.8.4
> 5. I have attached “ompi_info.txt”
>  
> Thanks
> Tom
>  
> <ompi_info.txt>_______________________________________________
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users

