Hi,

On 08.03.2012 at 19:02, Linton, Tom wrote:

> We have a legacy application that runs fine on our cluster using Intel MPI.
> We ported it to OpenMPI so that we could use BLCR and it runs fine but
> checkpointing is not working properly:
>
> 1. when we checkpoint with more than 1 core, it executes with the error:
>    mpirun noticed that process rank 1 with PID 8260 on node
>    tscco28017 exited on signal 4 (Illegal instruction).

Were the application and Open MPI compiled on one and the same machine, and is the CPU type the same across all the nodes involved?

-- Reuti

> 2. checkpointing with 1 core works
> 3. we have a simple test program that exercises MPI with multiple cores and
>    it checkpoints fine on multiple cores
>
> Can you suggest how to debug this problem?
>
> Some additional information:
>
> 1. I execute the program like this: mpirun -am ft-enable-cr -n 2 -machinefile
>    machines program inputfile
> 2. when I checkpoint it, I see that the checkpoint directories are created
>    but the file “global_snapshot_meta.data” is not complete, there is no
>    restart-appfile, the “snapshot_meta.data” files are not complete, and there
>    are no dump files for the individual processes.
> 3. the command “ompi-checkpoint” doesn’t return; I have to control-C to kill
>    it after checkpointing.
> 4. We are using Open MPI 1.4.4 with BLCR 0.8.4
> 5. I have attached “ompi_info.txt”
>
> Thanks
> Tom
>
> <ompi_info.txt>_______________________________________________
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users
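
PS: On the CPU-type question above, one way to check it is to compare the CPU feature flags of the nodes listed in the machinefile. The following is only a minimal sketch, not a definitive tool. It assumes Linux nodes whose /proc/cpuinfo contains a "flags" line (as on x86), and that a copy of each node's /proc/cpuinfo has been saved to a local file; the function names and file names are hypothetical. A difference in flags is only a hint: signal 4 can have other causes as well.

```python
import sys


def parse_cpu_flags(cpuinfo_text):
    """Return the set of CPU feature flags found in /proc/cpuinfo-style text.

    Only the first "flags" line is used; an empty set is returned if the
    text has no such line (assumption: x86 Linux layout).
    """
    for line in cpuinfo_text.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip() == "flags":
            return set(value.split())
    return set()


def flag_differences(flags_by_node):
    """Map each node to the sorted flags it lacks relative to the union of all nodes.

    Nodes that lack nothing are left out, so an empty result means that
    all nodes report the same flag set.
    """
    all_flags = set().union(*flags_by_node.values())
    missing = {}
    for node, flags in flags_by_node.items():
        lacking = sorted(all_flags - flags)
        if lacking:
            missing[node] = lacking
    return missing


if __name__ == "__main__":
    # Hypothetical usage: python cpuflags.py nodeA.cpuinfo nodeB.cpuinfo
    # where each file is a saved copy of one node's /proc/cpuinfo.
    flags_by_node = {}
    for path in sys.argv[1:]:
        with open(path) as handle:
            flags_by_node[path] = parse_cpu_flags(handle.read())
    for node, lacking in flag_differences(flags_by_node).items():
        print(node, "lacks:", " ".join(lacking))
```

If a node lacks flags that the build machine has (for example a newer SSE level), a binary compiled with host-specific optimizations may hit an illegal instruction there.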