This is a bit of a non-answer, but can you try the 1.5 series (1.5.5 is the current release)? The 1.4 series is being phased out, and 1.5 will replace it in the near future. 1.5 has a number of C/R-related fixes that might help.
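If you do try 1.5, here is a rough sketch of a build with BLCR support; the BLCR and install prefixes below are placeholders, so point --with-blcr and --prefix at your actual locations:

    # Build Open MPI 1.5.5 with checkpoint/restart support against BLCR.
    # /usr/local/blcr and /opt/openmpi-1.5.5 are placeholder paths.
    tar xjf openmpi-1.5.5.tar.bz2
    cd openmpi-1.5.5
    ./configure --with-ft=cr --with-blcr=/usr/local/blcr \
                --enable-ft-thread --enable-mpi-threads \
                --prefix=/opt/openmpi-1.5.5
    make -j4 all install

The --enable-ft-thread option needs thread support enabled, hence the --enable-mpi-threads flag alongside it.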
-- Josh

On Thu, Mar 29, 2012 at 1:12 PM, Linton, Tom <tom.lin...@intel.com> wrote:
> We have a legacy application that runs fine on our cluster using Intel MPI
> with hundreds of cores. We ported it to Open MPI so that we could use BLCR,
> and it runs fine, but checkpointing is not working properly:
>
> 1. When we checkpoint with more than 1 core, each MPI rank reports a
> segmentation fault and the ompi-checkpoint command does not return. For
> example, with two cores we get:
>
> [tscco28017:16352] *** Process received signal ***
> [tscco28017:16352] Signal: Segmentation fault (11)
> [tscco28017:16352] Signal code: Address not mapped (1)
> [tscco28017:16352] Failing at address: 0x7fffef51
> [tscco28017:16353] *** Process received signal ***
> [tscco28017:16353] Signal: Segmentation fault (11)
> [tscco28017:16353] Signal code: Address not mapped (1)
> [tscco28017:16353] Failing at address: 0x7fffef51
> [tscco28017:16353] [ 0] /lib64/libpthread.so.0(+0xf5d0) [0x7ffff698e5d0]
> [tscco28017:16353] [ 1] [0xf500b0]
> [tscco28017:16353] *** End of error message ***
> [tscco28017:16352] [ 0] /lib64/libpthread.so.0(+0xf5d0) [0x7ffff698e5d0]
> [tscco28017:16352] [ 1] [0xf500b0]
> [tscco28017:16352] *** End of error message ***
> --------------------------------------------------------------------------
> mpirun noticed that process rank 1 with PID 16353 on node tscco28017
> exited on signal 11 (Segmentation fault).
> --------------------------------------------------------------------------
>
> When I run the TotalView debugger on a resulting core file (I assume it's
> from the rank 0 process), TotalView reports a null frame pointer and the
> stack is trashed (gdb shows a backtrace with 30 frames but no debug info).
>
> 2. Checkpointing the legacy program with 1 core works.
>
> 3. Checkpointing a simple test program on 16 cores works.
>
> Can you suggest how to debug this problem?
>
> Some additional information:
>
> · I execute the program like this:
>   mpirun -am ft-enable-cr -n 2 -machinefile machines program inputfile
> · We are using Open MPI 1.4.4 with BLCR 0.8.4.
> · Open MPI and the application were both compiled on the same machine
>   using the Intel icc 12.0.4 compiler.
> · For the failing example, both MPI processes are running on cores of the
>   same node.
> · I have attached “ompi_info.txt”.
> · We’re running on a single Xeon 5150 node with Gigabit Ethernet.
> · [Reuti: previously I reported a problem involving illegal instructions,
>   but this turned out to be a build problem. Sorry I didn’t answer your
>   response to my previous thread, but I was having problems accessing this
>   email list at that time.]
>
> Thanks
> Tom
>
> _______________________________________________
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users

-- 
Joshua Hursey
Postdoctoral Research Associate
Oak Ridge National Laboratory
http://users.nccs.gov/~jjhursey
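For reference, the checkpoint/restart cycle under discussion looks roughly like this; the mpirun PID and the snapshot name are placeholders, and ompi-checkpoint prints the actual snapshot reference when it completes:

    # Launch the job with the C/R AMCA parameter file (Tom's command line)
    mpirun -am ft-enable-cr -n 2 -machinefile machines program inputfile

    # From another shell, checkpoint the job via the PID of mpirun
    ompi-checkpoint <mpirun-pid>           # checkpoint, leave the job running
    ompi-checkpoint --term <mpirun-pid>    # checkpoint, then terminate

    # Restart from the global snapshot reference that ompi-checkpoint printed
    ompi-restart ompi_global_snapshot_<mpirun-pid>.ckpt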