OK, we can try that.

Thanks
Tom

-----Original Message-----
From: users-boun...@open-mpi.org [mailto:users-boun...@open-mpi.org] On Behalf Of Josh Hursey
Sent: Thursday, March 29, 2012 11:22 AM
To: Open MPI Users
Subject: Re: [OMPI users] Segmentation fault when checkpointing
This is a bit of a non-answer, but can you try the 1.5 series (1.5.5 is
the current release)? 1.4 is being phased out, and 1.5 will replace it
in the near future. 1.5 has a number of C/R related fixes that might help.

-- Josh

On Thu, Mar 29, 2012 at 1:12 PM, Linton, Tom <tom.lin...@intel.com> wrote:
> We have a legacy application that runs fine on our cluster using Intel
> MPI with hundreds of cores. We ported it to Open MPI so that we could
> use BLCR, and it runs fine, but checkpointing is not working properly:
>
> 1. When we checkpoint with more than 1 core, each MPI rank reports a
> segmentation fault and the ompi-checkpoint command does not return.
> For example, with two cores we get:
>
> [tscco28017:16352] *** Process received signal ***
> [tscco28017:16352] Signal: Segmentation fault (11)
> [tscco28017:16352] Signal code: Address not mapped (1)
> [tscco28017:16352] Failing at address: 0x7fffef51
> [tscco28017:16353] *** Process received signal ***
> [tscco28017:16353] Signal: Segmentation fault (11)
> [tscco28017:16353] Signal code: Address not mapped (1)
> [tscco28017:16353] Failing at address: 0x7fffef51
> [tscco28017:16353] [ 0] /lib64/libpthread.so.0(+0xf5d0) [0x7ffff698e5d0]
> [tscco28017:16353] [ 1] [0xf500b0]
> [tscco28017:16353] *** End of error message ***
> [tscco28017:16352] [ 0] /lib64/libpthread.so.0(+0xf5d0) [0x7ffff698e5d0]
> [tscco28017:16352] [ 1] [0xf500b0]
> [tscco28017:16352] *** End of error message ***
> --------------------------------------------------------------------------
> mpirun noticed that process rank 1 with PID 16353 on node tscco28017
> exited on signal 11 (Segmentation fault).
> --------------------------------------------------------------------------
>
> When I run the TotalView debugger on a resulting core file (I assume
> it is from the rank 0 process), TotalView reports a null frame pointer
> and the stack is trashed (gdb shows a backtrace with 30 frames but no
> debug info).
>
> 2. Checkpointing the legacy program with 1 core works.
>
> 3. Checkpointing a simple test program on 16 cores works.
>
> Can you suggest how to debug this problem?
>
> Some additional information:
>
> - I execute the program like this:
>   mpirun -am ft-enable-cr -n 2 -machinefile machines program inputfile
> - We are using Open MPI 1.4.4 with BLCR 0.8.4.
> - Open MPI and the application were both compiled on the same machine
>   using the Intel icc 12.0.4 compiler.
> - For the failing example, both MPI processes are running on cores of
>   the same node.
> - I have attached "ompi_info.txt".
> - We're running on a single Xeon 5150 node with Gigabit Ethernet.
> - [Reuti: previously I reported a problem involving illegal
>   instructions, but that turned out to be a build problem. Sorry I
>   didn't answer your response to my previous thread; I was having
>   problems accessing this email list at the time.]
>
> Thanks
> Tom
>
> _______________________________________________
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users

--
Joshua Hursey
Postdoctoral Research Associate
Oak Ridge National Laboratory
http://users.nccs.gov/~jjhursey

_______________________________________________
users mailing list
us...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/users
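
For reference, the BLCR-based checkpoint/restart workflow described above
goes roughly like this (a minimal sketch; the mpirun PID 12345 and the
snapshot name are placeholders, not values taken from this thread):

    # Launch the job with checkpoint/restart support enabled via the
    # ft-enable-cr AMCA parameter file
    mpirun -am ft-enable-cr -n 2 -machinefile machines program inputfile

    # From another shell, checkpoint the running job by the PID of mpirun
    ompi-checkpoint 12345

    # Later, restart from the global snapshot that ompi-checkpoint reports
    ompi-restart ompi_global_snapshot_12345.ckpt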
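
To get more out of the core files than a trashed stack, a typical gdb
session would look something like the following (assuming the application
was rebuilt with -g so frames can be resolved; the core file name is
hypothetical and depends on the system's core_pattern):

    # Load the application binary together with one of the core files
    gdb ./program core.16352

    # Then, at the (gdb) prompt:
    (gdb) bt
    (gdb) info registers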