OK, we can try that.

Thanks
Tom

-----Original Message-----
From: users-boun...@open-mpi.org [mailto:users-boun...@open-mpi.org] On Behalf Of Josh Hursey
Sent: Thursday, March 29, 2012 11:22 AM
To: Open MPI Users
Subject: Re: [OMPI users] Segmentation fault when checkpointing
This is a bit of a non-answer, but can you try the 1.5 series (1.5.5 is
the current release)? 1.4 is being phased out, and 1.5 will replace it
in the near future. 1.5 has a number of C/R related fixes that might help.

-- Josh

On Thu, Mar 29, 2012 at 1:12 PM, Linton, Tom <tom.lin...@intel.com> wrote:
> We have a legacy application that runs fine on our cluster using Intel
> MPI with hundreds of cores. We ported it to Open MPI so that we could
> use BLCR, and it runs fine, but checkpointing is not working properly:
>
> 1. When we checkpoint with more than 1 core, each MPI rank reports a
> segmentation fault and the ompi-checkpoint command does not return.
> For example, with two cores we get:
>
> [tscco28017:16352] *** Process received signal ***
> [tscco28017:16352] Signal: Segmentation fault (11)
> [tscco28017:16352] Signal code: Address not mapped (1)
> [tscco28017:16352] Failing at address: 0x7fffef51
> [tscco28017:16353] *** Process received signal ***
> [tscco28017:16353] Signal: Segmentation fault (11)
> [tscco28017:16353] Signal code: Address not mapped (1)
> [tscco28017:16353] Failing at address: 0x7fffef51
> [tscco28017:16353] [ 0] /lib64/libpthread.so.0(+0xf5d0) [0x7ffff698e5d0]
> [tscco28017:16353] [ 1] [0xf500b0]
> [tscco28017:16353] *** End of error message ***
> [tscco28017:16352] [ 0] /lib64/libpthread.so.0(+0xf5d0) [0x7ffff698e5d0]
> [tscco28017:16352] [ 1] [0xf500b0]
> [tscco28017:16352] *** End of error message ***
> --------------------------------------------------------------------------
> mpirun noticed that process rank 1 with PID 16353 on node tscco28017
> exited on signal 11 (Segmentation fault).
> --------------------------------------------------------------------------
>
> When I run the TotalView debugger on a resulting core file (I assume
> it is from the rank 0 process), TotalView reports a null frame pointer
> and the stack is trashed (gdb shows a backtrace with 30 frames but no
> debug info).
>
> 2. Checkpointing the legacy program with 1 core works.
>
> 3. Checkpointing a simple test program on 16 cores works.
>
> Can you suggest how to debug this problem?
>
> Some additional information:
>
> - I execute the program like this:
>   mpirun -am ft-enable-cr -n 2 -machinefile machines program inputfile
> - We are using Open MPI 1.4.4 with BLCR 0.8.4.
> - Open MPI and the application were both compiled on the same machine
>   using the Intel icc 12.0.4 compiler.
> - For the failing example, both MPI processes are running on cores of
>   the same node.
> - I have attached "ompi_info.txt".
> - We're running on a single Xeon 5150 node with Gigabit Ethernet.
> - [Reuti: previously I reported a problem involving illegal
>   instructions, but that turned out to be a build problem. Sorry I
>   didn't answer your response to my previous thread; I was having
>   problems accessing this email list at the time.]
>
> Thanks
> Tom
>
> _______________________________________________
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users

--
Joshua Hursey
Postdoctoral Research Associate
Oak Ridge National Laboratory
http://users.nccs.gov/~jjhursey

_______________________________________________
users mailing list
us...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/users
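
For reference, the BLCR-based checkpoint/restart workflow described above
goes roughly like this (a minimal sketch; the mpirun PID 12345 and the
snapshot name are placeholders, not values taken from this thread):

    # Launch the job with checkpoint/restart support enabled via the
    # ft-enable-cr AMCA parameter file
    mpirun -am ft-enable-cr -n 2 -machinefile machines program inputfile

    # From another shell, checkpoint the running job by the PID of mpirun
    ompi-checkpoint 12345

    # Later, restart from the global snapshot that ompi-checkpoint reports
    ompi-restart ompi_global_snapshot_12345.ckpt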
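
To get more out of the core files than a trashed stack, a typical gdb
session would look something like the following (assuming the application
was rebuilt with -g so frames can be resolved; the core file name is
hypothetical and depends on the system's core_pattern):

    # Load the application binary together with one of the core files
    gdb ./program core.16352

    # Then, at the (gdb) prompt:
    (gdb) bt
    (gdb) info registers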