Many thanks for your help. It was not clear to me whether it was OPAL,
my application, or the standard C libs that were causing the segfault. It
is already good news that the problem is not at the level of Open MPI,
since that would have meant upgrading the library. My first reaction
would be to

Absolutely :) The last few entries on the stack are from OPAL (one of
the Open MPI libraries) that trap the segfault. Everything else
indicates where the segfault happened. What I can tell from this stack
trace is the following: the problem started in your function
wait_thread which called

We have an application that runs for a very long time with 16 processes
(on the order of a few months; we do have checkpoints, but this won't
be the issue). It has happened twice that it fails with the error
message appended below after running undisturbed for 20-25 days.