Many thanks for your help; it was not clear to me whether it was OPAL,
my application, or the standard C libraries that was causing the segfault.
It is already good news that the problem is not at the level of Open MPI,
since that would have meant upgrading the library. My first reaction
would be to say that there is nothing wrong with my code (which has
already passed the valgrind test) and that the problem must be in libc,
but I agree with you that this is a very unlikely possibility,
especially given that we do some remapping of the memory. Hence, I will
take a second look with valgrind and a third with efence, and see
whether some bug has managed to survive the extensive testing the code
has undergone so far.
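For the efence pass, something along these lines should work (a sketch
only: the efence library name and the use of mpirun's -x flag to export
the variable to the other nodes may need adjusting for our installation,
and Open MPI's own allocator in libopen-pal, visible in the trace, may
still intercept some allocations):

  # preload Electric Fence so an out-of-bounds write faults at the bad access itself
  mpirun -x LD_PRELOAD=libefence.so.0.0 -np 16 ./k-string
  # alternatively, relink the application against efence by adding -lefence to the link line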
Thanks again,
Biagio
George Bosilca wrote:
Absolutely :) The last few entries on the stack are from OPAL (one of
the Open MPI libraries), which traps the segfault. Everything else
indicates where the segfault happened. What I can tell from this stack
trace is the following: the problem started in your function
wait_thread, which called one of the functions from libstdc++
(based on the C++ name mangling and the symbol on the stack,
_ZNSt13basic_filebufIcSt11char_traitsIcEE4openEPKcSt13_Ios_Openmode, I
guess it was open), which in turn called some undetermined function
from libc ... which segfaulted.
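For the record, the full symbol (reassembled from the wrapped lines in
the trace below) demangles with c++filt as follows, which confirms that
guess:

  $ echo '_ZNSt13basic_filebufIcSt11char_traitsIcEE4openEPKcSt13_Ios_Openmode' | c++filt
  std::basic_filebuf<char, std::char_traits<char> >::open(char const*, std::_Ios_Openmode)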
It is pretty strange to segfault in a standard function; they are
usually pretty well protected, unless you do something blatantly wrong
(such as messing up the memory). I suggest using a memory-checker tool
such as valgrind to check the memory consistency of your application.
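A minimal sketch of how to run it here, assuming the 16-rank job and
the ./k-string binary from the trace (options and paths may need
adjusting for your setup):

  # run every rank under valgrind; much slower, but invalid reads/writes are
  # reported with their origin, and %p gives each rank its own log file
  mpirun -np 16 valgrind --track-origins=yes --log-file=vg.%p.log ./k-string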
george.
On Mar 5, 2009, at 17:37, Biagio Lucini wrote:
We have an application that runs for a very long time with 16
processes (the run time is of the order of a few months; we do have
checkpoints, but that is not the issue here). Twice so far it has
failed with the error message appended below after running undisturbed
for 20-25 days. The error is not systematically reproducible, and I
believe this is not just because the program is parallel. We use
openmpi-1.2.5 as distributed in Scientific Linux (a RHEL 5.2 clone), on
which our cluster is based. Does this stack trace suggest anything to
eyes more trained than mine?
Many thanks,
Biagio Lucini
-----------------------------------------------------------------------------------------------------------------------------------------
[node20:04178] *** Process received signal ***
[node20:04178] Signal: Segmentation fault (11)
[node20:04178] Signal code: Address not mapped (1)
[node20:04178] Failing at address: 0x2aaadb8b31a0
[node20:04178] [ 0] /lib64/libpthread.so.0 [0x2b5d9c3ebe80]
[node20:04178] [ 1] /usr/lib64/openmpi/1.2.5-gcc/lib/libopen-pal.so.0(_int_malloc+0x1d4) [0x2b5d9ccb2f84]
[node20:04178] [ 2] /usr/lib64/openmpi/1.2.5-gcc/lib/libopen-pal.so.0(malloc+0x93) [0x2b5d9ccb4d93]
[node20:04178] [ 3] /lib64/libc.so.6 [0x2b5d9d77729a]
[node20:04178] [ 4] /usr/lib64/libstdc++.so.6(_ZNSt12__basic_fileIcE4openEPKcSt13_Ios_Openmodei+0x54) [0x2b5d9bf05cb4]
[node20:04178] [ 5] /usr/lib64/libstdc++.so.6(_ZNSt13basic_filebufIcSt11char_traitsIcEE4openEPKcSt13_Ios_Openmode+0x83) [0x2b5d9beb45c3]
[node20:04178] [ 6] ./k-string(wait_thread_+0x2a1) [0x42e101]
[node20:04178] [ 7] ./k-string(MAIN__+0x2a72) [0x4212d2]
[node20:04178] [ 8] ./k-string(main+0xe) [0x42e2ce]
[node20:04178] [ 9] /lib64/libc.so.6(__libc_start_main+0xf4) [0x2b5d9d7338b4]
[node20:04178] [10] ./k-string(__gxx_personality_v0+0xb9) [0x404719]
[node20:04178] *** End of error message ***
mpirun noticed that job rank 0 with PID 4152 on node node19 exited on
signal 15 (Terminated).
_______________________________________________
users mailing list
us...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/users