Many thanks for your help; it was not clear to me whether it was OPAL, my application, or the standard C libraries that caused the segfault. It is already good news that the problem is not at the level of Open MPI, since that would have meant upgrading the library. My first reaction would be to say that there is nothing wrong with my code (which has already passed the valgrind test) and that the problem must be in the libc, but I agree with you that this is very unlikely, especially given that we do some remapping of the memory. Hence I will take a second look with valgrind and a third with efence, and see whether some bug has managed to survive the extensive testing the code has undergone so far.

Thanks again,
Biagio

George Bosilca wrote:
Absolutely :) The last few entries on the stack are from OPAL (one of the Open MPI libraries), which traps the segfault. Everything else indicates where the segfault happened. What I can tell from this stack trace is the following: the problem started in your function wait_thread, which called one of the functions from libstdc++ (based on the C++ name mangling and the frame _ZNSt13basic_filebufIcSt11char_traitsIcEE4openEPKcSt13_ I guess it was open), which called some undetermined function from the libc, which segfaulted.
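For reference, the full mangled name in frame [5] of the trace below demangles to std::basic_filebuf<char, std::char_traits<char> >::open(char const*, std::_Ios_Openmode). A minimal sketch of the kind of call that produces that chain of frames is shown here; the routine and file name are hypothetical stand-ins, not taken from the application.

#include <fstream>

// Hypothetical stand-in for the wait_thread_ routine seen in the backtrace;
// the file name is invented for illustration only.
void wait_thread_example()
{
    std::ofstream log;
    // std::ofstream::open goes through std::basic_filebuf<char>::open, which
    // calls std::__basic_file<char>::open and then into libc, where the I/O
    // buffer is allocated with malloc -- the chain visible in frames [5]
    // down to [1] of the trace below.
    log.open("status.log");
    log << "checkpoint reached\n";
}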

It is pretty strange to segfault in a standard function; they are usually pretty well protected, unless you do something blatantly wrong (such as messing up the memory). I suggest using a memory-checking tool such as valgrind to check the memory consistency of your application.
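As a concrete illustration of this failure mode (generic example, not code from the thread): an out-of-bounds write past a heap block corrupts the allocator's bookkeeping, the program keeps running, and the crash only appears on some later allocation, which is consistent with the segfault landing inside _int_malloc in the trace below.

#include <cstdlib>
#include <cstring>

int main()
{
    // 16 bytes requested, but the string copied below is about 50 bytes long:
    // the copy runs well past the allocation and overwrites the allocator's
    // bookkeeping for the neighbouring chunk.
    char *name = static_cast<char *>(std::malloc(16));
    std::strcpy(name, "this string is clearly longer than sixteen bytes");

    // The program often survives the bad write itself; the damage surfaces
    // only when the allocator next touches the corrupted metadata, e.g. on a
    // later malloc or free -- potentially much later and in unrelated code.
    char *other = static_cast<char *>(std::malloc(64));

    std::free(other);
    std::free(name);
    return 0;
}

Under valgrind the invalid write is reported at the strcpy itself, long before anything crashes; Electric Fence catches the same class of bug by placing each allocation against a protected guard page, so the overflow faults at the offending instruction.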

  george.

On Mar 5, 2009, at 17:37, Biagio Lucini wrote:

We have an application that runs for a very long time with 16 processes (on the order of a few months; we do have checkpoints, but that is not the issue here). Twice so far it has failed with the error message appended below after running undisturbed for 20-25 days. The error is not systematically reproducible, and I believe this is not just because the program is parallel. We use openmpi-1.2.5 as distributed with Scientific Linux (a RHEL 5.2 clone), on which our cluster is based. Does this stack suggest anything to eyes more trained than mine?

Many thanks,
Biagio Lucini

-----------------------------------------------------------------------------------------------------------------------------------------

[node20:04178] *** Process received signal ***
[node20:04178] Signal: Segmentation fault (11)
[node20:04178] Signal code: Address not mapped (1)
[node20:04178] Failing at address: 0x2aaadb8b31a0
[node20:04178] [ 0] /lib64/libpthread.so.0 [0x2b5d9c3ebe80]
[node20:04178] [ 1] /usr/lib64/openmpi/1.2.5-gcc/lib/libopen-pal.so.0(_int_malloc+0x1d4) [0x2b5d9ccb2f84]
[node20:04178] [ 2] /usr/lib64/openmpi/1.2.5-gcc/lib/libopen-pal.so.0(malloc+0x93) [0x2b5d9ccb4d93]
[node20:04178] [ 3] /lib64/libc.so.6 [0x2b5d9d77729a]
[node20:04178] [ 4] /usr/lib64/libstdc++.so.6(_ZNSt12__basic_fileIcE4openEPKcSt13_Ios_Openmodei+0x54) [0x2b5d9bf05cb4]
[node20:04178] [ 5] /usr/lib64/libstdc++.so.6(_ZNSt13basic_filebufIcSt11char_traitsIcEE4openEPKcSt13_Ios_Openmode+0x83) [0x2b5d9beb45c3]
[node20:04178] [ 6] ./k-string(wait_thread_+0x2a1) [0x42e101]
[node20:04178] [ 7] ./k-string(MAIN__+0x2a72) [0x4212d2]
[node20:04178] [ 8] ./k-string(main+0xe) [0x42e2ce]
[node20:04178] [ 9] /lib64/libc.so.6(__libc_start_main+0xf4) [0x2b5d9d7338b4]
[node20:04178] [10] ./k-string(__gxx_personality_v0+0xb9) [0x404719]
[node20:04178] *** End of error message ***
mpirun noticed that job rank 0 with PID 4152 on node node19 exited on signal 15 (Terminated).

_______________________________________________
users mailing list
us...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/users
