Many thanks for your help; it was not clear to me whether it was OPAL, my application, or the standard C libraries that caused the segfault. It is already good news that the problem is not at the level of Open MPI, since that would have meant upgrading the library. My first reaction would be to say that there is nothing wrong with my code (which has already passed the valgrind test) and that the problem must be in the libc, but I agree with you that this is very unlikely, especially given that we do some remapping of the memory. Hence I will take a second look with valgrind and a third with efence, and see whether some bug has managed to survive the extensive testing the code has undergone so far.

Thanks again,
Biagio

George Bosilca wrote:
Absolutely :) The last few entries on the stack are from OPAL (one of the Open MPI libraries), which traps the segfault. Everything else indicates where the segfault happened. What I can tell from this stack trace is the following: the problem started in your function wait_thread, which called one of the functions from libstdc++ (based on the C++ name mangling and the frame _ZNSt13basic_filebufIcSt11char_traitsIcEE4openEPKcSt13_ I guess it was open), which called some undetermined function from the libc, which segfaulted.
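For reference, the full mangled name in frame [5] of the trace below demangles to std::basic_filebuf<char, std::char_traits<char> >::open(char const*, std::_Ios_Openmode). A minimal sketch of the kind of call that produces that chain of frames is shown here; the routine and file name are hypothetical stand-ins, not taken from the application.

#include <fstream>

// Hypothetical stand-in for the wait_thread_ routine seen in the backtrace;
// the file name is invented for illustration only.
void wait_thread_example()
{
    std::ofstream log;
    // std::ofstream::open goes through std::basic_filebuf<char>::open, which
    // calls std::__basic_file<char>::open and then into libc, where the I/O
    // buffer is allocated with malloc -- the chain visible in frames [5]
    // down to [1] of the trace below.
    log.open("status.log");
    log << "checkpoint reached\n";
}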

It is pretty strange to segfault in a standard function; they are usually pretty well protected, unless you do something blatantly wrong (such as messing up the memory). I suggest using a memory-checking tool such as valgrind to check the memory consistency of your application.
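As a concrete illustration of this failure mode (generic example, not code from the thread): an out-of-bounds write past a heap block corrupts the allocator's bookkeeping, the program keeps running, and the crash only appears on some later allocation, which is consistent with the segfault landing inside _int_malloc in the trace below.

#include <cstdlib>
#include <cstring>

int main()
{
    // 16 bytes requested, but the string copied below is about 50 bytes long:
    // the copy runs well past the allocation and overwrites the allocator's
    // bookkeeping for the neighbouring chunk.
    char *name = static_cast<char *>(std::malloc(16));
    std::strcpy(name, "this string is clearly longer than sixteen bytes");

    // The program often survives the bad write itself; the damage surfaces
    // only when the allocator next touches the corrupted metadata, e.g. on a
    // later malloc or free -- potentially much later and in unrelated code.
    char *other = static_cast<char *>(std::malloc(64));

    std::free(other);
    std::free(name);
    return 0;
}

Under valgrind the invalid write is reported at the strcpy itself, long before anything crashes; Electric Fence catches the same class of bug by placing each allocation against a protected guard page, so the overflow faults at the offending instruction.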

  george.

On Mar 5, 2009, at 17:37, Biagio Lucini wrote:

We have an application that runs for a very long time with 16 processes (on the order of a few months; we do have checkpoints, but that is not the issue here). Twice so far it has failed with the error message appended below after running undisturbed for 20-25 days. The error is not systematically reproducible, and I believe this is not just because the program is parallel. We use openmpi-1.2.5 as distributed with Scientific Linux (a RHEL 5.2 clone), on which our cluster is based. Does this stack suggest anything to eyes more trained than mine?

Many thanks,
Biagio Lucini

-----------------------------------------------------------------------------------------------------------------------------------------

[node20:04178] *** Process received signal ***
[node20:04178] Signal: Segmentation fault (11)
[node20:04178] Signal code: Address not mapped (1)
[node20:04178] Failing at address: 0x2aaadb8b31a0
[node20:04178] [ 0] /lib64/libpthread.so.0 [0x2b5d9c3ebe80]
[node20:04178] [ 1] /usr/lib64/openmpi/1.2.5-gcc/lib/libopen-pal.so.0(_int_malloc+0x1d4) [0x2b5d9ccb2f84]
[node20:04178] [ 2] /usr/lib64/openmpi/1.2.5-gcc/lib/libopen-pal.so.0(malloc+0x93) [0x2b5d9ccb4d93]
[node20:04178] [ 3] /lib64/libc.so.6 [0x2b5d9d77729a]
[node20:04178] [ 4] /usr/lib64/libstdc++.so.6(_ZNSt12__basic_fileIcE4openEPKcSt13_Ios_Openmodei+0x54) [0x2b5d9bf05cb4]
[node20:04178] [ 5] /usr/lib64/libstdc++.so.6(_ZNSt13basic_filebufIcSt11char_traitsIcEE4openEPKcSt13_Ios_Openmode+0x83) [0x2b5d9beb45c3]
[node20:04178] [ 6] ./k-string(wait_thread_+0x2a1) [0x42e101]
[node20:04178] [ 7] ./k-string(MAIN__+0x2a72) [0x4212d2]
[node20:04178] [ 8] ./k-string(main+0xe) [0x42e2ce]
[node20:04178] [ 9] /lib64/libc.so.6(__libc_start_main+0xf4) [0x2b5d9d7338b4]
[node20:04178] [10] ./k-string(__gxx_personality_v0+0xb9) [0x404719]
[node20:04178] *** End of error message ***
mpirun noticed that job rank 0 with PID 4152 on node node19 exited on signal 15 (Terminated).

_______________________________________________
users mailing list
us...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/users
