I work for a supercomputing organization, and we just installed the latest
version of Rocks/CentOS on our cluster.  We compiled Open MPI from source to
customize it for our purposes.  Since switching, we have received messages
from users about errors, segfaults, etc. that we didn't see before.  Here is
one such segfault message; I don't have enough knowledge to figure out what
is going on, or even where to start:

[m4b-1-8:11830] *** Process received signal ***
[m4b-1-8:11830] Signal: Segmentation fault (11)
[m4b-1-8:11830] Signal code: Address not mapped (1)
[m4b-1-8:11830] Failing at address: 0x2abcdff475b0
[m4b-1-8:11830] [ 0] /lib64/libpthread.so.0 [0x33e8c0de70]
[m4b-1-8:11830] [ 1] /usr/mpi/fsl_openmpi_gcc_1.2.6/lib/libmpi.so.0(mca_btl_sm_send+0xf1) [0x2aaaaab541d1]
[m4b-1-8:11830] [ 2] /usr/mpi/fsl_openmpi_gcc_1.2.6/lib/libmpi.so.0(mca_pml_ob1_send_request_start_copy+0x15e) [0x2aaaaaba0e2e]
[m4b-1-8:11830] [ 3] /usr/mpi/fsl_openmpi_gcc_1.2.6/lib/libmpi.so.0(mca_pml_ob1_isend+0x217) [0x2aaaaab9be37]
[m4b-1-8:11830] [ 4] /usr/mpi/fsl_openmpi_gcc_1.2.6/lib/libmpi.so.0(ompi_coll_tuned_sendrecv_actual+0xda) [0x2aaaaab5acaa]
[m4b-1-8:11830] [ 5] /usr/mpi/fsl_openmpi_gcc_1.2.6/lib/libmpi.so.0(ompi_coll_tuned_barrier_intra_bruck+0x9f) [0x2aaaaab5f81f]
[m4b-1-8:11830] [ 6] /usr/mpi/fsl_openmpi_gcc_1.2.6/lib/libmpi.so.0(PMPI_Barrier+0x6f) [0x2aaaaab1eadf]
[m4b-1-8:11830] [ 7] /fslhome/wshuai/compute/for_Shuai2/mpi_md_bgo_twham_12sept08_debug(main+0x5d9) [0x413179]
[m4b-1-8:11830] [ 8] /lib64/libc.so.6(__libc_start_main+0xf4) [0x33e841d8a4]
[m4b-1-8:11830] [ 9] /fslhome/wshuai/compute/for_Shuai2/mpi_md_bgo_twham_12sept08_debug [0x404109]
[m4b-1-8:11830] *** End of error message ***
[m4b-1-9:11772] *** Process received signal ***
[m4b-1-9:11772] Signal: Segmentation fault (11)
[m4b-1-9:11772] Signal code: Address not mapped (1)
[m4b-1-9:11772] Failing at address: 0x2abcdff475b0
[m4b-1-9:11772] [ 0] /lib64/libpthread.so.0 [0x302380de70]
[m4b-1-9:11772] [ 1] /usr/mpi/fsl_openmpi_gcc_1.2.6/lib/libmpi.so.0(mca_btl_sm_send+0xf1) [0x2aaaaab541d1]
[m4b-1-9:11772] [ 2] /usr/mpi/fsl_openmpi_gcc_1.2.6/lib/libmpi.so.0(mca_pml_ob1_send_request_start_copy+0x15e) [0x2aaaaaba0e2e]
[m4b-1-9:11772] [ 3] /usr/mpi/fsl_openmpi_gcc_1.2.6/lib/libmpi.so.0(mca_pml_ob1_isend+0x217) [0x2aaaaab9be37]
[m4b-1-9:11772] [ 4] /usr/mpi/fsl_openmpi_gcc_1.2.6/lib/libmpi.so.0(ompi_coll_tuned_sendrecv_actual+0xda) [0x2aaaaab5acaa]
[m4b-1-9:11772] [ 5] /usr/mpi/fsl_openmpi_gcc_1.2.6/lib/libmpi.so.0(ompi_coll_tuned_barrier_intra_bruck+0x9f) [0x2aaaaab5f81f]
[m4b-1-9:11772] [ 6] /usr/mpi/fsl_openmpi_gcc_1.2.6/lib/libmpi.so.0(PMPI_Barrier+0x6f) [0x2aaaaab1eadf]
[m4b-1-9:11772] [ 7] /fslhome/wshuai/compute/for_Shuai2/mpi_md_bgo_twham_12sept08_debug(main+0x5d9) [0x413179]
[m4b-1-9:11772] [ 8] /lib64/libc.so.6(__libc_start_main+0xf4) [0x302301d8a4]
[m4b-1-9:11772] [ 9] /fslhome/wshuai/compute/for_Shuai2/mpi_md_bgo_twham_12sept08_debug [0x404109]
[m4b-1-9:11772] *** End of error message ***
[m4b-1-7i][0,1,7][btl_tcp_endpoint.c:572:mca_btl_tcp_endpoint_complete_connect] connect() failed with errno=111
[m4b-1-7i][0,1,8][btl_tcp_endpoint.c:572:mca_btl_tcp_endpoint_complete_connect] connect() failed with errno=111
[m4b-1-7i][0,1,9][btl_tcp_frag.c:202:mca_btl_tcp_frag_recv] mca_btl_tcp_frag_recv: readv failed with errno=104
[m4b-1-7i][0,1,9][btl_tcp_endpoint.c:572:mca_btl_tcp_endpoint_complete_connect] connect() failed with errno=111
[m4b-1-9:11773] *** Process received signal ***
[m4b-1-9:11773] Signal: Segmentation fault (11)
[m4b-1-9:11773] Signal code: Address not mapped (1)
[m4b-1-9:11773] Failing at address: 0x2abcdff475b0
[m4b-1-9:11773] [ 0] /lib64/libpthread.so.0 [0x302380de70]
[m4b-1-9:11773] [ 1] /usr/mpi/fsl_openmpi_gcc_1.2.6/lib/libmpi.so.0(mca_btl_sm_send+0xf1) [0x2aaaaab541d1]
[m4b-1-9:11773] [ 2] /usr/mpi/fsl_openmpi_gcc_1.2.6/lib/libmpi.so.0(mca_pml_ob1_send_request_start_copy+0x15e) [0x2aaaaaba0e2e]
[m4b-1-9:11773] [ 3] /usr/mpi/fsl_openmpi_gcc_1.2.6/lib/libmpi.so.0(mca_pml_ob1_isend+0x217) [0x2aaaaab9be37]
[m4b-1-9:11773] [ 4] /usr/mpi/fsl_openmpi_gcc_1.2.6/lib/libmpi.so.0(ompi_coll_tuned_sendrecv_actual+0xda) [0x2aaaaab5acaa]
[m4b-1-9:11773] [ 5] /usr/mpi/fsl_openmpi_gcc_1.2.6/lib/libmpi.so.0(ompi_coll_tuned_barrier_intra_bruck+0x9f) [0x2aaaaab5f81f]
[m4b-1-9:11773] [ 6] /usr/mpi/fsl_openmpi_gcc_1.2.6/lib/libmpi.so.0(PMPI_Barrier+0x6f) [0x2aaaaab1eadf]
[m4b-1-9:11773] [ 7] /fslhome/wshuai/compute/for_Shuai2/mpi_md_bgo_twham_12sept08_debug(main+0x5d9) [0x413179]
[m4b-1-9:11773] [ 8] /lib64/libc.so.6(__libc_start_main+0xf4) [0x302301d8a4]
[m4b-1-9:11773] [ 9] /fslhome/wshuai/compute/for_Shuai2/mpi_md_bgo_twham_12sept08_debug [0x404109]
[m4b-1-9:11773] *** End of error message ***
orterun noticed that job rank 0 with PID 12338 on node m4b-1-10i exited on signal 15 (Terminated).

Can someone give me some clues as to what is going wrong here, or point me
in the right direction?  As far as I can tell, every crashed process dies
inside mca_btl_sm_send (the shared-memory transport) underneath an
MPI_Barrier call, and the TCP errors look like connections to peers that
have already died (errno 111 is ECONNREFUSED, 104 is ECONNRESET), but that
is as far as my knowledge goes.  Is there something I or the user can do to
get more informative error messages?  The user mentioned that this
particular program ran fine before we upgraded to the current Open MPI
version, and that he can't find any bugs in his code.
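
One thing I plan to try is a bare-bones reproducer, since the trace dies
inside MPI_Barrier itself.  Here is a minimal sketch (my own test program,
not the user's code, which I don't have access to) that just loops over
MPI_Barrier; built with mpicc -g, any crash should at least carry symbols:

    /* barrier_test.c -- minimal check of whether a bare MPI_Barrier
     * crashes under our new build.
     * Build: mpicc -g -o barrier_test barrier_test.c */
    #include <stdio.h>
    #include <mpi.h>

    int main(int argc, char **argv)
    {
        int rank, size, i;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        /* The user's trace dies in MPI_Barrier, so just hammer it. */
        for (i = 0; i < 1000; i++)
            MPI_Barrier(MPI_COMM_WORLD);

        printf("rank %d of %d passed all barriers\n", rank, size);
        MPI_Finalize();
        return 0;
    }

If that survives on the same nodes, the problem is more likely in how the
application drives MPI; if it doesn't, we have a small test case against
our Open MPI build.  I'm also planning to have the user rebuild with -g,
raise the core file limit (ulimit -c unlimited), and open any core file in
gdb, so the bare addresses above turn into file and line numbers.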

Thanks for your help,

Daniel Hansen
Systems Administrator
BYU Fulton Supercomputing Lab
