Classification: UNCLASSIFIED
Caveats: NONE

Possibly related to:
https://svn.open-mpi.org/trac/ompi/ticket/2904
http://www.open-mpi.org/community/lists/devel/2012/09/11509.php
I am attempting to link communicators from a series of programs together and am running into inconsistent behavior when using OpenMPI. Attached is a minimal example that reproduces the issue; the same code executes without issue under MPICH2.

The attached code is compiled with the commands:

mpicxx mpiAccept.cpp -o acceptTest
mpicxx mpiConnect.cpp -o connectTest
mpicxx mpiConnect2.cpp -o connect2Test

I used gcc 4.4.1 and OpenMPI 1.6.3.

The job file contains the following relevant options:

#!/bin/tcsh
#PBS -l walltime=00:05:00
#PBS -l select=3:ncpus=8

and executes the programs with the following commands:

mpirun --tag-output -n 8 ./acceptTest > logConnect1.log &
sleep 5
mpirun --tag-output -n 8 ./connectTest > logConnect2.log &
sleep 5
mpirun --tag-output -n 8 ./connect2Test > logConnect3.log

Note that with 8 cores per program this case executes properly. However, changing the execution commands to the following:

mpirun --tag-output -n 7 ./acceptTest > logConnect1.log &
sleep 5
mpirun --tag-output -n 7 ./connectTest > logConnect2.log &
sleep 5
mpirun --tag-output -n 7 ./connect2Test > logConnect3.log

causes errors of the form:

[hostname:31326] [[14363,0],0]:route_callback tried routing message from [[14363,1],0] to [[14337,1],2]:102, can't find route
[0] func:[higher levels stripped]/opmpi/gcc/4.4.1/openmpi-1.6.3/lib/libopen-rte.so.4(opal_backtrace_print+0x1f) [0x2ad8c884b9ef]
[1] func:[higher levels stripped]/opmpi/gcc/4.4.1/openmpi-1.6.3/lib/openmpi/mca_rml_oob.so(+0x26ba) [0x2ad8ca6f26ba]
[2] func:[higher levels stripped]/opmpi/gcc/4.4.1/openmpi-1.6.3/lib/openmpi/mca_oob_tcp.so(mca_oob_tcp_msg_recv_complete+0x278) [0x2ad8cad1b358]
[3] func:[higher levels stripped]/opmpi/gcc/4.4.1/openmpi-1.6.3/lib/openmpi/mca_oob_tcp.so(+0x980a) [0x2ad8cad1c80a]
[4] func:[higher levels stripped]/opmpi/gcc/4.4.1/openmpi-1.6.3/lib/libopen-rte.so.4(opal_event_base_loop+0x238) [0x2ad8c8835888]
[5] func:mpirun(orterun+0xe80) [0x404bae]
[6] func:mpirun(main+0x20) [0x403ae4]
[7] func:/lib64/libc.so.6(__libc_start_main+0xe6) [0x2ad8c9797bc6]
[8] func:mpirun() [0x403a09]

The point of failure seems to be an MPI_Bcast call. Most of the cores make it through the call and report the broadcast value correctly. However, several cores in the second and third programs (connectTest and connect2Test) hang at the last broadcast, and at least one throws the above error.

I have tried several combinations of core counts and have gotten the following results, given as (# acceptTest cores, # connectTest cores, # connect2Test cores):

Successes:
1 1 1 across 1:3
2 2 2 across 1:6
4 4 4 across 2:8
8 8 8 across 3:8
16 16 16 across 6:8
16 4 4 across 3:8
16 4 16 across 5:8
8 4 4 across 2:8
8 7 7 across 3:8
8 7 6 across 3:8
4 3 2 across 2:8

Failures:
3 3 3 across 2:8
5 5 5 across 2:8
6 6 6 across 3:8
7 7 7 across 3:8
9 9 9 across 4:8
10 10 10 across 4:8
11 11 11 across 5:8
12 12 12 across 5:8
13 13 13 across 5:8
14 14 14 across 6:8
15 15 15 across 6:8
4 4 16 across 3:8
4 4 8 across 2:8

Other notes: In the 6 6 6 across 3:8 case it is consistently cores 0 and 1 of the second program (connectTest) and cores 2 and 3 of the third program (connect2Test) that get blocked. It seems that the first program must have a number of cores that is a power of 2 and at least as large as the core count of each of the other two programs.
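For context, the attached programs are not reproduced here, but the accept/connect/broadcast pattern they exercise looks roughly like the sketch below. This is only an illustration, not the attached code: the file-based port exchange ("sketch.port"), the "accept"/"connect" command-line switch, the single accept, and the merge-then-broadcast step are simplifying assumptions; the real tests chain three separate mpirun jobs as described above.

#include <mpi.h>
#include <cstdio>
#include <cstring>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank = 0;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    // Choose a side; the real tests are separate executables.
    const bool accept_side = (argc > 1 && std::strcmp(argv[1], "accept") == 0);
    char port[MPI_MAX_PORT_NAME] = {0};
    MPI_Comm inter = MPI_COMM_NULL;

    if (accept_side) {
        // Rank 0 opens a port and writes it where the other job can read it
        // (a hypothetical "sketch.port" file; the 'sleep 5' in the job script
        // gives the accept side time to publish before the others start).
        if (rank == 0) {
            MPI_Open_port(MPI_INFO_NULL, port);
            std::FILE *f = std::fopen("sketch.port", "w");
            std::fprintf(f, "%s", port);
            std::fclose(f);
        }
        // Collective over this job's MPI_COMM_WORLD; the port name is only
        // significant at the root (rank 0).
        MPI_Comm_accept(port, MPI_INFO_NULL, 0, MPI_COMM_WORLD, &inter);
    } else {
        if (rank == 0) {
            std::FILE *f = std::fopen("sketch.port", "r");
            std::fgets(port, MPI_MAX_PORT_NAME, f);
            std::fclose(f);
        }
        MPI_Comm_connect(port, MPI_INFO_NULL, 0, MPI_COMM_WORLD, &inter);
    }

    // Merge the intercommunicator (accept group ordered first) and broadcast
    // from the accept side's rank 0; this is roughly the point where the
    // failing runs hang or abort.
    MPI_Comm merged = MPI_COMM_NULL;
    MPI_Intercomm_merge(inter, accept_side ? 0 : 1, &merged);

    int value = (accept_side && rank == 0) ? 42 : -1;
    MPI_Bcast(&value, 1, MPI_INT, 0, merged);
    std::printf("rank %d (%s side) sees %d\n",
                rank, accept_side ? "accept" : "connect", value);

    MPI_Comm_free(&merged);
    MPI_Comm_disconnect(&inter);
    if (accept_side && rank == 0) MPI_Close_port(port);
    MPI_Finalize();
    return 0;
}

Built with mpicxx sketch.cpp -o sketch and launched as two jobs (e.g. mpirun -n 8 ./sketch accept &, then mpirun -n 8 ./sketch connect after a short delay), this mirrors the first accept/connect stage of the tests above.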
Other versions of OpenMPI:

OpenMPI 1.7.2: Fails in all cases during MPI_Comm_accept/MPI_Comm_connect with the following error:

[hostname:16109] [[27626,0],0]:route_callback tried routing message from [[27626,1],0] to [[27557,1],0]:30, can't find route
[0] func:[higher levels stripped]/openmpi-1.7.2built/lib/libopen-pal.so.5(opal_backtrace_print+0x1f) [0x2abd542a876f]
[1] func:[higher levels stripped]/openmpi-1.7.2built/lib/openmpi/mca_rml_oob.so(+0x25f3) [0x2abd5676f5f3]
[2] func:[higher levels stripped]/openmpi-1.7.2built/lib/openmpi/mca_oob_tcp.so(mca_oob_tcp_msg_recv_complete+0x2c0) [0x2abd5697d040]
[3] func:[higher levels stripped]/openmpi-1.7.2built/lib/openmpi/mca_oob_tcp.so(+0xb0a7) [0x2abd5697f0a7]
[4] func:[higher levels stripped]/openmpi-1.7.2built/lib/libopen-pal.so.5(opal_libevent2019_event_base_loop+0x323) [0x2abd542ade63]
[5] func:mpirun(orterun+0xe3b) [0x404c3f]
[6] func:mpirun(main+0x20) [0x403bb4]
[7] func:/lib64/libc.so.6(__libc_start_main+0xe6) [0x2abd55406bc6]
[8] func:mpirun() [0x403ad9]

[hostname:15968] *** Process received signal ***
[hostname:15968] Signal: Segmentation fault (11)
[hostname:15968] Signal code: Address not mapped (1)
[hostname:15968] Failing at address: 0x6ef34010
[hostname:15968] [ 0] /lib64/libpthread.so.0(+0xf6b0) [0x2b75859cf6b0]
[hostname:15968] [ 1] /lib64/libc.so.6(+0x77d0f) [0x2b7585c54d0f]
[hostname:15968] [ 2] /lib64/libc.so.6(__libc_malloc+0x77) [0x2b7585c572d7]
[hostname:15968] [ 3] [higher levels stripped]/openmpi-1.7.2built/lib/openmpi/mca_oob_tcp.so(mca_oob_tcp_msg_recv_handler+0x15f) [0x2b75871716af]
[hostname:15968] [ 4] [higher levels stripped]/openmpi-1.7.2built/lib/openmpi/mca_oob_tcp.so(+0xb078) [0x2b7587174078]
[hostname:15968] [ 5] [higher levels stripped]/openmpi-1.7.2built/lib/libopen-pal.so.5(opal_libevent2019_event_base_loop+0x323) [0x2b7584aa2e63]
[hostname:15968] [ 6] mpirun(orterun+0xe3b) [0x404c3f]
[hostname:15968] [ 7] mpirun(main+0x20) [0x403bb4]
[hostname:15968] [ 8] /lib64/libc.so.6(__libc_start_main+0xe6) [0x2b7585bfbbc6]
[hostname:15968] [ 9] mpirun() [0x403ad9]
[hostname:15968] *** End of error message ***

OpenMPI 1.7.3rc3: Fails in all cases during MPI_Comm_accept/MPI_Comm_connect with the following error:

[hostname:19222] [[19635,0],0]:route_callback tried routing message from [[19635,1],0] to [[19793,1],0]:30, can't find route
[0] func:[higher levels stripped]/openmpi-1.7.3rc3built/lib/libopen-pal.so.6(opal_backtrace_print+0x1f) [0x2b43eb07088f]
[1] func:[higher levels stripped]/openmpi-1.7.3rc3built/lib/openmpi/mca_rml_oob.so(+0x2733) [0x2b43ed55f733]
[2] func:[higher levels stripped]/openmpi-1.7.3rc3built/lib/openmpi/mca_oob_tcp.so(mca_oob_tcp_msg_recv_complete+0x2c0) [0x2b43ed76d440]
[3] func:[higher levels stripped]/openmpi-1.7.3rc3built/lib/openmpi/mca_oob_tcp.so(+0xb4a7) [0x2b43ed76f4a7]
[4] func:[higher levels stripped]/openmpi-1.7.3rc3built/lib/libopen-pal.so.6(opal_libevent2021_event_base_loop+0x88c) [0x2b43eb07844c]
[5] func:mpirun(orterun+0xe25) [0x404c29]
[6] func:mpirun(main+0x20) [0x403bb4]
[7] func:/lib64/libc.so.6(__libc_start_main+0xe6) [0x2b43ec1d3bc6]
[8] func:mpirun() [0x403ad9]

Andrew Burns
Lockheed Martin
Software Engineer
410-306-0409
andrew.j.bur...@us.army.mil
andrew.j.burns35....@mail.mil

Classification: UNCLASSIFIED
Caveats: NONE
<<attachment: test_files.zip>>