I suspect the problem is in Intercomm_merge, as the comment in your file suggests. There were some bug fixes in that code, but they haven't migrated to the 1.7 branch yet (scheduled for 1.7.4).
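For anyone reading this thread without the attached test files, the pattern being exercised is roughly: accept/connect between separate mpirun jobs, MPI_Intercomm_merge on the resulting intercommunicator, then MPI_Bcast over the merged communicator. Below is a minimal accept-side sketch of that pattern. It is not Andrew's actual code; the names are placeholders, and the MPI_Publish_name port exchange is an assumption (across separate mpirun jobs it needs a common name service such as ompi-server).

/* Sketch of an accept-side program (hypothetical; the real mpiAccept.cpp
 * is in the attached zip and may exchange the port differently). */
#include <mpi.h>
#include <cstdio>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* Open a port on the root and publish it so the connecting job
       can look it up (placeholder mechanism). */
    char port[MPI_MAX_PORT_NAME] = {0};
    if (rank == 0) {
        MPI_Open_port(MPI_INFO_NULL, port);
        MPI_Publish_name("acceptTest", MPI_INFO_NULL, port);
    }

    /* Accept a connection from the second mpirun job; the result is an
       intercommunicator spanning both jobs. */
    MPI_Comm inter, merged;
    MPI_Comm_accept(port, MPI_INFO_NULL, 0, MPI_COMM_WORLD, &inter);

    /* The MPI_Intercomm_merge step suspected above: collapse the
       intercommunicator into a single intracommunicator. */
    MPI_Intercomm_merge(inter, 0 /* accept side ordered first */, &merged);

    /* Broadcast over the merged communicator, the call that hangs on
       some ranks in the failing core-count combinations. */
    int value = 42;
    MPI_Bcast(&value, 1, MPI_INT, 0, merged);
    printf("accept side rank %d sees %d\n", rank, value);

    MPI_Comm_free(&merged);
    MPI_Comm_disconnect(&inter);
    if (rank == 0) {
        MPI_Unpublish_name("acceptTest", MPI_INFO_NULL, port);
        MPI_Close_port(port);
    }
    MPI_Finalize();
    return 0;
}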
On Oct 17, 2013, at 6:56 AM, "Burns, Andrew J CTR (US)" <andrew.j.burns35....@mail.mil> wrote:

> Classification: UNCLASSIFIED
> Caveats: NONE
>
> Possibly related to:
> https://svn.open-mpi.org/trac/ompi/ticket/2904
> and
> http://www.open-mpi.org/community/lists/devel/2012/09/11509.php
>
> I am attempting to link communicators from a series of programs together and
> am running into inconsistent behavior when using OpenMPI.
>
> Attached is a minimalistic example of code that will generate this issue; the
> same code executes without issue when using MPICH2.
>
> The attached code is compiled with the commands:
>
> mpicxx mpiAccept.cpp -o acceptTest
> mpicxx mpiConnect.cpp -o connectTest
> mpicxx mpiConnect2.cpp -o connect2Test
>
> I used gcc 4.4.1 and openmpi 1.6.3.
>
> The job file contains the following relevant options:
>
> #!/bin/tcsh
> #PBS -l walltime=00:05:00
> #PBS -l select=3:ncpus=8
>
> and executes the programs using the following commands:
>
> mpirun --tag-output -n 8 ./acceptTest > logConnect1.log &
> sleep 5
> mpirun --tag-output -n 8 ./connectTest > logConnect2.log &
> sleep 5
> mpirun --tag-output -n 8 ./connect2Test > logConnect3.log
>
> Note that the number of cores is 8; this is a case that executes properly.
>
> However, changing the execution commands to the following:
>
> mpirun --tag-output -n 7 ./acceptTest > logConnect1.log &
> sleep 5
> mpirun --tag-output -n 7 ./connectTest > logConnect2.log &
> sleep 5
> mpirun --tag-output -n 7 ./connect2Test > logConnect3.log
>
> causes errors of the form:
>
> [hostname:31326] [[14363,0],0]:route_callback tried routing message from [[14363,1],0] to [[14337,1],2]:102, can't find route
> [0] func:[higher levels stripped]/opmpi/gcc/4.4.1/openmpi-1.6.3/lib/libopen-rte.so.4(opal_backtrace_print+0x1f) [0x2ad8c884b9ef]
> [1] func:[higher levels stripped]/opmpi/gcc/4.4.1/openmpi-1.6.3/lib/openmpi/mca_rml_oob.so(+0x26ba) [0x2ad8ca6f26ba]
> [2] func:[higher levels stripped]/opmpi/gcc/4.4.1/openmpi-1.6.3/lib/openmpi/mca_oob_tcp.so(mca_oob_tcp_msg_recv_complete+0x278) [0x2ad8cad1b358]
> [3] func:[higher levels stripped]/opmpi/gcc/4.4.1/openmpi-1.6.3/lib/openmpi/mca_oob_tcp.so(+0x980a) [0x2ad8cad1c80a]
> [4] func:[higher levels stripped]/opmpi/gcc/4.4.1/openmpi-1.6.3/lib/libopen-rte.so.4(opal_event_base_loop+0x238) [0x2ad8c8835888]
> [5] func:mpirun(orterun+0xe80) [0x404bae]
> [6] func:mpirun(main+0x20) [0x403ae4]
> [7] func:/lib64/libc.so.6(__libc_start_main+0xe6) [0x2ad8c9797bc6]
> [8] func:mpirun() [0x403a09]
>
> The point of failure seems to be in an MPI_Bcast call. Most of the cores make it
> through the call and show the broadcast value as appropriate. However, there are
> several cores on the second and third processes (connectTest and connect2Test)
> that hang at the last broadcast, and at least one throws the above error.
>
> I have tried several combinations of core amounts and have gotten the following results:
>
> Of the form (# acceptTest cores, # connectTest cores, # connect2Test cores)
>
> Successes:
>
> 1 1 1 across 1:3
> 2 2 2 across 1:6
> 4 4 4 across 2:8
> 8 8 8 across 3:8
> 16 16 16 across 6:8
> 16 4 4 across 3:8
> 16 4 16 across 5:8
> 8 4 4 across 2:8
> 8 7 7 across 3:8
> 8 7 6 across 3:8
> 4 3 2 across 2:8
>
> Failures:
>
> 3 3 3 across 2:8
> 5 5 5 across 2:8
> 6 6 6 across 3:8
> 7 7 7 across 3:8
> 9 9 9 across 4:8
> 10 10 10 across 4:8
> 11 11 11 across 5:8
> 12 12 12 across 5:8
> 13 13 13 across 5:8
> 14 14 14 across 6:8
> 15 15 15 across 6:8
> 4 4 16 across 3:8
> 4 4 8 across 2:8
>
> Other notes:
> In the case of 6 6 6 across 3:8, it is consistently cores 0 and 1 of process 2
> and cores 2 and 3 of process 3 that get blocked.
>
> It seems that the first process must have a number of cores that is a power of 2
> and must also have a number of cores greater than the other two processes individually.
>
> Other versions of OpenMPI:
>
> OpenMPI 1.7.2:
> Fails in all cases during MPI_Comm_accept/MPI_Comm_connect with the following error:
>
> [hostname:16109] [[27626,0],0]:route_callback tried routing message from [[27626,1],0] to [[27557,1],0]:30, can't find route
> [0] func:[higher levels stripped]/openmpi-1.7.2built/lib/libopen-pal.so.5(opal_backtrace_print+0x1f) [0x2abd542a876f]
> [1] func:[higher levels stripped]/openmpi-1.7.2built/lib/openmpi/mca_rml_oob.so(+0x25f3) [0x2abd5676f5f3]
> [2] func:[higher levels stripped]/openmpi-1.7.2built/lib/openmpi/mca_oob_tcp.so(mca_oob_tcp_msg_recv_complete+0x2c0) [0x2abd5697d040]
> [3] func:[higher levels stripped]/openmpi-1.7.2built/lib/openmpi/mca_oob_tcp.so(+0xb0a7) [0x2abd5697f0a7]
> [4] func:[higher levels stripped]/openmpi-1.7.2built/lib/libopen-pal.so.5(opal_libevent2019_event_base_loop+0x323) [0x2abd542ade63]
> [5] func:mpirun(orterun+0xe3b) [0x404c3f]
> [6] func:mpirun(main+0x20) [0x403bb4]
> [7] func:/lib64/libc.so.6(__libc_start_main+0xe6) [0x2abd55406bc6]
> [8] func:mpirun() [0x403ad9]
> [hostname:15968] *** Process received signal ***
> [hostname:15968] Signal: Segmentation fault (11)
> [hostname:15968] Signal code: Address not mapped (1)
> [hostname:15968] Failing at address: 0x6ef34010
> [hostname:15968] [ 0] /lib64/libpthread.so.0(+0xf6b0) [0x2b75859cf6b0]
> [hostname:15968] [ 1] /lib64/libc.so.6(+0x77d0f) [0x2b7585c54d0f]
> [hostname:15968] [ 2] /lib64/libc.so.6(__libc_malloc+0x77) [0x2b7585c572d7]
> [hostname:15968] [ 3] [higher levels stripped]/openmpi-1.7.2built/lib/openmpi/mca_oob_tcp.so(mca_oob_tcp_msg_recv_handler+0x15f) [0x2b75871716af]
> [hostname:15968] [ 4] [higher levels stripped]/openmpi-1.7.2built/lib/openmpi/mca_oob_tcp.so(+0xb078) [0x2b7587174078]
> [hostname:15968] [ 5] [higher levels stripped]/openmpi-1.7.2built/lib/libopen-pal.so.5(opal_libevent2019_event_base_loop+0x323) [0x2b7584aa2e63]
> [hostname:15968] [ 6] mpirun(orterun+0xe3b) [0x404c3f]
> [hostname:15968] [ 7] mpirun(main+0x20) [0x403bb4]
> [hostname:15968] [ 8] /lib64/libc.so.6(__libc_start_main+0xe6) [0x2b7585bfbbc6]
> [hostname:15968] [ 9] mpirun() [0x403ad9]
> [hostname:15968] *** End of error message ***
>
> OpenMPI 1.7.3rc:
> Fails in all cases during MPI_Comm_accept/MPI_Comm_connect with the following error:
>
> [hostname:19222] [[19635,0],0]:route_callback tried routing message from [[19635,1],0] to [[19793,1],0]:30, can't find route
> [0] func:[higher levels stripped]/openmpi-1.7.3rc3built/lib/libopen-pal.so.6(opal_backtrace_print+0x1f) [0x2b43eb07088f]
> [1] func:[higher levels stripped]/openmpi-1.7.3rc3built/lib/openmpi/mca_rml_oob.so(+0x2733) [0x2b43ed55f733]
> [2] func:[higher levels stripped]/openmpi-1.7.3rc3built/lib/openmpi/mca_oob_tcp.so(mca_oob_tcp_msg_recv_complete+0x2c0) [0x2b43ed76d440]
> [3] func:[higher levels stripped]/openmpi-1.7.3rc3built/lib/openmpi/mca_oob_tcp.so(+0xb4a7) [0x2b43ed76f4a7]
> [4] func:[higher levels stripped]/openmpi-1.7.3rc3built/lib/libopen-pal.so.6(opal_libevent2021_event_base_loop+0x88c) [0x2b43eb07844c]
> [5] func:mpirun(orterun+0xe25) [0x404c29]
> [6] func:mpirun(main+0x20) [0x403bb4]
> [7] func:/lib64/libc.so.6(__libc_start_main+0xe6) [0x2b43ec1d3bc6]
> [8] func:mpirun() [0x403ad9]
>
> Andrew Burns
> Lockheed Martin
> Software Engineer
> 410-306-0409
> andrew.j.bur...@us.army.mil
> andrew.j.burns35....@mail.mil
>
> Classification: UNCLASSIFIED
> Caveats: NONE
>
> <test files.zip>
> _______________________________________________
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users
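For completeness, the matching connect side of one link in such a chain would look roughly like the sketch below. Again, this is only an illustration with placeholder names, not the code in the attached zip; the real mpiConnect.cpp / mpiConnect2.cpp presumably chain further accepts/connects and merges, which this sketch omits.

/* Sketch of a connect-side program (hypothetical names; assumes the same
 * shared name service as the accept-side sketch above). */
#include <mpi.h>
#include <cstdio>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* Retrieve the port published by the accept side (placeholder
       mechanism; only the root's port name is significant). */
    char port[MPI_MAX_PORT_NAME] = {0};
    if (rank == 0) {
        MPI_Lookup_name("acceptTest", MPI_INFO_NULL, port);
    }

    /* Connect to the accepting job and merge into one intracommunicator,
       ordering this group after the accept side. */
    MPI_Comm inter, merged;
    MPI_Comm_connect(port, MPI_INFO_NULL, 0, MPI_COMM_WORLD, &inter);
    MPI_Intercomm_merge(inter, 1, &merged);

    /* Receive the broadcast from rank 0 of the merged communicator
       (rank 0 of the accept side); the reported hangs occur at this
       kind of call on the second and third jobs. */
    int value = 0;
    MPI_Bcast(&value, 1, MPI_INT, 0, merged);
    printf("connect side rank %d sees %d\n", rank, value);

    MPI_Comm_free(&merged);
    MPI_Comm_disconnect(&inter);
    MPI_Finalize();
    return 0;
}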