Classification: UNCLASSIFIED
Caveats: NONE

Possibly related to:
https://svn.open-mpi.org/trac/ompi/ticket/2904
and
http://www.open-mpi.org/community/lists/devel/2012/09/11509.php

I am attempting to link communicators from a series of programs together and am
running into inconsistent behavior when using Open MPI.

Attached is a minimal example of code that reproduces this issue; the same code
executes without issue under MPICH2.

The attached code is compiled with the commands:

mpicxx mpiAccept.cpp -o acceptTest
mpicxx mpiConnect.cpp -o connectTest
mpicxx mpiConnect2.cpp -o connect2Test

I used gcc 4.4.1 and Open MPI 1.6.3.
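
For reference, the test programs exercise the MPI_Open_port / MPI_Comm_accept /
MPI_Comm_connect pattern (the failure reports below occur in or after those calls).
A minimal sketch of the two sides follows so the discussion is self-contained; this
is not the attached code itself: the file-based port exchange ("port.txt") is only
an illustrative placeholder, and the real programs chain three jobs together rather
than a single accept/connect pair.

Accept side sketch:

#include <mpi.h>
#include <cstdio>

int main(int argc, char** argv)
{
    MPI_Init(&argc, &argv);

    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    char port[MPI_MAX_PORT_NAME] = {0};
    if (rank == 0) {
        MPI_Open_port(MPI_INFO_NULL, port);          // server opens a port
        std::FILE* f = std::fopen("port.txt", "w");  // publish it out of band
        if (f) {
            std::fprintf(f, "%s\n", port);
            std::fclose(f);
        }
    }

    // Every rank of this mpirun participates in the accept; only the root's
    // port name is significant.
    MPI_Comm inter;
    MPI_Comm_accept(port, MPI_INFO_NULL, 0, MPI_COMM_WORLD, &inter);

    // ... merge/broadcast with the connecting job(s), then clean up ...

    MPI_Comm_disconnect(&inter);
    if (rank == 0) MPI_Close_port(port);
    MPI_Finalize();
    return 0;
}

Connect side sketch:

#include <mpi.h>
#include <cstdio>
#include <cstring>

int main(int argc, char** argv)
{
    MPI_Init(&argc, &argv);

    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    char port[MPI_MAX_PORT_NAME] = {0};
    if (rank == 0) {
        std::FILE* f = std::fopen("port.txt", "r");  // read the published port name
        if (f) {
            std::fgets(port, MPI_MAX_PORT_NAME, f);
            std::fclose(f);
        }
        port[std::strcspn(port, "\n")] = '\0';       // strip trailing newline
    }

    MPI_Comm inter;
    MPI_Comm_connect(port, MPI_INFO_NULL, 0, MPI_COMM_WORLD, &inter);

    // ... merge/broadcast with the accepting job, then clean up ...

    MPI_Comm_disconnect(&inter);
    MPI_Finalize();
    return 0;
}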


The job file contains the following relevant options:

#!/bin/tcsh
#PBS -l walltime=00:05:00
#PBS -l select=3:ncpus=8


and executes the programs using the following commands:


mpirun --tag-output -n 8 ./acceptTest > logConnect1.log &

sleep 5

mpirun --tag-output -n 8 ./connectTest > logConnect2.log &

sleep 5

mpirun --tag-output -n 8 ./connect2Test > logConnect3.log


Note that the number of cores per program is 8; this is a case that executes properly.

However, changing the execution commands to the following:


mpirun --tag-output -n 7 ./acceptTest > logConnect1.log &

sleep 5

mpirun --tag-output -n 7 ./connectTest > logConnect2.log &

sleep 5

mpirun --tag-output -n 7 ./connect2Test > logConnect3.log


causes errors of the form:

[hostname:31326] [[14363,0],0]:route_callback tried routing message from [[14363,1],0] to [[14337,1],2]:102, can't find route
[0] func:[higher levels stripped]/opmpi/gcc/4.4.1/openmpi-1.6.3/lib/libopen-rte.so.4(opal_backtrace_print+0x1f) [0x2ad8c884b9ef]
[1] func:[higher levels stripped]/opmpi/gcc/4.4.1/openmpi-1.6.3/lib/openmpi/mca_rml_oob.so(+0x26ba) [0x2ad8ca6f26ba]
[2] func:[higher levels stripped]/opmpi/gcc/4.4.1/openmpi-1.6.3/lib/openmpi/mca_oob_tcp.so(mca_oob_tcp_msg_recv_complete+0x278) [0x2ad8cad1b358]
[3] func:[higher levels stripped]/opmpi/gcc/4.4.1/openmpi-1.6.3/lib/openmpi/mca_oob_tcp.so(+0x980a) [0x2ad8cad1c80a]
[4] func:[higher levels stripped]/opmpi/gcc/4.4.1/openmpi-1.6.3/lib/libopen-rte.so.4(opal_event_base_loop+0x238) [0x2ad8c8835888]
[5] func:mpirun(orterun+0xe80) [0x404bae]
[6] func:mpirun(main+0x20) [0x403ae4]
[7] func:/lib64/libc.so.6(__libc_start_main+0xe6) [0x2ad8c9797bc6]
[8] func:mpirun() [0x403a09]

The point of failure seems to be an MPI_Bcast call. Most of the cores make it
through the call and show the broadcast value as expected. However, several cores
on the second and third programs (connectTest and connect2Test) hang at the last
broadcast, and at least one throws the above error.
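
For context, if the programs follow the usual dynamic-process pattern, the
broadcast in question runs over an intracommunicator produced by
MPI_Intercomm_merge on the accept/connect intercommunicator. A sketch of that step
(the function and variable names here are illustrative, not taken from the
attached code):

#include <mpi.h>
#include <cstdio>

// Merge the accept/connect intercommunicator and broadcast one int from rank 0
// of the merged communicator. "high" is 0 on the accept side and 1 on the
// connect side so the two groups are ordered consistently after the merge.
static int merge_and_bcast(MPI_Comm inter, int high)
{
    MPI_Comm merged;
    MPI_Intercomm_merge(inter, high, &merged);

    int my_rank = 0, value = -1;
    MPI_Comm_rank(merged, &my_rank);
    if (my_rank == 0)
        value = 42;                       // value originating on the root

    // The reported hang/error shows up at a broadcast like this one.
    MPI_Bcast(&value, 1, MPI_INT, 0, merged);
    std::printf("merged rank %d got %d\n", my_rank, value);

    MPI_Comm_free(&merged);
    return value;
}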


I have tried several combinations of core counts and have gotten the following
results.

Results are of the form (# acceptTest cores, # connectTest cores, # connect2Test
cores); "across N:M" indicates the PBS allocation used (select=N:ncpus=M).

Successes:

1 1 1 across 1:3
2 2 2 across 1:6
4 4 4 across 2:8
8 8 8 across 3:8
16 16 16 across 6:8
16 4 4 across 3:8
16 4 16 across 5:8
8 4 4 across 2:8
8 7 7 across 3:8
8 7 6 across 3:8
4 3 2 across 2:8

Failures:
3 3 3 across 2:8
5 5 5 across 2:8
6 6 6 across 3:8
7 7 7 across 3:8
9 9 9 across 4:8
10 10 10 across 4:8
11 11 11 across 5:8
12 12 12 across 5:8
13 13 13 across 5:8
14 14 14 across 6:8
15 15 15 across 6:8
4 4 16 across 3:8
4 4 8 across 2:8


Other notes:
In the case of 6 6 6 across 3:8, it is consistently cores 0 and 1 of process 2
(connectTest) and cores 2 and 3 of process 3 (connect2Test) that get blocked.

It seems that the first process (acceptTest) must have a number of cores that is a
power of 2 and at least as large as the core count of each of the other two
processes.


Other versions of Open MPI:

Open MPI 1.7.2:
Fails in all cases during MPI_Comm_accept/MPI_Comm_connect with the following 
error:

[hostname:16109] [[27626,0],0]:route_callback tried routing message from [[27626,1],0] to [[27557,1],0]:30, can't find route
[0] func:[higher levels stripped]/openmpi-1.7.2built/lib/libopen-pal.so.5(opal_backtrace_print+0x1f) [0x2abd542a876f]
[1] func:[higher levels stripped]/openmpi-1.7.2built/lib/openmpi/mca_rml_oob.so(+0x25f3) [0x2abd5676f5f3]
[2] func:[higher levels stripped]/openmpi-1.7.2built/lib/openmpi/mca_oob_tcp.so(mca_oob_tcp_msg_recv_complete+0x2c0) [0x2abd5697d040]
[3] func:[higher levels stripped]/openmpi-1.7.2built/lib/openmpi/mca_oob_tcp.so(+0xb0a7) [0x2abd5697f0a7]
[4] func:[higher levels stripped]/openmpi-1.7.2built/lib/libopen-pal.so.5(opal_libevent2019_event_base_loop+0x323) [0x2abd542ade63]
[5] func:mpirun(orterun+0xe3b) [0x404c3f]
[6] func:mpirun(main+0x20) [0x403bb4]
[7] func:/lib64/libc.so.6(__libc_start_main+0xe6) [0x2abd55406bc6]
[8] func:mpirun() [0x403ad9]
[hostname:15968] *** Process received signal ***
[hostname:15968] Signal: Segmentation fault (11)
[hostname:15968] Signal code: Address not mapped (1)
[hostname:15968] Failing at address: 0x6ef34010
[hostname:15968] [ 0] /lib64/libpthread.so.0(+0xf6b0) [0x2b75859cf6b0]
[hostname:15968] [ 1] /lib64/libc.so.6(+0x77d0f) [0x2b7585c54d0f]
[hostname:15968] [ 2] /lib64/libc.so.6(__libc_malloc+0x77) [0x2b7585c572d7]
[hostname:15968] [ 3] [higher levels stripped]/openmpi-1.7.2built/lib/openmpi/mca_oob_tcp.so(mca_oob_tcp_msg_recv_handler+0x15f) [0x2b75871716af]
[hostname:15968] [ 4] [higher levels stripped]/openmpi-1.7.2built/lib/openmpi/mca_oob_tcp.so(+0xb078) [0x2b7587174078]
[hostname:15968] [ 5] [higher levels stripped]/openmpi-1.7.2built/lib/libopen-pal.so.5(opal_libevent2019_event_base_loop+0x323) [0x2b7584aa2e63]
[hostname:15968] [ 6] mpirun(orterun+0xe3b) [0x404c3f]
[hostname:15968] [ 7] mpirun(main+0x20) [0x403bb4]
[hostname:15968] [ 8] /lib64/libc.so.6(__libc_start_main+0xe6) [0x2b7585bfbbc6]
[hostname:15968] [ 9] mpirun() [0x403ad9]
[hostname:15968] *** End of error message ***


Open MPI 1.7.3rc:
Fails in all cases during MPI_Comm_accept/MPI_Comm_connect with the following 
error:

[hostname:19222] [[19635,0],0]:route_callback tried routing message from [[19635,1],0] to [[19793,1],0]:30, can't find route
[0] func:[higher levels stripped]/openmpi-1.7.3rc3built/lib/libopen-pal.so.6(opal_backtrace_print+0x1f) [0x2b43eb07088f]
[1] func:[higher levels stripped]/openmpi-1.7.3rc3built/lib/openmpi/mca_rml_oob.so(+0x2733) [0x2b43ed55f733]
[2] func:[higher levels stripped]/openmpi-1.7.3rc3built/lib/openmpi/mca_oob_tcp.so(mca_oob_tcp_msg_recv_complete+0x2c0) [0x2b43ed76d440]
[3] func:[higher levels stripped]/openmpi-1.7.3rc3built/lib/openmpi/mca_oob_tcp.so(+0xb4a7) [0x2b43ed76f4a7]
[4] func:[higher levels stripped]/openmpi-1.7.3rc3built/lib/libopen-pal.so.6(opal_libevent2021_event_base_loop+0x88c) [0x2b43eb07844c]
[5] func:mpirun(orterun+0xe25) [0x404c29]
[6] func:mpirun(main+0x20) [0x403bb4]
[7] func:/lib64/libc.so.6(__libc_start_main+0xe6) [0x2b43ec1d3bc6]
[8] func:mpirun() [0x403ad9]


Andrew Burns
Lockheed Martin
Software Engineer
410-306-0409
andrew.j.bur...@us.army.mil
andrew.j.burns35....@mail.mil

Classification: UNCLASSIFIED
Caveats: NONE


<<attachment: test_files.zip>>
