More info: I checked and found that not all nodes are equal: the ones that don't work have mpi-threads *and* progress-threads enabled, whereas the ones that work have only mpi-threads enabled.
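For anyone wanting to make the same per-node comparison: one way to check how a given node's Open MPI build was configured is to query `ompi_info` on that node (the exact wording of its output varies between versions, so the grep pattern below is a best-effort sketch, not guaranteed verbatim):

```shell
# Run on each node (e.g. via ssh) to compare thread-related build settings.
# ompi_info reports a "Thread support" line reflecting the
# --enable-mpi-threads / --enable-progress-threads configure flags.
ompi_info | grep -i thread
```

Comparing this line across the working and hanging nodes should confirm whether the progress-threads setting is really the distinguishing factor.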
Is there a problem when both thread types are enabled?

Jody

On Thu, Jun 11, 2009 at 12:19 PM, jody <jody....@gmail.com> wrote:
> Hi
>
> After updating all my nodes to Open-MPI 1.3.2 (with --enable-mpi-threads)
> some of them fail to execute a simple MPI test program - they seem to hang.
> With --debug-daemons the application seems to execute (two lines of
> output) but hangs before returning:
>
> [jody@aplankton neander]$ mpirun -np 2 --host nano_06 --debug-daemons ./MPITest
> Daemon was launched on nano_06 - beginning to initialize
> Daemon [[44301,0],1] checking in as pid 5166 on host nano_06
> Daemon [[44301,0],1] not using static ports
> [nano_06:05166] [[44301,0],1] orted: up and running - waiting for commands!
> [plankton:23859] [[44301,0],0] node[0].name plankton daemon 0 arch ffca0200
> [plankton:23859] [[44301,0],0] node[1].name nano_06 daemon 1 arch ffca0200
> [plankton:23859] [[44301,0],0] orted_cmd: received add_local_procs
> [nano_06:05166] [[44301,0],1] node[0].name plankton daemon 0 arch ffca0200
> [nano_06:05166] [[44301,0],1] node[1].name nano_06 daemon 1 arch ffca0200
> [nano_06:05166] [[44301,0],1] orted_cmd: received add_local_procs
> [nano_06:05166] [[44301,0],1] orted_recv: received sync+nidmap from local proc [[44301,1],0]
> [nano_06:05166] [[44301,0],1] orted_recv: received sync+nidmap from local proc [[44301,1],1]
> [nano_06:05166] [[44301,0],1] orted_cmd: received collective data cmd
> [plankton:23859] [[44301,0],0] orted_cmd: received collective data cmd
> [plankton:23859] [[44301,0],0] orted_cmd: received message_local_procs
> [plankton:23859] [[44301,0],0] orted_cmd: received collective data cmd
> [plankton:23859] [[44301,0],0] orted_cmd: received message_local_procs
> [nano_06:05166] [[44301,0],1] orted_cmd: received collective data cmd
> [nano_06:05166] [[44301,0],1] orted_cmd: received message_local_procs
> [nano_06:05166] [[44301,0],1] orted_cmd: received collective data cmd
> [nano_06:05166] [[44301,0],1] orted_cmd: received collective data cmd
> [nano_06:05166] [[44301,0],1] orted_cmd: received message_local_procs
> [nano_06]I am #0/2
> [nano_06:05166] [[44301,0],1] orted_cmd: received collective data cmd
> [nano_06]I am #1/2
> [plankton:23859] [[44301,0],0] orted_cmd: received collective data cmd
> [plankton:23859] [[44301,0],0] orted_cmd: received message_local_procs
> [nano_06:05166] [[44301,0],1] orted_cmd: received collective data cmd
> [nano_06:05166] [[44301,0],1] orted_cmd: received message_local_procs
> [nano_06:05166] [[44301,0],1] orted_recv: received sync from local proc [[44301,1],1]
> [nano_06:05166] [[44301,0],1] orted_recv: received sync from local proc [[44301,1],0]
> (Here it hangs)
>
> Some don't even get to execute:
> [jody@plankton neander]$ mpirun -np 2 --host nano_01 --debug-daemons ./MPITest
> Daemon was launched on nano_01 - beginning to initialize
> Daemon [[44293,0],1] checking in as pid 5044 on host nano_01
> Daemon [[44293,0],1] not using static ports
> [nano_01:05044] [[44293,0],1] orted: up and running - waiting for commands!
> [plankton:23867] [[44293,0],0] node[0].name plankton daemon 0 arch ffca0200
> [plankton:23867] [[44293,0],0] node[1].name nano_01 daemon 1 arch ffca0200
> [plankton:23867] [[44293,0],0] orted_cmd: received add_local_procs
> [nano_01:05044] [[44293,0],1] node[0].name plankton daemon 0 arch ffca0200
> [nano_01:05044] [[44293,0],1] node[1].name nano_01 daemon 1 arch ffca0200
> [nano_01:05044] [[44293,0],1] orted_cmd: received add_local_procs
> [nano_01:05044] [[44293,0],1] orted_recv: received sync+nidmap from local proc [[44293,1],0]
> [nano_01:05044] [[44293,0],1] orted_cmd: received collective data cmd
> (Here it hangs)
>
> When I call one of the bad nodes with only 1 processor and --debug-daemons,
> it works fine (output & clean completion), but without --debug-daemons it hangs.
> But sometimes there is a crash (not always reproducible):
>
> [jody@plankton neander]$ mpirun -np 1 --host nano_04 --debug-daemons ./MPITest
> Daemon was launched on nano_04 - beginning to initialize
> Daemon [[44431,0],1] checking in as pid 5333 on host nano_04
> Daemon [[44431,0],1] not using static ports
> [plankton:23985] [[44431,0],0] node[0].name plankton daemon 0 arch ffca0200
> [plankton:23985] [[44431,0],0] node[1].name nano_04 daemon 1 arch ffca0200
> [plankton:23985] [[44431,0],0] orted_cmd: received add_local_procs
> [nano_04:05333] [[44431,0],1] orted: up and running - waiting for commands!
> [nano_04:05333] [[44431,0],1] node[0].name plankton daemon 0 arch ffca0200
> [nano_04:05333] [[44431,0],1] node[1].name nano_04 daemon 1 arch ffca0200
> [nano_04:05333] [[44431,0],1] orted_cmd: received add_local_procs
> [nano_04:05333] [[44431,0],1] orted_recv: received sync+nidmap from local proc [[44431,1],0]
> [nano_04:05333] [[44431,0],1] orted_cmd: received collective data cmd
> [plankton:23985] [[44431,0],0] orted_cmd: received collective data cmd
> [plankton:23985] [[44431,0],0] orted_cmd: received message_local_procs
> [nano_04:05333] [[44431,0],1] orted_cmd: received message_local_procs
> [nano_04:05333] [[44431,0],1] orted_cmd: received collective data cmd
> [plankton:23985] [[44431,0],0] orted_cmd: received collective data cmd
> [plankton:23985] [[44431,0],0] orted_cmd: received message_local_procs
> [nano_04:05333] [[44431,0],1] orted_cmd: received message_local_procs
> [nano_04:05333] [[44431,0],1] orted_cmd: received collective data cmd
> [nano_04]I am #0/1
> [plankton:23985] [[44431,0],0] orted_cmd: received collective data cmd
> [plankton:23985] [[44431,0],0] orted_cmd: received message_local_procs
> [nano_04:05333] [[44431,0],1] orted_cmd: received message_local_procs
> [nano_04:05333] [[44431,0],1] orted_recv: received sync from local proc [[44431,1],0]
> [nano_04:05333] [[44431,0],1] orted_cmd: received iof_complete cmd
> [nano_04:05333] [[44431,0],1] orted_cmd: received waitpid_fired cmd
> [plankton:23985] [[44431,0],0] orted_cmd: received exit
> [nano_04:05333] [[44431,0],1] orted_cmd: received exit
> [nano_04:05333] [[44431,0],1] orted: finalizing
> [nano_04:05333] *** Process received signal ***
> [nano_04:05333] Signal: Segmentation fault (11)
> [nano_04:05333] Signal code: Address not mapped (1)
> [nano_04:05333] Failing at address: 0xb7493e20
> [nano_04:05333] [ 0] [0xffffe40c]
> [nano_04:05333] [ 1] /opt/openmpi/lib/libopen-pal.so.0(opal_event_loop+0x27) [0xb7e65417]
> [nano_04:05333] [ 2] /opt/openmpi/lib/libopen-pal.so.0(opal_event_dispatch+0x1e) [0xb7e6543e]
> [nano_04:05333] [ 3] /opt/openmpi/lib/libopen-rte.so.0(orte_daemon+0x761) [0xb7ed3d71]
> [nano_04:05333] [ 4] orted [0x80487b4]
> [nano_04:05333] [ 5] /lib/libc.so.6(__libc_start_main+0xdc) [0xb7cc060c]
> [nano_04:05333] [ 6] orted [0x8048691]
> [nano_04:05333] *** End of error message ***
>
> Is that perhaps a consequence of configuring with --enable-mpi-threads
> and --enable-progress-threads?
>
> Thank You
> Jody
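For readers trying to reproduce this: the source of ./MPITest was not posted, but judging from the "[nano_06]I am #0/2" lines in the logs, a minimal equivalent would look roughly like the sketch below (a hypothetical reconstruction, not the original program). Note that in the logs the application output is printed and both ranks have reported sync, so the hang appears to occur during finalize/daemon teardown rather than in the program's own work:

```c
/* Hypothetical reconstruction of the MPITest program from this thread;
 * the actual source was not posted. Output format is modeled on the
 * "[nano_06]I am #0/2" lines in the logs above. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank, size, len;
    char host[MPI_MAX_PROCESSOR_NAME];

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    MPI_Get_processor_name(host, &len);

    printf("[%s]I am #%d/%d\n", host, rank, size);

    /* The reported hangs and the orted segfault occur after this point,
     * i.e. around MPI_Finalize and daemon shutdown. */
    MPI_Finalize();
    return 0;
}
```

Build and run with `mpicc MPITest.c -o MPITest` and the `mpirun` invocations shown above; if the progress-threads build is the culprit, even this trivial program should be enough to trigger the hang.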