Hi,

After updating all my nodes to Open MPI 1.3.2 (built with --enable-mpi-threads), some of them fail to execute a simple MPI test program - they seem to hang. With --debug-daemons the application seems to run (two lines of output) but hangs before returning:
[jody@aplankton neander]$ mpirun -np 2 --host nano_06 --debug-daemons ./MPITest
Daemon was launched on nano_06 - beginning to initialize
Daemon [[44301,0],1] checking in as pid 5166 on host nano_06
Daemon [[44301,0],1] not using static ports
[nano_06:05166] [[44301,0],1] orted: up and running - waiting for commands!
[plankton:23859] [[44301,0],0] node[0].name plankton daemon 0 arch ffca0200
[plankton:23859] [[44301,0],0] node[1].name nano_06 daemon 1 arch ffca0200
[plankton:23859] [[44301,0],0] orted_cmd: received add_local_procs
[nano_06:05166] [[44301,0],1] node[0].name plankton daemon 0 arch ffca0200
[nano_06:05166] [[44301,0],1] node[1].name nano_06 daemon 1 arch ffca0200
[nano_06:05166] [[44301,0],1] orted_cmd: received add_local_procs
[nano_06:05166] [[44301,0],1] orted_recv: received sync+nidmap from local proc [[44301,1],0]
[nano_06:05166] [[44301,0],1] orted_recv: received sync+nidmap from local proc [[44301,1],1]
[nano_06:05166] [[44301,0],1] orted_cmd: received collective data cmd
[plankton:23859] [[44301,0],0] orted_cmd: received collective data cmd
[plankton:23859] [[44301,0],0] orted_cmd: received message_local_procs
[plankton:23859] [[44301,0],0] orted_cmd: received collective data cmd
[plankton:23859] [[44301,0],0] orted_cmd: received message_local_procs
[nano_06:05166] [[44301,0],1] orted_cmd: received collective data cmd
[nano_06:05166] [[44301,0],1] orted_cmd: received message_local_procs
[nano_06:05166] [[44301,0],1] orted_cmd: received collective data cmd
[nano_06:05166] [[44301,0],1] orted_cmd: received collective data cmd
[nano_06:05166] [[44301,0],1] orted_cmd: received message_local_procs
[nano_06]I am #0/2
[nano_06:05166] [[44301,0],1] orted_cmd: received collective data cmd
[nano_06]I am #1/2
[plankton:23859] [[44301,0],0] orted_cmd: received collective data cmd
[plankton:23859] [[44301,0],0] orted_cmd: received message_local_procs
[nano_06:05166] [[44301,0],1] orted_cmd: received collective data cmd
[nano_06:05166] [[44301,0],1] orted_cmd: received message_local_procs
[nano_06:05166] [[44301,0],1] orted_recv: received sync from local proc [[44301,1],1]
[nano_06:05166] [[44301,0],1] orted_recv: received sync from local proc [[44301,1],0]
(Here it hangs)

Some don't even get to execute:

[jody@plankton neander]$ mpirun -np 2 --host nano_01 --debug-daemons ./MPITest
Daemon was launched on nano_01 - beginning to initialize
Daemon [[44293,0],1] checking in as pid 5044 on host nano_01
Daemon [[44293,0],1] not using static ports
[nano_01:05044] [[44293,0],1] orted: up and running - waiting for commands!
[plankton:23867] [[44293,0],0] node[0].name plankton daemon 0 arch ffca0200
[plankton:23867] [[44293,0],0] node[1].name nano_01 daemon 1 arch ffca0200
[plankton:23867] [[44293,0],0] orted_cmd: received add_local_procs
[nano_01:05044] [[44293,0],1] node[0].name plankton daemon 0 arch ffca0200
[nano_01:05044] [[44293,0],1] node[1].name nano_01 daemon 1 arch ffca0200
[nano_01:05044] [[44293,0],1] orted_cmd: received add_local_procs
[nano_01:05044] [[44293,0],1] orted_recv: received sync+nidmap from local proc [[44293,1],0]
[nano_01:05044] [[44293,0],1] orted_cmd: received collective data cmd
(Here it hangs)

When I run the test on one of the bad nodes with only 1 processor and --debug-daemons, it works fine (output & clean completion), but without --debug-daemons it hangs.
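For reference, the "[nano_06]I am #0/2" lines above are the test program printing its host, rank and communicator size; a simplified sketch of such a program (illustrative only, not the exact MPITest source) looks like this:

#include <stdio.h>
#include <mpi.h>

/* Minimal illustrative test program: each rank reports host, rank and size. */
int main(int argc, char **argv)
{
    int rank, size, len;
    char host[MPI_MAX_PROCESSOR_NAME];

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    MPI_Get_processor_name(host, &len);

    printf("[%s]I am #%d/%d\n", host, rank, size);

    MPI_Finalize();
    return 0;
}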
But sometimes there is a crash (not always reproducible):

[jody@plankton neander]$ mpirun -np 1 --host nano_04 --debug-daemons ./MPITest
Daemon was launched on nano_04 - beginning to initialize
Daemon [[44431,0],1] checking in as pid 5333 on host nano_04
Daemon [[44431,0],1] not using static ports
[plankton:23985] [[44431,0],0] node[0].name plankton daemon 0 arch ffca0200
[plankton:23985] [[44431,0],0] node[1].name nano_04 daemon 1 arch ffca0200
[plankton:23985] [[44431,0],0] orted_cmd: received add_local_procs
[nano_04:05333] [[44431,0],1] orted: up and running - waiting for commands!
[nano_04:05333] [[44431,0],1] node[0].name plankton daemon 0 arch ffca0200
[nano_04:05333] [[44431,0],1] node[1].name nano_04 daemon 1 arch ffca0200
[nano_04:05333] [[44431,0],1] orted_cmd: received add_local_procs
[nano_04:05333] [[44431,0],1] orted_recv: received sync+nidmap from local proc [[44431,1],0]
[nano_04:05333] [[44431,0],1] orted_cmd: received collective data cmd
[plankton:23985] [[44431,0],0] orted_cmd: received collective data cmd
[plankton:23985] [[44431,0],0] orted_cmd: received message_local_procs
[nano_04:05333] [[44431,0],1] orted_cmd: received message_local_procs
[nano_04:05333] [[44431,0],1] orted_cmd: received collective data cmd
[plankton:23985] [[44431,0],0] orted_cmd: received collective data cmd
[plankton:23985] [[44431,0],0] orted_cmd: received message_local_procs
[nano_04:05333] [[44431,0],1] orted_cmd: received message_local_procs
[nano_04:05333] [[44431,0],1] orted_cmd: received collective data cmd
[nano_04]I am #0/1
[plankton:23985] [[44431,0],0] orted_cmd: received collective data cmd
[plankton:23985] [[44431,0],0] orted_cmd: received message_local_procs
[nano_04:05333] [[44431,0],1] orted_cmd: received message_local_procs
[nano_04:05333] [[44431,0],1] orted_recv: received sync from local proc [[44431,1],0]
[nano_04:05333] [[44431,0],1] orted_cmd: received iof_complete cmd
[nano_04:05333] [[44431,0],1] orted_cmd: received waitpid_fired cmd
[plankton:23985] [[44431,0],0] orted_cmd: received exit
[nano_04:05333] [[44431,0],1] orted_cmd: received exit
[nano_04:05333] [[44431,0],1] orted: finalizing
[nano_04:05333] *** Process received signal ***
[nano_04:05333] Signal: Segmentation fault (11)
[nano_04:05333] Signal code: Address not mapped (1)
[nano_04:05333] Failing at address: 0xb7493e20
[nano_04:05333] [ 0] [0xffffe40c]
[nano_04:05333] [ 1] /opt/openmpi/lib/libopen-pal.so.0(opal_event_loop+0x27) [0xb7e65417]
[nano_04:05333] [ 2] /opt/openmpi/lib/libopen-pal.so.0(opal_event_dispatch+0x1e) [0xb7e6543e]
[nano_04:05333] [ 3] /opt/openmpi/lib/libopen-rte.so.0(orte_daemon+0x761) [0xb7ed3d71]
[nano_04:05333] [ 4] orted [0x80487b4]
[nano_04:05333] [ 5] /lib/libc.so.6(__libc_start_main+0xdc) [0xb7cc060c]
[nano_04:05333] [ 6] orted [0x8048691]
[nano_04:05333] *** End of error message ***

Is that perhaps a consequence of configuring with --enable-mpi-threads and --enable-progress-threads?

Thank you
Jody
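P.S. In case it is relevant: a quick way to see what thread level an --enable-mpi-threads build actually grants would be something like the sketch below (illustrative only, not part of MPITest as it stands):

#include <stdio.h>
#include <mpi.h>

/* Illustrative only: request MPI_THREAD_MULTIPLE and report the
 * thread level the library actually provides. */
int main(int argc, char **argv)
{
    int provided;

    MPI_Init_thread(&argc, &argv, MPI_THREAD_MULTIPLE, &provided);
    if (provided < MPI_THREAD_MULTIPLE) {
        printf("requested MPI_THREAD_MULTIPLE, got level %d\n", provided);
    }
    MPI_Finalize();
    return 0;
}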