It's the --enable-progress-threads flag that causes the problem - we don't really support that yet. Maybe someday.

Take that out and you should be okay, with the caveats expressed on the OMPI web site (i.e., not everything works with threads yet).
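
In other words, reconfigure and rebuild the affected nodes without that flag. A sketch, with the prefix guessed from the /opt/openmpi paths in your backtrace - adjust to your setup:

  ./configure --prefix=/opt/openmpi --enable-mpi-threads
  make all install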

On Jun 11, 2009, at 4:56 AM, jody wrote:

More info:
I checked and found that not all nodes are equal:
the ones that don't work have mpi-threads *and* progress-threads enabled,
whereas the ones that work have only mpi-threads enabled.
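
For reference, each node's build settings can be checked with ompi_info:

  ompi_info | grep -i thread

On the failing nodes this should show a line like "Thread support: posix (mpi: yes, progress: yes)" - the exact wording may vary between versions.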

Is there a problem when both thread types are enabled?

Jody

On Thu, Jun 11, 2009 at 12:19 PM, jody<jody....@gmail.com> wrote:
Hi

After updating all my nodes to Open MPI 1.3.2 (with --enable-mpi-threads), some of them fail to execute a simple MPI test program - they seem to hang.
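
For reference, MPITest is essentially just an init/print/finalize program along these lines (a sketch reconstructed from its output below, not the exact source):

#include <stdio.h>
#include <mpi.h>

int main(int argc, char *argv[])
{
    int rank, size, len;
    char host[MPI_MAX_PROCESSOR_NAME];

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    MPI_Get_processor_name(host, &len);

    /* produces the "[nano_06]I am #0/2" lines seen in the output below */
    printf("[%s]I am #%d/%d\n", host, rank, size);

    MPI_Finalize();
    return 0;
}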
With --debug-daemons the application seems to execute (two lines of output) but hangs before returning:

[jody@plankton neander]$ mpirun -np 2 --host nano_06 --debug-daemons ./MPITest
Daemon was launched on nano_06 - beginning to initialize
Daemon [[44301,0],1] checking in as pid 5166 on host nano_06
Daemon [[44301,0],1] not using static ports
[nano_06:05166] [[44301,0],1] orted: up and running - waiting for commands!
[plankton:23859] [[44301,0],0] node[0].name plankton daemon 0 arch ffca0200
[plankton:23859] [[44301,0],0] node[1].name nano_06 daemon 1 arch ffca0200
[plankton:23859] [[44301,0],0] orted_cmd: received add_local_procs
[nano_06:05166] [[44301,0],1] node[0].name plankton daemon 0 arch ffca0200
[nano_06:05166] [[44301,0],1] node[1].name nano_06 daemon 1 arch ffca0200
[nano_06:05166] [[44301,0],1] orted_cmd: received add_local_procs
[nano_06:05166] [[44301,0],1] orted_recv: received sync+nidmap from local proc [[44301,1],0]
[nano_06:05166] [[44301,0],1] orted_recv: received sync+nidmap from local proc [[44301,1],1]
[nano_06:05166] [[44301,0],1] orted_cmd: received collective data cmd
[plankton:23859] [[44301,0],0] orted_cmd: received collective data cmd
[plankton:23859] [[44301,0],0] orted_cmd: received message_local_procs
[plankton:23859] [[44301,0],0] orted_cmd: received collective data cmd
[plankton:23859] [[44301,0],0] orted_cmd: received message_local_procs
[nano_06:05166] [[44301,0],1] orted_cmd: received collective data cmd
[nano_06:05166] [[44301,0],1] orted_cmd: received message_local_procs
[nano_06:05166] [[44301,0],1] orted_cmd: received collective data cmd
[nano_06:05166] [[44301,0],1] orted_cmd: received collective data cmd
[nano_06:05166] [[44301,0],1] orted_cmd: received message_local_procs
[nano_06]I am #0/2
[nano_06:05166] [[44301,0],1] orted_cmd: received collective data cmd
[nano_06]I am #1/2
[plankton:23859] [[44301,0],0] orted_cmd: received collective data cmd
[plankton:23859] [[44301,0],0] orted_cmd: received message_local_procs
[nano_06:05166] [[44301,0],1] orted_cmd: received collective data cmd
[nano_06:05166] [[44301,0],1] orted_cmd: received message_local_procs
[nano_06:05166] [[44301,0],1] orted_recv: received sync from local proc [[44301,1],1]
[nano_06:05166] [[44301,0],1] orted_recv: received sync from local proc [[44301,1],0]
 (Here it hangs)

Some don't even get to execute:
[jody@plankton neander]$ mpirun -np 2 --host nano_01 --debug-daemons ./MPITest
Daemon was launched on nano_01 - beginning to initialize
Daemon [[44293,0],1] checking in as pid 5044 on host nano_01
Daemon [[44293,0],1] not using static ports
[nano_01:05044] [[44293,0],1] orted: up and running - waiting for commands!
[plankton:23867] [[44293,0],0] node[0].name plankton daemon 0 arch ffca0200
[plankton:23867] [[44293,0],0] node[1].name nano_01 daemon 1 arch ffca0200
[plankton:23867] [[44293,0],0] orted_cmd: received add_local_procs
[nano_01:05044] [[44293,0],1] node[0].name plankton daemon 0 arch ffca0200
[nano_01:05044] [[44293,0],1] node[1].name nano_01 daemon 1 arch ffca0200
[nano_01:05044] [[44293,0],1] orted_cmd: received add_local_procs
[nano_01:05044] [[44293,0],1] orted_recv: received sync+nidmap from local proc [[44293,1],0]
[nano_01:05044] [[44293,0],1] orted_cmd: received collective data cmd
 (Here it hangs)

When I call one of the bad nodes with only 1 process and --debug-daemons, it works fine (output & clean completion), but without --debug-daemons it hangs.
But sometimes there is a crash (not always reproducible):

[jody@plankton neander]$ mpirun -np 1 --host nano_04 --debug-daemons ./MPITest
Daemon was launched on nano_04 - beginning to initialize
Daemon [[44431,0],1] checking in as pid 5333 on host nano_04
Daemon [[44431,0],1] not using static ports
[plankton:23985] [[44431,0],0] node[0].name plankton daemon 0 arch ffca0200
[plankton:23985] [[44431,0],0] node[1].name nano_04 daemon 1 arch ffca0200
[plankton:23985] [[44431,0],0] orted_cmd: received add_local_procs
[nano_04:05333] [[44431,0],1] orted: up and running - waiting for commands!
[nano_04:05333] [[44431,0],1] node[0].name plankton daemon 0 arch ffca0200
[nano_04:05333] [[44431,0],1] node[1].name nano_04 daemon 1 arch ffca0200
[nano_04:05333] [[44431,0],1] orted_cmd: received add_local_procs
[nano_04:05333] [[44431,0],1] orted_recv: received sync+nidmap from local proc [[44431,1],0]
[nano_04:05333] [[44431,0],1] orted_cmd: received collective data cmd
[plankton:23985] [[44431,0],0] orted_cmd: received collective data cmd
[plankton:23985] [[44431,0],0] orted_cmd: received message_local_procs
[nano_04:05333] [[44431,0],1] orted_cmd: received message_local_procs
[nano_04:05333] [[44431,0],1] orted_cmd: received collective data cmd
[plankton:23985] [[44431,0],0] orted_cmd: received collective data cmd
[plankton:23985] [[44431,0],0] orted_cmd: received message_local_procs
[nano_04:05333] [[44431,0],1] orted_cmd: received message_local_procs
[nano_04:05333] [[44431,0],1] orted_cmd: received collective data cmd
[nano_04]I am #0/1
[plankton:23985] [[44431,0],0] orted_cmd: received collective data cmd
[plankton:23985] [[44431,0],0] orted_cmd: received message_local_procs
[nano_04:05333] [[44431,0],1] orted_cmd: received message_local_procs
[nano_04:05333] [[44431,0],1] orted_recv: received sync from local proc [[44431,1],0]
[nano_04:05333] [[44431,0],1] orted_cmd: received iof_complete cmd
[nano_04:05333] [[44431,0],1] orted_cmd: received waitpid_fired cmd
[plankton:23985] [[44431,0],0] orted_cmd: received exit
[nano_04:05333] [[44431,0],1] orted_cmd: received exit
[nano_04:05333] [[44431,0],1] orted: finalizing
[nano_04:05333] *** Process received signal ***
[nano_04:05333] Signal: Segmentation fault (11)
[nano_04:05333] Signal code: Address not mapped (1)
[nano_04:05333] Failing at address: 0xb7493e20
[nano_04:05333] [ 0] [0xffffe40c]
[nano_04:05333] [ 1] /opt/openmpi/lib/libopen-pal.so.0(opal_event_loop+0x27) [0xb7e65417]
[nano_04:05333] [ 2] /opt/openmpi/lib/libopen-pal.so.0(opal_event_dispatch+0x1e) [0xb7e6543e]
[nano_04:05333] [ 3] /opt/openmpi/lib/libopen-rte.so.0(orte_daemon+0x761) [0xb7ed3d71]
[nano_04:05333] [ 4] orted [0x80487b4]
[nano_04:05333] [ 5] /lib/libc.so.6(__libc_start_main+0xdc) [0xb7cc060c]
[nano_04:05333] [ 6] orted [0x8048691]
[nano_04:05333] *** End of error message ***




Is that perhaps a consequence of configuring with --enable-mpi-threads and --enable-progress-threads?

Thank You
 Jody

