Ralph Castain wrote:
> 3. remove the threaded launch scenario and just call comm_spawn in a
> loop.
>
Below is what Open MPI does if I put the MPI_Comm_spawn() in a loop and
drive the rest of the communication in a thread: basically it freezes in
the same place, as far as I can see. A stripped-down sketch of the loop
follows, then the full mpirun output.

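The pattern, reduced to the essentials, is roughly this (a minimal sketch,
not my actual code: the host name, working directory and 'testslave.sh'
command are the ones from the run below; error handling and the real
payload are omitted, and the thread body is just a placeholder receive):

#include <mpi.h>
#include <pthread.h>
#include <stdio.h>

/* Thread that drives the communication with one spawned slave. */
static void *comm_thread(void *arg)
{
    MPI_Comm intercomm = *(MPI_Comm *)arg;
    int result;
    /* In the real program more traffic goes over the intercommunicator;
       a single receive is enough to show the pattern. */
    MPI_Recv(&result, 1, MPI_INT, 0, 0, intercomm, MPI_STATUS_IGNORE);
    return NULL;
}

int main(int argc, char **argv)
{
    int provided;
    MPI_Init_thread(&argc, &argv, MPI_THREAD_MULTIPLE, &provided);
    if (provided < MPI_THREAD_MULTIPLE)
        fprintf(stderr, "warning: MPI_THREAD_MULTIPLE not provided\n");

    for (int job = 0; job < 100000; job++) {
        MPI_Comm  intercomm;
        MPI_Info  info;
        pthread_t tid;

        /* Target host and working directory, as printed in the log. */
        MPI_Info_create(&info);
        MPI_Info_set(info, "host", "cluster4.tekno-soft.it");
        MPI_Info_set(info, "wdir", "/data/roberto/MPI/TestOpenMPI");

        /* One slave per iteration; the freeze shows up around here,
           after the first task has been spawned. */
        MPI_Comm_spawn("testslave.sh", MPI_ARGV_NULL, 1, info, 0,
                       MPI_COMM_SELF, &intercomm, MPI_ERRCODES_IGNORE);

        /* The rest of the communication runs in a thread. */
        pthread_create(&tid, NULL, comm_thread, &intercomm);
        pthread_join(&tid, NULL);

        MPI_Info_free(&info);
        MPI_Comm_disconnect(&intercomm);
    }

    MPI_Finalize();
    return 0;
}

And this is the run:
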
[roberto@master TestOpenMPI]$ mpirun --verbose --debug-daemons -wdir "`pwd`" -np 1 testmaster 100000 $PBS_NODEFILE
[master.tekno-soft.it:13161] [[2618,0],0] orted_cmd: received add_local_procs
[master.tekno-soft.it:13161] [[2618,0],0] node[0].name master daemon 0 arch ffc91200
[master.tekno-soft.it:13161] [[2618,0],0] node[1].name cluster4 daemon INVALID arch ffc91200
[master.tekno-soft.it:13161] [[2618,0],0] node[2].name cluster3 daemon INVALID arch ffc91200
[master.tekno-soft.it:13161] [[2618,0],0] node[3].name cluster2 daemon INVALID arch ffc91200
[master.tekno-soft.it:13161] [[2618,0],0] node[4].name cluster1 daemon INVALID arch ffc91200
Initializing MPI ...
[master.tekno-soft.it:13161] [[2618,0],0] orted_recv: received sync+nidmap from local proc [[2618,1],0]
[master.tekno-soft.it:13161] [[2618,0],0] orted_cmd: received collective data cmd
[master.tekno-soft.it:13161] [[2618,0],0] orted_cmd: received message_local_procs
[master.tekno-soft.it:13161] [[2618,0],0] orted_cmd: received collective data cmd
[master.tekno-soft.it:13161] [[2618,0],0] orted_cmd: received message_local_procs
Loading the node's ring from file '/var/torque/aux//929.master.tekno-soft.it'
... adding node #1 host is 'cluster4.tekno-soft.it'
... adding node #2 host is 'cluster3.tekno-soft.it'
... adding node #3 host is 'cluster2.tekno-soft.it'
... adding node #4 host is 'cluster1.tekno-soft.it'
A 4 node's ring has been made
At least one node is available, let's start to distribute 100000 job across 4 nodes!!!
Setting up the host as 'cluster4.tekno-soft.it'
Setting the work directory as '/data/roberto/MPI/TestOpenMPI'
Spawning a task 'testslave.sh' on node 'cluster4.tekno-soft.it'
Daemon was launched on cluster4.tekno-soft.it - beginning to initialize
Daemon [[2618,0],1] checking in as pid 10029 on host cluster4.tekno-soft.it
Daemon [[2618,0],1] not using static ports
[cluster4.tekno-soft.it:10029] [[2618,0],1] orted: up and running - waiting for commands!
[master.tekno-soft.it:13161] [[2618,0],0] orted_cmd: received add_local_procs
[master.tekno-soft.it:13161] [[2618,0],0] node[0].name master daemon 0 arch ffc91200
[master.tekno-soft.it:13161] [[2618,0],0] node[1].name cluster4 daemon 1 arch ffc91200
[master.tekno-soft.it:13161] [[2618,0],0] node[2].name cluster3 daemon INVALID arch ffc91200
[master.tekno-soft.it:13161] [[2618,0],0] node[3].name cluster2 daemon INVALID arch ffc91200
[master.tekno-soft.it:13161] [[2618,0],0] node[4].name cluster1 daemon INVALID arch ffc91200
[cluster4.tekno-soft.it:10029] [[2618,0],1] orted_cmd: received add_local_procs
[cluster4.tekno-soft.it:10029] [[2618,0],1] node[0].name master daemon 0 arch ffc91200
[cluster4.tekno-soft.it:10029] [[2618,0],1] node[1].name cluster4 daemon 1 arch ffc91200
[cluster4.tekno-soft.it:10029] [[2618,0],1] node[2].name cluster3 daemon INVALID arch ffc91200
[cluster4.tekno-soft.it:10029] [[2618,0],1] node[3].name cluster2 daemon INVALID arch ffc91200
[cluster4.tekno-soft.it:10029] [[2618,0],1] node[4].name cluster1 daemon INVALID arch ffc91200
[cluster4.tekno-soft.it:10029] [[2618,0],1] orted_recv: received sync+nidmap from local proc [[2618,2],0]
[cluster4.tekno-soft.it:10029] [[2618,0],1] orted_cmd: received collective data cmd
[master.tekno-soft.it:13161] [[2618,0],0] orted_cmd: received collective data cmd
[master.tekno-soft.it:13161] [[2618,0],0] orted_cmd: received message_local_procs
[cluster4.tekno-soft.it:10029] [[2618,0],1] orted_cmd: received message_local_procs
[cluster4.tekno-soft.it:10029] [[2618,0],1] orted_cmd: received collective data cmd
[master.tekno-soft.it:13161] [[2618,0],0] orted_cmd: received collective data cmd
[master.tekno-soft.it:13161] [[2618,0],0] orted_cmd: received message_local_procs
[cluster4.tekno-soft.it:10029] [[2618,0],1] orted_cmd: received message_local_procs
Killed
[cluster4.tekno-soft.it:10029] [[2618,0],1] routed:binomial: Connection to lifeline [[2618,0],0] lost
[roberto@master TestOpenMPI]$
