Ralph Castain wrote:
> 3. remove the threaded launch scenario and just call comm_spawn in a
> loop.

Below you can see how Open MPI behaves if I put the MPI_Comm_spawn() in a loop and drive the rest of the communication in a thread. Basically it freezes in the same place, as shown in the output below.
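For context, the spawn loop is essentially the following (a minimal sketch, not my actual testmaster source: the host, working directory, and slave script names are the ones that appear in the log, and the node-list handling, worker threads, and error checks are omitted):

/* Minimal sketch, not the real testmaster: node-list handling, threads,
 * and error checks omitted.  Host, wdir, and slave script are the ones
 * that appear in the log below. */
#include <mpi.h>

int main(int argc, char *argv[])
{
    int provided;
    MPI_Init_thread(&argc, &argv, MPI_THREAD_MULTIPLE, &provided);

    for (int i = 0; i < 4; i++) {        /* one spawn per node in the ring */
        MPI_Info info;
        MPI_Comm child;

        MPI_Info_create(&info);
        /* "host" and "wdir" are reserved MPI info keys for spawn */
        MPI_Info_set(info, "host", "cluster4.tekno-soft.it");
        MPI_Info_set(info, "wdir", "/data/roberto/MPI/TestOpenMPI");

        MPI_Comm_spawn("testslave.sh", MPI_ARGV_NULL, 1, info, 0,
                       MPI_COMM_SELF, &child, MPI_ERRCODES_IGNORE);
        MPI_Info_free(&info);

        /* the intercommunicator is then handed to a separate thread
         * that drives the traffic with the slave */
    }

    MPI_Finalize();
    return 0;
}

Here is the mpirun output: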
[roberto@master TestOpenMPI]$ mpirun --verbose --debug-daemons -wdir "`pwd`" -np 1 testmaster 100000 $PBS_NODEFILE
[master.tekno-soft.it:13161] [[2618,0],0] orted_cmd: received add_local_procs
[master.tekno-soft.it:13161] [[2618,0],0] node[0].name master daemon 0 arch ffc91200
[master.tekno-soft.it:13161] [[2618,0],0] node[1].name cluster4 daemon INVALID arch ffc91200
[master.tekno-soft.it:13161] [[2618,0],0] node[2].name cluster3 daemon INVALID arch ffc91200
[master.tekno-soft.it:13161] [[2618,0],0] node[3].name cluster2 daemon INVALID arch ffc91200
[master.tekno-soft.it:13161] [[2618,0],0] node[4].name cluster1 daemon INVALID arch ffc91200
Initializing MPI ...
[master.tekno-soft.it:13161] [[2618,0],0] orted_recv: received sync+nidmap from local proc [[2618,1],0]
[master.tekno-soft.it:13161] [[2618,0],0] orted_cmd: received collective data cmd
[master.tekno-soft.it:13161] [[2618,0],0] orted_cmd: received message_local_procs
[master.tekno-soft.it:13161] [[2618,0],0] orted_cmd: received collective data cmd
[master.tekno-soft.it:13161] [[2618,0],0] orted_cmd: received message_local_procs
Loading the node's ring from file '/var/torque/aux//929.master.tekno-soft.it'
... adding node #1 host is 'cluster4.tekno-soft.it'
... adding node #2 host is 'cluster3.tekno-soft.it'
... adding node #3 host is 'cluster2.tekno-soft.it'
... adding node #4 host is 'cluster1.tekno-soft.it'
A 4 node's ring has been made
At least one node is available, let's start to distribute 100000 job across 4 nodes!!!
Setting up the host as 'cluster4.tekno-soft.it'
Setting the work directory as '/data/roberto/MPI/TestOpenMPI'
Spawning a task 'testslave.sh' on node 'cluster4.tekno-soft.it'
Daemon was launched on cluster4.tekno-soft.it - beginning to initialize
Daemon [[2618,0],1] checking in as pid 10029 on host cluster4.tekno-soft.it
Daemon [[2618,0],1] not using static ports
[cluster4.tekno-soft.it:10029] [[2618,0],1] orted: up and running - waiting for commands!
[master.tekno-soft.it:13161] [[2618,0],0] orted_cmd: received add_local_procs
[master.tekno-soft.it:13161] [[2618,0],0] node[0].name master daemon 0 arch ffc91200
[master.tekno-soft.it:13161] [[2618,0],0] node[1].name cluster4 daemon 1 arch ffc91200
[master.tekno-soft.it:13161] [[2618,0],0] node[2].name cluster3 daemon INVALID arch ffc91200
[master.tekno-soft.it:13161] [[2618,0],0] node[3].name cluster2 daemon INVALID arch ffc91200
[master.tekno-soft.it:13161] [[2618,0],0] node[4].name cluster1 daemon INVALID arch ffc91200
[cluster4.tekno-soft.it:10029] [[2618,0],1] orted_cmd: received add_local_procs
[cluster4.tekno-soft.it:10029] [[2618,0],1] node[0].name master daemon 0 arch ffc91200
[cluster4.tekno-soft.it:10029] [[2618,0],1] node[1].name cluster4 daemon 1 arch ffc91200
[cluster4.tekno-soft.it:10029] [[2618,0],1] node[2].name cluster3 daemon INVALID arch ffc91200
[cluster4.tekno-soft.it:10029] [[2618,0],1] node[3].name cluster2 daemon INVALID arch ffc91200
[cluster4.tekno-soft.it:10029] [[2618,0],1] node[4].name cluster1 daemon INVALID arch ffc91200
[cluster4.tekno-soft.it:10029] [[2618,0],1] orted_recv: received sync+nidmap from local proc [[2618,2],0]
[cluster4.tekno-soft.it:10029] [[2618,0],1] orted_cmd: received collective data cmd
[master.tekno-soft.it:13161] [[2618,0],0] orted_cmd: received collective data cmd
[master.tekno-soft.it:13161] [[2618,0],0] orted_cmd: received message_local_procs
[cluster4.tekno-soft.it:10029] [[2618,0],1] orted_cmd: received message_local_procs
[cluster4.tekno-soft.it:10029] [[2618,0],1] orted_cmd: received collective data cmd
[master.tekno-soft.it:13161] [[2618,0],0] orted_cmd: received collective data cmd
[master.tekno-soft.it:13161] [[2618,0],0] orted_cmd: received message_local_procs
[cluster4.tekno-soft.it:10029] [[2618,0],1] orted_cmd: received message_local_procs
Killed
[cluster4.tekno-soft.it:10029] [[2618,0],1] routed:binomial: Connection to lifeline [[2618,0],0] lost
[roberto@master TestOpenMPI]$