Ralph Castain wrote:
> Interesting. I ran a loop calling comm_spawn 1000 times without a
> problem. I suspect it is the threading that is causing the trouble here.
I think so! My guess is that, at a low level, there is some trouble when
handling *concurrent* orted spawning.
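
For reference, that serial loop boils down to something like the sketch below
(just a minimal sketch, not Ralph's actual test code; the "slave" executable
name is a placeholder):

    #include <mpi.h>

    int main(int argc, char **argv)
    {
        MPI_Comm child;
        int i;

        MPI_Init(&argc, &argv);
        for (i = 0; i < 1000; i++) {
            /* spawn a single child, then tear the connection down again;
             * the child is expected to call MPI_Comm_disconnect() as well */
            MPI_Comm_spawn("slave", MPI_ARGV_NULL, 1, MPI_INFO_NULL,
                           0, MPI_COMM_SELF, &child, MPI_ERRCODES_IGNORE);
            MPI_Comm_disconnect(&child);
        }
        MPI_Finalize();
        return 0;
    }

The suspicion above is about what happens when several of those spawns, and
therefore several orted launches, can be in flight at once.
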
> You are welcome to send me the code. You can find my loop code in your
> code distribution under orte/test/mpi - look for loop_spawn and
> loop_child.
In the attached code the spawning logic currently sits in a loop in the
main() of the testmaster, so it is completely unthreaded, at least until
MPI_Comm_spawn() finishes its work. If you would like to test multithreaded
spawning, comment out the NodeThread_spawnSlave() call in the main loop and
uncomment the same call in NodeThread_threadMain(). Finally, if you want
multithreaded spawning that is serialized against a mutex, then also
uncomment the pthread_mutex_lock()/unlock() calls in NodeThread_threadMain().
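
In short, the thread body looks roughly like the sketch below (only a sketch
of the structure, reusing the function names above; the per-node argument
type is made up here, the real code is in the attached tarball):

    static pthread_mutex_t spawn_mutex = PTHREAD_MUTEX_INITIALIZER;

    static void *NodeThread_threadMain(void *arg)
    {
        node_t *node = (node_t *)arg;      /* hypothetical per-node context */

        /* uncomment the lock/unlock pair to serialize the concurrent spawns */
        /* pthread_mutex_lock(&spawn_mutex); */
        NodeThread_spawnSlave(node);       /* wraps MPI_Comm_spawn() */
        /* pthread_mutex_unlock(&spawn_mutex); */

        /* ... talk to the slave, then MPI_Comm_disconnect() ... */
        return NULL;
    }

(For the threaded variants MPI has to be initialized with MPI_Init_thread()
requesting MPI_THREAD_MULTIPLE, since making MPI calls from several threads
at once is not allowed otherwise.)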

This code runs *without* any trouble on the HP MPI implementation. It does
not work so well on the MPICH2 trunk version, due to two problems: the limit
of ~24.4K context ids, and/or a race in poll() while waiting for termination
in MPI_Comm_disconnect() running concurrently with an MPI_Comm_spawn().
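
The race shows up because the teardown can run in one thread while another
thread is inside MPI_Comm_spawn(), roughly like this (again only a hedged
sketch with made-up names, not literally the code from the tarball):

    /* reaper: tears down a finished slave, possibly while another
     * thread is busy inside MPI_Comm_spawn() */
    static void *reap_slave(void *arg)
    {
        MPI_Comm *child = (MPI_Comm *)arg;
        int status;

        /* wait for the slave (rank 0 of the remote group) to report back */
        MPI_Recv(&status, 1, MPI_INT, 0, 0, *child, MPI_STATUS_IGNORE);
        MPI_Comm_disconnect(child);  /* where MPICH2 appears to race in poll() */
        return NULL;
    }

And since every spawn creates a fresh inter-communicator, a long run like the
100000-job loop eventually hits the context id limit as well if the
implementation never recycles the ids.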

>
> Ralph
>
> On Oct 3, 2008, at 9:11 AM, Roberto Fichera wrote:
>
>> Ralph Castain wrote:
>>>
>>> On Oct 3, 2008, at 7:14 AM, Roberto Fichera wrote:
>>>
>>>> Ralph Castain wrote:
>>>>> I committed something to the trunk yesterday. Given the complexity of
>>>>> the fix, I don't plan to bring it over to the 1.3 branch until
>>>>> sometime mid-to-end next week so it can be adequately tested.
>>>> Ok! So it means that I can check out from the SVN/trunk to get your fix,
>>>> right?
>>>
>>> Yes, though note that I don't claim it is fully correct yet. Still
>>> needs testing. However, I have tested it a fair amount and it seems
>>> okay.
>>>
>>> If you do test it, please let me know how it goes.
>> I ran my test on the svn/trunk below:
>>
>>                Open MPI: 1.4a1r19677
>>   Open MPI SVN revision: r19677
>>   Open MPI release date: Unreleased developer copy
>>                Open RTE: 1.4a1r19677
>>   Open RTE SVN revision: r19677
>>   Open RTE release date: Unreleased developer copy
>>                    OPAL: 1.4a1r19677
>>       OPAL SVN revision: r19677
>>       OPAL release date: Unreleased developer copy
>>            Ident string: 1.4a1r19677
>>
>> Below is the output; it seems to freeze just after the second spawn.
>>
>> [roberto@master TestOpenMPI]$ mpirun --verbose --debug-daemons
>> --hostfile $PBS_NODEFILE -wdir "`pwd`" -np 1 testmaster 100000
>> $PBS_NODEFILE
>> [master.tekno-soft.it:30063] [[19516,0],0] orted_cmd: received
>> add_local_procs
>> [master.tekno-soft.it:30063] [[19516,0],0] node[0].name master daemon 0
>> arch ffc91200
>> [master.tekno-soft.it:30063] [[19516,0],0] node[1].name cluster4 daemon
>> INVALID arch ffc91200
>> [master.tekno-soft.it:30063] [[19516,0],0] node[2].name cluster3 daemon
>> INVALID arch ffc91200
>> [master.tekno-soft.it:30063] [[19516,0],0] node[3].name cluster2 daemon
>> INVALID arch ffc91200
>> [master.tekno-soft.it:30063] [[19516,0],0] node[4].name cluster1 daemon
>> INVALID arch ffc91200
>> Initializing MPI ...
>> [master.tekno-soft.it:30063] [[19516,0],0] orted_recv: received
>> sync+nidmap from local proc [[19516,1],0]
>> [master.tekno-soft.it:30063] [[19516,0],0] orted_cmd: received
>> collective data cmd
>> [master.tekno-soft.it:30063] [[19516,0],0] orted_cmd: received
>> message_local_procs
>> [master.tekno-soft.it:30063] [[19516,0],0] orted_cmd: received
>> collective data cmd
>> [master.tekno-soft.it:30063] [[19516,0],0] orted_cmd: received
>> message_local_procs
>> Loading the node's ring from file
>> '/var/torque/aux//932.master.tekno-soft.it'
>> ... adding node #1 host is 'cluster4.tekno-soft.it'
>> ... adding node #2 host is 'cluster3.tekno-soft.it'
>> ... adding node #3 host is 'cluster2.tekno-soft.it'
>> ... adding node #4 host is 'cluster1.tekno-soft.it'
>> A 4 node's ring has been made
>> At least one node is available, let's start to distribute 100000 job
>> across 4 nodes!!!
>> Setting up the host as 'cluster4.tekno-soft.it'
>> Setting the work directory as '/data/roberto/MPI/TestOpenMPI'
>> Spawning a task 'testslave.sh' on node 'cluster4.tekno-soft.it'
>> Daemon was launched on cluster4.tekno-soft.it - beginning to initialize
>> Daemon [[19516,0],1] checking in as pid 25123 on host
>> cluster4.tekno-soft.it
>> Daemon [[19516,0],1] not using static ports
>> [cluster4.tekno-soft.it:25123] [[19516,0],1] orted: up and running -
>> waiting for commands!
>> [master.tekno-soft.it:30063] [[19516,0],0] orted_cmd: received
>> add_local_procs
>> [master.tekno-soft.it:30063] [[19516,0],0] node[0].name master daemon 0
>> arch ffc91200
>> [master.tekno-soft.it:30063] [[19516,0],0] node[1].name cluster4 daemon
>> 1 arch ffc91200
>> [master.tekno-soft.it:30063] [[19516,0],0] node[2].name cluster3 daemon
>> INVALID arch ffc91200
>> [master.tekno-soft.it:30063] [[19516,0],0] node[3].name cluster2 daemon
>> INVALID arch ffc91200
>> [master.tekno-soft.it:30063] [[19516,0],0] node[4].name cluster1 daemon
>> INVALID arch ffc91200
>> [cluster4.tekno-soft.it:25123] [[19516,0],1] orted_cmd: received
>> add_local_procs
>> [cluster4.tekno-soft.it:25123] [[19516,0],1] node[0].name master daemon
>> 0 arch ffc91200
>> [cluster4.tekno-soft.it:25123] [[19516,0],1] node[1].name cluster4
>> daemon 1 arch ffc91200
>> [cluster4.tekno-soft.it:25123] [[19516,0],1] node[2].name cluster3
>> daemon INVALID arch ffc91200
>> [cluster4.tekno-soft.it:25123] [[19516,0],1] node[3].name cluster2
>> daemon INVALID arch ffc91200
>> [cluster4.tekno-soft.it:25123] [[19516,0],1] node[4].name cluster1
>> daemon INVALID arch ffc91200
>> [cluster4.tekno-soft.it:25123] [[19516,0],1] orted_recv: received
>> sync+nidmap from local proc [[19516,2],0]
>> [master.tekno-soft.it:30063] [[19516,0],0] orted_cmd: received
>> collective data cmd
>> [master.tekno-soft.it:30063] [[19516,0],0] orted_cmd: received
>> message_local_procs
>> [cluster4.tekno-soft.it:25123] [[19516,0],1] orted_cmd: received
>> collective data cmd
>> [cluster4.tekno-soft.it:25123] [[19516,0],1] orted_cmd: received
>> message_local_procs
>> [cluster4.tekno-soft.it:25123] [[19516,0],1] orted_cmd: received
>> collective data cmd
>> [master.tekno-soft.it:30063] [[19516,0],0] orted_cmd: received
>> collective data cmd
>> [master.tekno-soft.it:30063] [[19516,0],0] orted_cmd: received
>> message_local_procs
>> [cluster4.tekno-soft.it:25123] [[19516,0],1] orted_cmd: received
>> message_local_procs
>>
>> Let me know if you need my test program.
>>
>>>
>>> Thanks
>>> Ralph
>>>
>>>>
>>>>> Ralph
>>>>>
>>>>> On Oct 3, 2008, at 5:02 AM, Roberto Fichera wrote:
>>>>>
>>>>>> Ralph Castain wrote:
>>>>>>> Actually, it just occurred to me that you may be seeing a
>>>>>>> problem in
>>>>>>> comm_spawn itself that I am currently chasing down. It is in the
>>>>>>> 1.3
>>>>>>> branch and has to do with comm_spawning procs on subsets of nodes
>>>>>>> (instead of across all nodes). Could be related to this - you might
>>>>>>> want to give me a chance to complete the fix. I have identified the
>>>>>>> problem and should have it fixed later today in our trunk -
>>>>>>> probably
>>>>>>> won't move to the 1.3 branch for several days.
>>>>>> Do you have any news about the above fix? Is the fix already
>>>>>> available for testing?
>>>>

Attachment: testspawn.tar.bz2
Description: application/bzip
