Ralph Castain wrote:
> Interesting. I ran a loop calling comm_spawn 1000 times without a
> problem. I suspect it is the threading that is causing the trouble here.

I think so! My guess is that there may be some low-level trouble in handling *concurrent* orted spawning.

> You are welcome to send me the code. You can find my loop code in your
> code distribution under orte/test/mpi - look for loop_spawn and
> loop_child.

In the attached code the spawning logic currently sits in a loop in the testmaster's main(), so it is completely unthreaded, at least until MPI_Comm_spawn() finishes its work. If you would like to test multithreaded spawning, comment out the NodeThread_spawnSlave() call in the main loop and uncomment the same call in NodeThread_threadMain(). Finally, if you want multithreaded spawning serialized behind a mutex, also uncomment the pthread_mutex_lock()/unlock() calls in NodeThread_threadMain().
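Just to make the structure clearer, the calling pattern boils down to something like the sketch below. This is only an illustration, not the attached code: the NodeThread_spawnSlave() and NodeThread_threadMain() names are the real ones, but their arguments and bodies here are made up, and the "host" MPI_Info key and the four-host list are just one way to place each slave.

/*
 * Sketch only - the real testmaster is in the attached tarball.  The
 * NodeThread_* names come from it, but these signatures and the MPI_Info
 * handling are assumptions made for illustration.
 */
#include <mpi.h>
#include <pthread.h>

static pthread_mutex_t spawn_mutex = PTHREAD_MUTEX_INITIALIZER;

/* Spawn one slave on the given host and return its intercommunicator. */
static void NodeThread_spawnSlave(const char *host, MPI_Comm *slave)
{
    MPI_Info info;
    MPI_Info_create(&info);
    MPI_Info_set(info, "host", (char *)host);     /* place the slave */
    MPI_Comm_spawn("testslave.sh", MPI_ARGV_NULL, 1, info,
                   0, MPI_COMM_SELF, slave, MPI_ERRCODES_IGNORE);
    MPI_Info_free(&info);
}

/* Per-node worker thread. */
static void *NodeThread_threadMain(void *arg)
{
    const char *host = arg;
    MPI_Comm slave = MPI_COMM_NULL;
    (void)host;

    /* Variant 2: concurrent spawning - uncomment the spawn here and
     * comment it out in main().  Variant 3: also uncomment the mutex
     * so the concurrent spawns are serialized. */
    /* pthread_mutex_lock(&spawn_mutex);    */
    /* NodeThread_spawnSlave(host, &slave); */
    /* pthread_mutex_unlock(&spawn_mutex);  */

    /* ... drive the slave, then drop the connection (the slave side is
     * assumed to disconnect its parent communicator as well) ... */
    if (slave != MPI_COMM_NULL)
        MPI_Comm_disconnect(&slave);
    return NULL;
}

int main(int argc, char **argv)
{
    int provided;
    MPI_Init_thread(&argc, &argv, MPI_THREAD_MULTIPLE, &provided);

    const char *hosts[] = { "cluster1", "cluster2", "cluster3", "cluster4" };
    enum { NHOSTS = 4 };
    pthread_t threads[NHOSTS];

    for (int i = 0; i < NHOSTS; i++) {
        /* Variant 1 (as posted): spawn from the main loop, so the spawn
         * itself is completely unthreaded. */
        MPI_Comm slave;
        NodeThread_spawnSlave(hosts[i], &slave);
        MPI_Comm_disconnect(&slave);   /* slave disconnects its parent too */

        pthread_create(&threads[i], NULL, NodeThread_threadMain,
                       (void *)hosts[i]);
    }
    for (int i = 0; i < NHOSTS; i++)
        pthread_join(threads[i], NULL);

    MPI_Finalize();
    return 0;
}

The only thing that changes between the variants is where MPI_Comm_spawn() ends up being called from: the main loop (serial), the threads (concurrent), or the threads behind a mutex (concurrent but serialized).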
This code runs *without* any trouble under the HP MPI implementation. It does not work so well with the MPICH2 trunk version, due to two problems: a limit of ~24.4K context ids, and/or a race in poll() while waiting for termination in MPI_Comm_disconnect() concurrently with an MPI_Comm_spawn() (see the sketch at the end of this mail).

>
> Ralph
>
> On Oct 3, 2008, at 9:11 AM, Roberto Fichera wrote:
>
>> Ralph Castain wrote:
>>>
>>> On Oct 3, 2008, at 7:14 AM, Roberto Fichera wrote:
>>>
>>>> Ralph Castain wrote:
>>>>> I committed something to the trunk yesterday. Given the complexity of
>>>>> the fix, I don't plan to bring it over to the 1.3 branch until
>>>>> sometime mid-to-end next week so it can be adequately tested.
>>>> Ok! So it means that I can checkout from the SVN/trunk to get you fix,
>>>> right?
>>>
>>> Yes, though note that I don't claim it is fully correct yet. Still
>>> needs testing. However, I have tested it a fair amount and it seems
>>> okay.
>>>
>>> If you do test it, please let me know how it goes.
>> I execute my test on the svn/trunk below
>>
>> Open MPI: 1.4a1r19677
>> Open MPI SVN revision: r19677
>> Open MPI release date: Unreleased developer copy
>> Open RTE: 1.4a1r19677
>> Open RTE SVN revision: r19677
>> Open RTE release date: Unreleased developer copy
>> OPAL: 1.4a1r19677
>> OPAL SVN revision: r19677
>> OPAL release date: Unreleased developer copy
>> Ident string: 1.4a1r19677
>>
>> below is the output which seems to freeze just after the second spawn.
>>
>> [roberto@master TestOpenMPI]$ mpirun --verbose --debug-daemons --hostfile $PBS_NODEFILE -wdir "`pwd`" -np 1 testmaster 100000 $PBS_NODEFILE
>> [master.tekno-soft.it:30063] [[19516,0],0] orted_cmd: received add_local_procs
>> [master.tekno-soft.it:30063] [[19516,0],0] node[0].name master daemon 0 arch ffc91200
>> [master.tekno-soft.it:30063] [[19516,0],0] node[1].name cluster4 daemon INVALID arch ffc91200
>> [master.tekno-soft.it:30063] [[19516,0],0] node[2].name cluster3 daemon INVALID arch ffc91200
>> [master.tekno-soft.it:30063] [[19516,0],0] node[3].name cluster2 daemon INVALID arch ffc91200
>> [master.tekno-soft.it:30063] [[19516,0],0] node[4].name cluster1 daemon INVALID arch ffc91200
>> Initializing MPI ...
>> [master.tekno-soft.it:30063] [[19516,0],0] orted_recv: received sync+nidmap from local proc [[19516,1],0]
>> [master.tekno-soft.it:30063] [[19516,0],0] orted_cmd: received collective data cmd
>> [master.tekno-soft.it:30063] [[19516,0],0] orted_cmd: received message_local_procs
>> [master.tekno-soft.it:30063] [[19516,0],0] orted_cmd: received collective data cmd
>> [master.tekno-soft.it:30063] [[19516,0],0] orted_cmd: received message_local_procs
>> Loading the node's ring from file '/var/torque/aux//932.master.tekno-soft.it'
>> ... adding node #1 host is 'cluster4.tekno-soft.it'
>> ... adding node #2 host is 'cluster3.tekno-soft.it'
>> ... adding node #3 host is 'cluster2.tekno-soft.it'
>> ... adding node #4 host is 'cluster1.tekno-soft.it'
>> A 4 node's ring has been made
>> At least one node is available, let's start to distribute 100000 job across 4 nodes!!!
>> Setting up the host as 'cluster4.tekno-soft.it'
>> Setting the work directory as '/data/roberto/MPI/TestOpenMPI'
>> Spawning a task 'testslave.sh' on node 'cluster4.tekno-soft.it'
>> Daemon was launched on cluster4.tekno-soft.it - beginning to initialize
>> Daemon [[19516,0],1] checking in as pid 25123 on host cluster4.tekno-soft.it
>> Daemon [[19516,0],1] not using static ports
>> [cluster4.tekno-soft.it:25123] [[19516,0],1] orted: up and running - waiting for commands!
>> [master.tekno-soft.it:30063] [[19516,0],0] orted_cmd: received add_local_procs
>> [master.tekno-soft.it:30063] [[19516,0],0] node[0].name master daemon 0 arch ffc91200
>> [master.tekno-soft.it:30063] [[19516,0],0] node[1].name cluster4 daemon 1 arch ffc91200
>> [master.tekno-soft.it:30063] [[19516,0],0] node[2].name cluster3 daemon INVALID arch ffc91200
>> [master.tekno-soft.it:30063] [[19516,0],0] node[3].name cluster2 daemon INVALID arch ffc91200
>> [master.tekno-soft.it:30063] [[19516,0],0] node[4].name cluster1 daemon INVALID arch ffc91200
>> [cluster4.tekno-soft.it:25123] [[19516,0],1] orted_cmd: received add_local_procs
>> [cluster4.tekno-soft.it:25123] [[19516,0],1] node[0].name master daemon 0 arch ffc91200
>> [cluster4.tekno-soft.it:25123] [[19516,0],1] node[1].name cluster4 daemon 1 arch ffc91200
>> [cluster4.tekno-soft.it:25123] [[19516,0],1] node[2].name cluster3 daemon INVALID arch ffc91200
>> [cluster4.tekno-soft.it:25123] [[19516,0],1] node[3].name cluster2 daemon INVALID arch ffc91200
>> [cluster4.tekno-soft.it:25123] [[19516,0],1] node[4].name cluster1 daemon INVALID arch ffc91200
>> [cluster4.tekno-soft.it:25123] [[19516,0],1] orted_recv: received sync+nidmap from local proc [[19516,2],0]
>> [master.tekno-soft.it:30063] [[19516,0],0] orted_cmd: received collective data cmd
>> [master.tekno-soft.it:30063] [[19516,0],0] orted_cmd: received message_local_procs
>> [cluster4.tekno-soft.it:25123] [[19516,0],1] orted_cmd: received collective data cmd
>> [cluster4.tekno-soft.it:25123] [[19516,0],1] orted_cmd: received message_local_procs
>> [cluster4.tekno-soft.it:25123] [[19516,0],1] orted_cmd: received collective data cmd
>> [master.tekno-soft.it:30063] [[19516,0],0] orted_cmd: received collective data cmd
>> [master.tekno-soft.it:30063] [[19516,0],0] orted_cmd: received message_local_procs
>> [cluster4.tekno-soft.it:25123] [[19516,0],1] orted_cmd: received message_local_procs
>>
>> Let me know if you need my test program.
>>
>>>
>>> Thanks
>>> Ralph
>>>
>>>>
>>>>> Ralph
>>>>>
>>>>> On Oct 3, 2008, at 5:02 AM, Roberto Fichera wrote:
>>>>>
>>>>>> Ralph Castain wrote:
>>>>>>> Actually, it just occurred to me that you may be seeing a problem in
>>>>>>> comm_spawn itself that I am currently chasing down. It is in the 1.3
>>>>>>> branch and has to do with comm_spawning procs on subsets of nodes
>>>>>>> (instead of across all nodes). Could be related to this - you might
>>>>>>> want to give me a chance to complete the fix. I have identified the
>>>>>>> problem and should have it fixed later today in our trunk - probably
>>>>>>> won't move to the 1.3 branch for several days.
>>>>>> Do you have any news about the above fix? Does the fix is already
>>>>>> available for testing?
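P.S. For clarity, the mpich2 race I mentioned at the top shows up with an overlap of roughly this shape: one thread sitting in MPI_Comm_disconnect() waiting for a finished slave to go away, while another thread already issues the next MPI_Comm_spawn(). Again this is just a sketch with made-up names, not the attached code.

/*
 * Sketch only - illustrative names.  One thread tears down a finished
 * slave with MPI_Comm_disconnect() while the main thread spawns the
 * next one concurrently.
 */
#include <mpi.h>
#include <pthread.h>

static void *disconnect_thread(void *arg)
{
    MPI_Comm *finished_slave = arg;
    /* Blocks until the slave side disconnects too; with mpich2 this wait
     * (a poll() inside the library) seems to race with the spawn below. */
    MPI_Comm_disconnect(finished_slave);
    return NULL;
}

int main(int argc, char **argv)
{
    int provided;
    MPI_Init_thread(&argc, &argv, MPI_THREAD_MULTIPLE, &provided);

    MPI_Comm old_slave, new_slave;

    /* First slave: does its job and is then torn down in a thread. */
    MPI_Comm_spawn("testslave.sh", MPI_ARGV_NULL, 1, MPI_INFO_NULL,
                   0, MPI_COMM_SELF, &old_slave, MPI_ERRCODES_IGNORE);

    pthread_t t;
    pthread_create(&t, NULL, disconnect_thread, &old_slave);

    /* Next slave is spawned while the disconnect is still in flight:
     * this is the overlap that triggers the trouble. */
    MPI_Comm_spawn("testslave.sh", MPI_ARGV_NULL, 1, MPI_INFO_NULL,
                   0, MPI_COMM_SELF, &new_slave, MPI_ERRCODES_IGNORE);

    pthread_join(t, NULL);
    MPI_Comm_disconnect(&new_slave);
    MPI_Finalize();
    return 0;
}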
testspawn.tar.bz2
Description: application/bzip