On Mar 14, 2012, at 5:44 PM, Reuti wrote:

> On 14.03.2012 at 23:48, Joshua Baker-LePain wrote:
>
>> On Wed, 14 Mar 2012 at 6:31pm, Reuti wrote
>>
>>> I just tested with two different queues on two machines and a small
>>> mpihello, and it is working as expected.
>>
>> At this point the narrative is getting very confused, even for me. So I
>> tried to find a clear-cut case where I can change one thing to flip between
>> "it works" and "it doesn't":
>>
>> Case "it works":
>> o Set up 2 queues -- lab.q and test.q. Both run at priority 0. lab.q has
>>   slots=cores on each host, test.q has 1 slot per host.
>>
>> o Submit the job via:
>>     qsub -q "lab.q|test.q" -l mem_free=150M -pe ompi 64 jobscript.sh
>>
>> o The job runs just fine. Running 'ps aufx' on one of the nodes shows 2 orted
>>   processes, one with 4 children (the processes running in the lab.q
>>   slots) and one with 1 child (the process running in the test.q slot),
>>   all happily running (caution: very long lines ahead):
>>
>> sge    9673  0.0  0.0  14224 1204 ?  S  14:31 0:00  \_ sge_shepherd-6997934 -bg
>> root   9674  0.0  0.0  11272  892 ?  Ss 14:31 0:00  |   \_ /ccpr1/sge6/utilbin/lx24-amd64/rshd -l
>
> Which version of SGE are you using? The traditional rsh startup was replaced
> by the builtin startup some time ago (although it should still work).
>
>> jlb    9677  0.0  0.0   8988  700 ?  S  14:31 0:00  |       \_ /ccpr1/sge6/utilbin/lx24-amd64/qrsh_starter /var/spool/sge/opt95/active_jobs/6997934.1/1.opt95
>> jlb    9679  0.1  0.0  47932 2008 ?  S  14:31 0:00  |           \_ orted -mca ess env -mca orte_ess_jobid 1517355008 -mca orte_ess_vpid 5 -mca orte_ess_num_procs 24 --hnp-uri 1517355008.0;tcp://172.19.12.104:47527
>> jlb    9690 53.6  0.0 157376 3832 ?  R  14:31 0:02  |               \_ /netapp/sali/jlb/mybin/mpihello-long.ompi-1.4.3-debug
>> jlb    9691 50.8  0.0 157376 3832 ?  R  14:31 0:02  |               \_ /netapp/sali/jlb/mybin/mpihello-long.ompi-1.4.3-debug
>> jlb    9692 37.0  0.0 157376 3828 ?  R  14:31 0:01  |               \_ /netapp/sali/jlb/mybin/mpihello-long.ompi-1.4.3-debug
>> jlb    9693 49.2  0.0 157376 3824 ?  R  14:31 0:02  |               \_ /netapp/sali/jlb/mybin/mpihello-long.ompi-1.4.3-debug
>> sge    9675  0.0  0.0  14228 1208 ?  S  14:31 0:00  \_ sge_shepherd-6997934 -bg
>> root   9676  0.0  0.0  11268  888 ?  Ss 14:31 0:00      \_ /ccpr1/sge6/utilbin/lx24-amd64/rshd -l
>> jlb    9678  0.0  0.0   8992  708 ?  S  14:31 0:00          \_ /ccpr1/sge6/utilbin/lx24-amd64/qrsh_starter /var/spool/sge/opt95/active_jobs/6997934.1/2.opt95
>> jlb    9680  0.0  0.0  47932 2000 ?  S  14:31 0:00              \_ orted -mca ess env -mca orte_ess_jobid 1517355008 -mca orte_ess_vpid 6 -mca orte_ess_num_procs 24 --hnp-uri 1517355008.0;tcp://172.19.12.104:47527
>> jlb    9689 36.8  0.0  89776 3672 ?  R  14:31 0:01                  \_ /netapp/sali/jlb/mybin/mpihello-long.ompi-1.4.3-debug
>
> Maybe this already shows the problem: there are two `qrsh -inherit` calls, as
> Open MPI thinks these are different machines (I ran with only one slot on each
> host, hence didn't hit it at first, but I can reproduce it now). But for SGE
> both may end up in the same queue, overwriting the openmpi-session in $TMPDIR.
>
> Although it's running: do you get all the output? If I request 4 slots and get
> one from each queue on both machines, the mpihello outputs only 3 lines: the
> "Hello World from Node 3" is always missing.
>
> (I was just typing when Ralph's message came in: I can confirm this. To avoid
> it, Open MPI would have to collect all lines of the hostfile which refer to
> the same machine; SGE creates an entry for each queue/host pair in the
> machine file.)
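
That per-host collapse would look roughly like this in shell terms -- just a
sketch of the idea, assuming the usual "host slots queue processor-range"
layout of $PE_HOSTFILE (one line per queue/host pair), and not something the
allocator actually does today:

    # Sum the slot counts of all PE_HOSTFILE lines that name the same host,
    # so each machine appears exactly once with its total slot count.
    awk '{ slots[$1] += $2 } END { for (h in slots) print h, slots[h] }' "$PE_HOSTFILE"

With that, a host contributing 4 slots from lab.q and 1 slot from test.q would
show up as a single entry with 5 slots, and only one orted would be started on
it.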
Hmmm… I can take a look at the allocator module and see why we aren't doing
it. Would the host names be the same for the two queues?

> ==
>
> @Ralph: it could work if SGE had a facility to request the desired queue in
> `qrsh -inherit ...`, because then $TMPDIR would be unique for each orted
> again (assuming it's using different ports for each).

Gotcha! I suspect getting the allocator to handle this cleanly is the better
solution, though.

> -- Reuti
>
>
>> Case "it doesn't":
>> o Take the above queue setup, and simply change test.q to have 2 slots
>>   per host.
>>
>> o Submit the job with the same qsub line.
>>
>> o The job crashes. I had 'ps aufx' running in a continuous loop on one of
>>   the nodes. This was the last output which showed the job processes. Note
>>   that the actual mpihello processes never got into the "R" state:
>>
>> sge   12423  0.0  0.0  14224 1196 ?  S  14:41 0:00  \_ sge_shepherd-6997938 -bg
>> root  12425  0.0  0.0  11272  896 ?  Ss 14:41 0:00  |   \_ /ccpr1/sge6/utilbin/lx24-amd64/rshd -l
>> jlb   12428  0.0  0.0   8988  700 ?  S  14:41 0:00  |       \_ /ccpr1/sge6/utilbin/lx24-amd64/qrsh_starter /var/spool/sge/opt65/active_jobs/6997938.1/1.opt65
>> jlb   12430  0.0  0.0  47932 2016 ?  S  14:41 0:00  |           \_ orted -mca ess env -mca orte_ess_jobid 1468006400 -mca orte_ess_vpid 7 -mca orte_ess_num_procs 20 --hnp-uri 1468006400.0;tcp://172.19.12.104:39940
>> jlb   12798  1.0  0.0 153244 3752 ?  S  14:41 0:00  |               \_ /netapp/sali/jlb/mybin/mpihello-long.ompi-1.4.3-debug
>> jlb   12799  2.0  0.0 153244 3752 ?  S  14:41 0:00  |               \_ /netapp/sali/jlb/mybin/mpihello-long.ompi-1.4.3-debug
>> jlb   12800  1.0  0.0 153244 3752 ?  S  14:41 0:00  |               \_ /netapp/sali/jlb/mybin/mpihello-long.ompi-1.4.3-debug
>> sge   12436  0.0  0.0  14228 1208 ?  S  14:41 0:00  \_ sge_shepherd-6997938 -bg
>> root  12437  0.0  0.0  11268  884 ?  Ss 14:41 0:00      \_ /ccpr1/sge6/utilbin/lx24-amd64/rshd -l
>> jlb   12439  0.0  0.0   8992  712 ?  S  14:41 0:00          \_ /ccpr1/sge6/utilbin/lx24-amd64/qrsh_starter /var/spool/sge/opt65/active_jobs/6997938.1/2.opt65
>> jlb   12441  0.1  0.0  47932 2012 ?  S  14:41 0:00              \_ orted -mca ess env -mca orte_ess_jobid 1468006400 -mca orte_ess_vpid 8 -mca orte_ess_num_procs 20 --hnp-uri 1468006400.0;tcp://172.19.12.104:39940
>> jlb   12795  1.0  0.0 153100 3128 ?  S  14:41 0:00                  \_ /netapp/sali/jlb/mybin/mpihello-long.ompi-1.4.3-debug
>> jlb   12796  2.0  0.0 153232 3752 ?  S  14:41 0:00                  \_ /netapp/sali/jlb/mybin/mpihello-long.ompi-1.4.3-debug
>
>
>>> Joshua: is CentOS 6 the same on all nodes, and did you recompile the
>>> application against the actual version of the library you are running?
>>> By "threads" do you mean "processes"?
>>
>> All the nodes are installed from the same kickstart file and kept fully
>> up to date. And, yes, the application is compiled against the exact
>> library I'm running it with.
>>
>> Thanks again to all for looking at this.
>>
>> --
>> Joshua Baker-LePain
>> QB3 Shared Cluster Sysadmin
>> UCSF
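
P.S. For anyone trying to reproduce Joshua's two cases: the only difference
between "it works" and "it doesn't" is the slot count on test.q. Roughly this
(a sketch from memory -- check the qconf syntax against your SGE version):

    # Failing case: give test.q 2 slots per host instead of 1
    # (lab.q keeps slots = number of cores on each host).
    qconf -mattr queue slots 2 test.q

    # Then submit exactly as before:
    qsub -q "lab.q|test.q" -l mem_free=150M -pe ompi 64 jobscript.sh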