On Wed, 14 Mar 2012 at 6:31pm, Reuti wrote

I just tested with two different queues on two machines and a small mpihello and it is working as expected.

At this point the narrative is getting very confused, even for me. So I tried to find a clear-cut case where changing one thing flips between "it works" and "it doesn't":

Case "it works":
 o Setup 2 queues -- lab.q and test.q.  Both run at priority 0.  lab.q has
   slots=cores on each host, test.q has 1 slot per host.

 o Submit job via:
   qsub -q "lab.q|test.q" -l mem_free=150M -pe ompi 64 jobscript.sh

 o Job runs just fine.  Running 'ps aufx' on one of the nodes shows 2 orted
   processes, one with 4 children (the processes running in the lab.q
   slots) and one with 1 child (the process running in the test.q slot),
   all happily running (caution: very long lines ahead):

sge       9673  0.0  0.0  14224  1204 ?        S    14:31   0:00  \_ sge_shepherd-6997934 -bg
root      9674  0.0  0.0  11272   892 ?        Ss   14:31   0:00  |   \_ /ccpr1/sge6/utilbin/lx24-amd64/rshd -l
jlb       9677  0.0  0.0   8988   700 ?        S    14:31   0:00  |       \_ /ccpr1/sge6/utilbin/lx24-amd64/qrsh_starter /var/spool/sge/opt95/active_jobs/6997934.1/1.opt95
jlb       9679  0.1  0.0  47932  2008 ?        S    14:31   0:00  |           \_ orted -mca ess env -mca orte_ess_jobid 1517355008 -mca orte_ess_vpid 5 -mca orte_ess_num_procs 24 --hnp-uri 1517355008.0;tcp://172.19.12.104:47527
jlb       9690 53.6  0.0 157376  3832 ?        R    14:31   0:02  |               \_ /netapp/sali/jlb/mybin/mpihello-long.ompi-1.4.3-debug
jlb       9691 50.8  0.0 157376  3832 ?        R    14:31   0:02  |               \_ /netapp/sali/jlb/mybin/mpihello-long.ompi-1.4.3-debug
jlb       9692 37.0  0.0 157376  3828 ?        R    14:31   0:01  |               \_ /netapp/sali/jlb/mybin/mpihello-long.ompi-1.4.3-debug
jlb       9693 49.2  0.0 157376  3824 ?        R    14:31   0:02  |               \_ /netapp/sali/jlb/mybin/mpihello-long.ompi-1.4.3-debug
sge       9675  0.0  0.0  14228  1208 ?        S    14:31   0:00  \_ sge_shepherd-6997934 -bg
root      9676  0.0  0.0  11268   888 ?        Ss   14:31   0:00      \_ /ccpr1/sge6/utilbin/lx24-amd64/rshd -l
jlb       9678  0.0  0.0   8992   708 ?        S    14:31   0:00          \_ /ccpr1/sge6/utilbin/lx24-amd64/qrsh_starter /var/spool/sge/opt95/active_jobs/6997934.1/2.opt95
jlb       9680  0.0  0.0  47932  2000 ?        S    14:31   0:00              \_ orted -mca ess env -mca orte_ess_jobid 1517355008 -mca orte_ess_vpid 6 -mca orte_ess_num_procs 24 --hnp-uri 1517355008.0;tcp://172.19.12.104:47527
jlb       9689 36.8  0.0  89776  3672 ?        R    14:31   0:01                  \_ /netapp/sali/jlb/mybin/mpihello-long.ompi-1.4.3-debug
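
For anyone who wants to check the "how many children per orted" count without reading the tree by eye, here is a minimal sketch. It assumes a Linux procps 'ps' and an awk on the node; the function name count_ranks_per_orted is made up for illustration and is not part of SGE or Open MPI.

```shell
# Hypothetical helper: count the direct children of every orted process.
# Input: one "PID PPID STAT COMMAND" line per process on stdin,
# e.g. from: ps -eo pid=,ppid=,stat=,comm=
count_ranks_per_orted() {
    awk '
        $4 == "orted" { orted[$1] = 1 }   # remember the PID of each orted
        { parent[$1] = $2 }               # record every PID -> PPID link
        END {
            # count processes whose parent is one of the orted PIDs
            for (pid in parent)
                if (parent[pid] in orted) n[parent[pid]]++
            for (p in orted)
                printf "orted %s: %d child(ren)\n", p, n[p]
        }
    ' | sort   # awk array order is unspecified, so sort for stable output
}
```

On a compute node this would be run as 'ps -eo pid=,ppid=,stat=,comm= | count_ranks_per_orted'.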

Case "it doesn't":
 o Take the above queue setup, and simply change test.q to have 2 slots
   per host.

 o Submit job with the same qsub line.

 o Job crashes.  I had 'ps aufx' running in a continuous loop on one of the
   nodes.  This was the last output which showed the job processes.  Note
   that the actual mpihello processes never got into the "R" state:

sge      12423  0.0  0.0  14224  1196 ?        S    14:41   0:00  \_ sge_shepherd-6997938 -bg
root     12425  0.0  0.0  11272   896 ?        Ss   14:41   0:00  |   \_ /ccpr1/sge6/utilbin/lx24-amd64/rshd -l
jlb      12428  0.0  0.0   8988   700 ?        S    14:41   0:00  |       \_ /ccpr1/sge6/utilbin/lx24-amd64/qrsh_starter /var/spool/sge/opt65/active_jobs/6997938.1/1.opt65
jlb      12430  0.0  0.0  47932  2016 ?        S    14:41   0:00  |           \_ orted -mca ess env -mca orte_ess_jobid 1468006400 -mca orte_ess_vpid 7 -mca orte_ess_num_procs 20 --hnp-uri 1468006400.0;tcp://172.19.12.104:39940
jlb      12798  1.0  0.0 153244  3752 ?        S    14:41   0:00  |               \_ /netapp/sali/jlb/mybin/mpihello-long.ompi-1.4.3-debug
jlb      12799  2.0  0.0 153244  3752 ?        S    14:41   0:00  |               \_ /netapp/sali/jlb/mybin/mpihello-long.ompi-1.4.3-debug
jlb      12800  1.0  0.0 153244  3752 ?        S    14:41   0:00  |               \_ /netapp/sali/jlb/mybin/mpihello-long.ompi-1.4.3-debug
sge      12436  0.0  0.0  14228  1208 ?        S    14:41   0:00  \_ sge_shepherd-6997938 -bg
root     12437  0.0  0.0  11268   884 ?        Ss   14:41   0:00      \_ /ccpr1/sge6/utilbin/lx24-amd64/rshd -l
jlb      12439  0.0  0.0   8992   712 ?        S    14:41   0:00          \_ /ccpr1/sge6/utilbin/lx24-amd64/qrsh_starter /var/spool/sge/opt65/active_jobs/6997938.1/2.opt65
jlb      12441  0.1  0.0  47932  2012 ?        S    14:41   0:00              \_ orted -mca ess env -mca orte_ess_jobid 1468006400 -mca orte_ess_vpid 8 -mca orte_ess_num_procs 20 --hnp-uri 1468006400.0;tcp://172.19.12.104:39940
jlb      12795  1.0  0.0 153100  3128 ?        S    14:41   0:00                  \_ /netapp/sali/jlb/mybin/mpihello-long.ompi-1.4.3-debug
jlb      12796  2.0  0.0 153232  3752 ?        S    14:41   0:00                  \_ /netapp/sali/jlb/mybin/mpihello-long.ompi-1.4.3-debug
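
The "never got into the R state" observation above can also be checked mechanically instead of by reading each snapshot. The following is only a sketch under the same assumptions (Linux procps 'ps', awk available); tally_states is a hypothetical helper name, and the pattern passed to it is whatever substring identifies the MPI binary.

```shell
# Hypothetical helper: tally process states for one program.
# Input: "STAT COMMAND [ARGS...]" lines on stdin, e.g. from: ps -eo stat=,args=
# $1: substring to look for in the command path (field 2 only, so the
#     awk process running this filter does not count itself).
tally_states() {
    awk -v pat="$1" '
        index($2, pat) {
            s = substr($1, 1, 1)   # first letter of STAT: R, S, D, ...
            count[s]++
        }
        END { for (s in count) printf "%s=%d\n", s, count[s] }
    ' | sort
}
```

Polling once a second with something like 'while sleep 1; do date +%T; ps -eo stat=,args= | tally_states mpihello-long; done' would print a timestamped tally per snapshot. In the failing case one would expect to see only S entries before the processes disappear.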


Joshua: is the CentOS 6 install the same on all nodes, and did you recompile the application with the actual version of the library? By "threads" do you mean "processes"?

All the nodes are installed from the same kickstart file and kept fully
up to date.  And, yes, the application is compiled against the exact
library I'm running it with.

Thanks again to all for looking at this.

--
Joshua Baker-LePain
QB3 Shared Cluster Sysadmin
UCSF
