On Wed, 2 Nov 2016 at 5:05pm, Reuti wrote:

Just for the record: to investigate this, I defined an additional load threshold, besides the one under test, which always puts the queue into an alarm state. I used our tmpfree complex for it and entered a value beyond the installed disk. This way `qstat -explain a` always gives output, and even the values of complexes whose thresholds aren't exceeded get displayed. I got:

$ qstat -explain a -q serial@node29 -s r
queuename                      qtype resv/used/tot. load_avg arch          states
---------------------------------------------------------------------------------
serial@node29                  B     0/0/16         15.75    lx24-em64t    a
        alarm hl:tmpfree=1842222120k load-threshold=2T
        alarm hl:np_load_avg=0.492188 load-threshold=0.5

$ qstat -explain a -q serial@node29 -s r
queuename                      qtype resv/used/tot. load_avg arch          states
---------------------------------------------------------------------------------
serial@node29                  B     0/0/16         15.75    lx24-em64t    a
        alarm hl:tmpfree=1842222120k load-threshold=2T
        alarm hl:np_load_avg=   9.844 load-threshold=0.5

$ qstat -explain a -q serial@node29 -s r
queuename                      qtype resv/used/tot. load_avg arch          states
---------------------------------------------------------------------------------
serial@node29                  B     0/0/16         15.76    lx24-em64t    a
        alarm hl:tmpfree=1842221988k load-threshold=2T
        alarm hl:np_load_avg=   0.246 load-threshold=0.5

These outputs correspond to load_scaling settings for np_load_avg of NONE, 20, and 0.5 on the exec host, respectively. Looks fine. Hence your np_load_avg=2 should have worked.
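
To reproduce this setup, something along these lines should do it (an untested sketch; it assumes the tmpfree complex is already reported for the host, and uses the queue and host names from above):

$ qconf -mattr queue load_thresholds np_load_avg=0.5,tmpfree=2T serial
$ qconf -mattr exechost load_scaling np_load_avg=0.500000 node29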

The plot thickens. Doing testing similar to yours, it looks like this is a display bug in qhost. Here are two configurations that both create an alarm state, but in one of them the alarm doesn't show up in the output of 'qhost':

Config 1:
$ qconf -sq long.q
load_thresholds       np_load_avg=0.5
$ qconf -se msg-id1
load_scaling          NONE
$ qhost -q -h msg-id1
HOSTNAME                ARCH         NCPU NSOC NCOR NTHR  LOAD  MEMTOT  MEMUSE  SWAPTO  SWAPUS
----------------------------------------------------------------------------------------------
msg-id1                 lx-amd64       48    2   24   48 24.63  251.6G    2.2G    4.0G     0.0
   member.q             BP    0/24/24
   short.q              BP    0/0/24
   long.q               BP    0/0/24        a
$ qstat -explain a -q long.q@msg-id1 -s r
queuename                      qtype resv/used/tot. load_avg arch          states
---------------------------------------------------------------------------------
long.q@msg-id1.ic.ucsf.edu     BP    0/0/24         24.62    lx-amd64      a
        alarm hl:np_load_avg=0.512917 load-threshold=0.5
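
(The numbers check out: 24.62 / 48 cores ≈ 0.513, matching the unscaled np_load_avg in the explain line and just over the 0.5 threshold, and qhost shows the 'a' state as expected.)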

Config 2:
$ qconf -sq long.q
load_thresholds       np_load_avg=0.9
$ qconf -se msg-id1
load_scaling          np_load_avg=2.000000
$ qhost -q -h msg-id1
HOSTNAME                ARCH         NCPU NSOC NCOR NTHR  LOAD  MEMTOT  MEMUSE  SWAPTO  SWAPUS
----------------------------------------------------------------------------------------------
msg-id1                 lx-amd64       48    2   24   48 24.58  251.6G    2.2G    4.0G     0.0
   member.q             BP    0/24/24
   short.q              BP    0/0/24
   long.q               BP    0/0/24
$ qstat -explain a -q long.q@msg-id1 -s r
queuename                      qtype resv/used/tot. load_avg arch          states
---------------------------------------------------------------------------------
long.q@msg-id1.ic.ucsf.edu     BP    0/0/24         24.58    lx-amd64      a
        alarm hl:np_load_avg=   1.024 load-threshold=0.9
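
(Again the arithmetic checks out: 24.58 / 48 ≈ 0.512, and with the load_scaling factor of 2 that becomes 1.024, over the 0.9 threshold. So the alarm itself is correct; only the 'a' is missing from the qhost display.)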

In both configs, long.q correctly refuses to accept jobs, but the qhost display error is sure to confuse users as to why. I'm going to stick with the previous solution, and I'll file a bug to try to get this fixed.
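
In the meantime, a rough way to cross-check the two from a script (an untested sketch; the awk field positions assume the default output formats shown above):

$ qstat -f -q long.q@msg-id1 | awk 'NR > 2 && NF == 6 && $6 ~ /a/ {print $1, "alarmed per qstat"}'
$ qhost -q -h msg-id1 | awk '$1 == "long.q" && NF == 4 && $4 ~ /a/ {print $1, "alarmed per qhost"}'

With Config 2, the first command reports the alarm while the second prints nothing.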

Thanks again for all your help, Reuti.

--
Joshua Baker-LePain
QB3 Shared Cluster Sysadmin
UCSF