On our cluster, we have three queues per host, each with as many slots as the host has physical cores. The queues are configured as follows:

 o lab.q (high priority queue for cluster "owners")
   - load_thresholds       np_load_avg=1.5
 o short.q (for jobs <30 minutes)
   - load_thresholds       np_load_avg=1.25
 o long.q (low priority queue available to all users)
   - load_thresholds       np_load_avg=0.9

The theory is that we want long.q to stop accepting jobs when a node is fully loaded (read: load = physical core count) and short.q to stop accepting jobs when a node is 25% overloaded. This has worked well for a long while.

On nodes that support it (and not all of ours do), we leave hyperthreading on as it is a net win on those nodes. As core counts have increased, though, a problem has become blindingly obvious -- the above scheme doesn't work anymore. long.q never goes into alarm mode since the load doesn't hit the NCPU reported by SGE. This is true on both OGS 2011.11p1 and SoGE 8.1.9.
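To make the arithmetic concrete, here's a quick sketch (the 16-core node is hypothetical) of why long.q never alarms once hyperthreading doubles the processor count SGE uses for np_load_avg:

```python
# np_load_avg = load_avg / num_proc, where num_proc is the processor
# count SGE detects -- hyperthreads included, not just physical cores.
physical_cores = 16        # hypothetical node
load_avg = 16.0            # fully loaded: one runnable task per physical core

# Without hyperthreading, num_proc == physical cores:
np_no_ht = load_avg / physical_cores        # 1.0 -> exceeds long.q's 0.9
# With hyperthreading, num_proc doubles:
np_ht = load_avg / (2 * physical_cores)     # 0.5 -> below every threshold

print(np_no_ht, np_ht)   # 1.0 0.5
```

So a fully loaded hyperthreaded node sits at np_load_avg 0.5 and never trips even the 0.9 threshold on long.q.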

I thought I could fix this using load_scaling on the exec hosts with hyperthreading, but I can't get it to work. I've tried defining "load_avg=2" and/or "np_load_avg=2", but neither configuration seems to have any effect. What am I doing wrong?
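For reference, the exec host entry I'm setting (via `qconf -me`) looks roughly like this -- the hostname is illustrative:

```
hostname              node-01.example.org
load_scaling          np_load_avg=2.000000
```

The intent is to scale the reported np_load_avg back up by 2x on hyperthreaded hosts so the thresholds behave as if only physical cores were counted.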

Thanks.

--
Joshua Baker-LePain
QB3 Shared Cluster Sysadmin
UCSF
_______________________________________________
users mailing list
users@gridengine.org
https://gridengine.org/mailman/listinfo/users
