On our cluster, we have three queues per host, each with as many slots as
the host has physical cores. The queues are configured as follows:
o lab.q (high priority queue for cluster "owners")
- load_thresholds np_load_avg=1.5
o short.q (for jobs <30 minutes)
- load_thresholds np_load_avg=1.25
o long.q (low priority queue available to all users)
- load_thresholds np_load_avg=0.9
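For reference, the relevant bit of the queue configs looks roughly like this (an excerpt only -- other attributes omitted, and note that np_load_avg is load_avg divided by the processor count SGE reports for the host):

```shell
# Excerpt from "qconf -sq long.q" (other attributes omitted)
qname             long.q
load_thresholds   np_load_avg=0.9
```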
The theory is that we want long.q to stop accepting jobs when a node is
fully loaded (read: load equals the physical core count) and short.q to
stop accepting jobs when a node is 50% overloaded. This has worked well
for a long while.
On nodes that support it (and not all of ours do), we leave hyperthreading
on as it is a net win on those nodes. As core counts have increased,
though, a problem has become blindingly obvious -- the above scheme
doesn't work anymore. long.q never goes into alarm mode since the load
doesn't hit the NCPU reported by SGE. This is true on both OGS 2011.11p1
and SoGE 8.1.9.
I thought I could fix this using load_scaling on the exec hosts with
hyperthreading, but I can't get it to work. I've tried defining
"load_avg=2" and/or "np_load_avg=2", but neither setting seems to have
any effect. What am I doing wrong?
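For concreteness, this is roughly what I set via "qconf -me" on a hyperthreaded exec host ("node42" is a made-up host name, and I've tried each scaling line separately and together):

```shell
# Host config excerpt as shown by "qconf -se node42" after editing
hostname        node42
load_scaling    load_avg=2.000000,np_load_avg=2.000000
```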
Thanks.
--
Joshua Baker-LePain
QB3 Shared Cluster Sysadmin
UCSF
_______________________________________________
users mailing list
users@gridengine.org
https://gridengine.org/mailman/listinfo/users