On our cluster, we have three queues per host, each with as many slots as the host has physical cores. The queues are configured as follows:

 o lab.q (high priority queue for cluster "owners")
   - load_thresholds       np_load_avg=1.5
 o short.q (for jobs <30 minutes)
   - load_thresholds       np_load_avg=1.25
 o long.q (low priority queue available to all users)
   - load_thresholds       np_load_avg=0.9

The theory is that we want long.q to stop accepting jobs when a node is fully loaded (read: load = physical core count) and short.q to stop accepting jobs when a node is 25% overloaded. This has worked well for a long while.

On nodes that support it (and not all of ours do), we leave hyperthreading on as it is a net win on those nodes. As core counts have increased, though, a problem has become blindingly obvious -- the above scheme doesn't work anymore. long.q never goes into alarm mode since the load doesn't hit the NCPU reported by SGE. This is true on both OGS 2011.11p1 and SoGE 8.1.9.
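To make the arithmetic concrete, here's a quick sketch (the 16-core node is hypothetical) of why long.q never alarms once hyperthreading doubles the processor count SGE uses for np_load_avg:

```python
# np_load_avg = load_avg / num_proc, where num_proc is the processor
# count SGE detects -- hyperthreads included, not just physical cores.
physical_cores = 16        # hypothetical node
load_avg = 16.0            # fully loaded: one runnable task per physical core

# Without hyperthreading, num_proc == physical cores:
np_no_ht = load_avg / physical_cores        # 1.0 -> exceeds long.q's 0.9
# With hyperthreading, num_proc doubles:
np_ht = load_avg / (2 * physical_cores)     # 0.5 -> below every threshold

print(np_no_ht, np_ht)   # 1.0 0.5
```

So a fully loaded hyperthreaded node sits at np_load_avg 0.5 and never trips even the 0.9 threshold on long.q.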

I thought I could fix this using load_scaling on the exec hosts with hyperthreading, but I can't get it to work. I've tried defining "load_avg=2" and/or "np_load_avg=2", but neither configuration seems to have any effect. What am I doing wrong?
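For reference, the exec host entry I'm setting (via `qconf -me`) looks roughly like this -- the hostname is illustrative:

```
hostname              node-01.example.org
load_scaling          np_load_avg=2.000000
```

The intent is to scale the reported np_load_avg back up by 2x on hyperthreaded hosts so the thresholds behave as if only physical cores were counted.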

Thanks.

--
Joshua Baker-LePain
QB3 Shared Cluster Sysadmin
UCSF
_______________________________________________
users mailing list
users@gridengine.org
https://gridengine.org/mailman/listinfo/users
