We have multiple queue instances on each node, each with slots equal to
the number of CPUs.  To prevent oversubscription I added a slots
consumable to each host, restricting it to a number of slots equal to
the number of CPUs on the node.
This has worked until now, but this morning a couple of jobs managed
to overconsume host slots, using all the slots in two queues, i.e.
twice the number defined for the node as a whole.  Using
qstat -f I see the following:

 hc:slots=-4

Checking $SGE_ROOT/$SGE_CELL/spool/exec_hosts/<nodename> shows a
modification time prior to the job starting and the right number of
slots configured (as does the output of qconf -se hostname), so it's
not a case of the value of the host consumable being reduced after
job start.  There are no other jobs running on these nodes.
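
For anyone wanting to check their own nodes for the same symptom, here
is a minimal sketch in Python that scans saved qstat output for negative
hc:slots values.  It assumes each queue instance appears as a
queue@host line followed by indented resource lines like the one shown
above; the sample text, queue names and host names are made up for
illustration and are not taken from our cluster.

```python
import re

# Assumed layout: a "queue@host ..." line, then indented resource lines.
QUEUE_LINE = re.compile(r"^(\S+)@(\S+)\s")
HOST_SLOTS = re.compile(r"^\s+hc:slots=(-?\d+(?:\.\d+)?)\s*$")

# Hypothetical sample: two queues on node01 (4 CPUs) both fully used.
SAMPLE = (
    "a.q@node01   BIP   0/4/4   8.00   lx-amd64\n"
    "\thc:slots=-4\n"
    "b.q@node01   BIP   0/4/4   8.00   lx-amd64\n"
    "\thc:slots=-4\n"
    "a.q@node02   BIP   0/2/4   2.00   lx-amd64\n"
    "\thc:slots=2\n"
)


def oversubscribed_hosts(qstat_text):
    """Return {host: hc_slots} for hosts whose host-level slots are negative."""
    result = {}
    host = None
    for line in qstat_text.splitlines():
        match = QUEUE_LINE.match(line)
        if match:
            # Remember which host the following resource lines belong to.
            host = match.group(2)
            continue
        match = HOST_SLOTS.match(line)
        if match and host is not None:
            value = float(match.group(1))
            if value < 0:
                result[host] = value
    return result


if __name__ == "__main__":
    for name, slots in sorted(oversubscribed_hosts(SAMPLE).items()):
        print(name, slots)
```

This only reports the symptom; it does not explain why the scheduler
let the host consumable go negative.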

Any idea what could cause this?  Or how to prevent it in future?