We have multiple queue instances on each node, each with slots equal to the number of CPUs. To prevent oversubscription, I added a slots consumable to each host, restricting it to a number of slots equal to the CPUs on the node. This has worked up to now, but this morning a couple of jobs have managed to overconsume host slots, using all the slots in two queues, or twice the number defined for the node as a whole. Using qstat -f I see the following:
hc:slots=-4

Checking $SGE_ROOT/$SGE_CELL/spool/exec_hosts/<nodename> shows a modification time prior to the job starting and the right number of slots configured (as does the output of qconf -se hostname), so it's not a case of the value of the host consumable being reduced after job start. There are no other jobs running on these nodes.

Any idea what could cause this? Or how to prevent it in future?
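
In the meantime, a rough way to at least spot the condition when it recurs is to scan the qstat output for a negative hc:slots value. The following is only a sketch: it assumes the text comes from something like qstat -f -F slots, that each queue instance line starts in column one with a queue@host name, and that the hc:slots line following it is indented. The function name find_oversubscribed is made up for illustration. It detects the problem; it does not explain or prevent it.

```python
import re

# Matches an indented host-consumable line such as "    hc:slots=-4".
HC_SLOTS = re.compile(r"^\s+hc:slots=(-?\d+(?:\.\d+)?)\s*$")


def find_oversubscribed(qstat_text):
    """Return {host: remaining_slots} for hosts reporting negative hc:slots.

    Assumes (not verified against every Grid Engine version) that a queue
    instance line begins with "queue@host" and that its resource lines
    follow, indented.
    """
    negative = {}
    current_host = None
    for line in qstat_text.splitlines():
        if line and not line[0].isspace() and "@" in line.split()[0]:
            # Queue instance line: remember the host part after the "@".
            current_host = line.split()[0].split("@", 1)[1]
            continue
        match = HC_SLOTS.match(line)
        if match and current_host is not None:
            remaining = float(match.group(1))
            if remaining < 0:
                negative[current_host] = remaining
    return negative
```

Feeding it the captured qstat output from a cron job and mailing any non-empty result would give early warning the next time a host goes negative.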
