On 7 March 2012 09:47, William Hay <[email protected]> wrote:
> We have multiple queue instances on each node each with slots equal to
> the number of cpus.  To prevent oversubscription I added a slots
> consumable to each host restricting it to a number of slots equal to
> the cpus on the node.
> This has worked up to now but this morning there are a couple of jobs
> that have managed to overconsume host slots using all the slots in two
> queues or twice the number defined for the node as a whole.  Using
> qstat -f I see the following:
>
>  hc:slots=-4
>
> Checking $SGE_ROOT/$SGE_CELL/spool/exec_hosts/<nodename> shows a
> modification time prior to the job starting and the right number of
> slots configured (as does the output of qconf -se hostname) so it's
> not a case of the value of the host consumable being reduced after
> job start.  There are no other jobs running on these nodes.
>
> Any idea what could cause this?  Or how to prevent it in future?

A bit of extra information: the schedule file contains the following
lines adjacent to each other (i.e. in the same scheduling cycle).
Note how each job reserves resources twice on the same node (including
wex, which is an exclusive resource).

487297:1:RUNNING:1331037343:87300:Q:[email protected]:slots:4.000000
487297:1:RUNNING:1331037343:87300:Q:[email protected]:xex:4.000000
487297:1:RUNNING:1331037343:87300:H:node-a40.data.legion.ucl.ac.uk:tmpfs:42949672960.000000
487297:1:RUNNING:1331037343:87300:H:node-a40.data.legion.ucl.ac.uk:wex:4.000000
487297:1:RUNNING:1331037343:87300:H:node-a40.data.legion.ucl.ac.uk:memory:4294967296.000000
487297:1:RUNNING:1331037343:87300:H:node-a40.data.legion.ucl.ac.uk:threads:4.000000
487297:1:RUNNING:1331037343:87300:H:node-a40.data.legion.ucl.ac.uk:slots:4.000000
487297:1:RUNNING:1331037343:87300:Q:[email protected]:slots:4.000000
487297:1:RUNNING:1331037343:87300:Q:[email protected]:xex:4.000000
487297:1:RUNNING:1331037343:87300:H:node-a40.data.legion.ucl.ac.uk:tmpfs:42949672960.000000
487297:1:RUNNING:1331037343:87300:H:node-a40.data.legion.ucl.ac.uk:wex:4.000000
487297:1:RUNNING:1331037343:87300:H:node-a40.data.legion.ucl.ac.uk:memory:4294967296.000000
487297:1:RUNNING:1331037343:87300:H:node-a40.data.legion.ucl.ac.uk:threads:4.000000
487297:1:RUNNING:1331037343:87300:H:node-a40.data.legion.ucl.ac.uk:slots:4.000000
487297:1:RUNNING:1331037343:87300:G:global:penalty:10800.000000
487297:1:RUNNING:1331037343:87300:P:qlc-1:slots:8.000000
487295:1:RUNNING:1331037343:87300:Q:[email protected]:slots:4.000000
487295:1:RUNNING:1331037343:87300:Q:[email protected]:xex:4.000000
487295:1:RUNNING:1331037343:87300:H:node-a36.data.legion.ucl.ac.uk:tmpfs:42949672960.000000
487295:1:RUNNING:1331037343:87300:H:node-a36.data.legion.ucl.ac.uk:wex:4.000000
487295:1:RUNNING:1331037343:87300:H:node-a36.data.legion.ucl.ac.uk:memory:4294967296.000000
487295:1:RUNNING:1331037343:87300:H:node-a36.data.legion.ucl.ac.uk:threads:4.000000
487295:1:RUNNING:1331037343:87300:H:node-a36.data.legion.ucl.ac.uk:slots:4.000000
487295:1:RUNNING:1331037343:87300:Q:[email protected]:slots:4.000000
487295:1:RUNNING:1331037343:87300:Q:[email protected]:xex:4.000000
487295:1:RUNNING:1331037343:87300:H:node-a36.data.legion.ucl.ac.uk:tmpfs:42949672960.000000
487295:1:RUNNING:1331037343:87300:H:node-a36.data.legion.ucl.ac.uk:wex:4.000000
487295:1:RUNNING:1331037343:87300:H:node-a36.data.legion.ucl.ac.uk:memory:4294967296.000000
487295:1:RUNNING:1331037343:87300:H:node-a36.data.legion.ucl.ac.uk:threads:4.000000
487295:1:RUNNING:1331037343:87300:H:node-a36.data.legion.ucl.ac.uk:slots:4.000000
487295:1:RUNNING:1331037343:87300:G:global:penalty:10800.000000
487295:1:RUNNING:1331037343:87300:P:qlc-1:slots:8.000000
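
For anyone wanting to check their own schedule file for the same symptom,
here is a rough sketch in Python.  It assumes each record has the nine
colon-separated fields seen above (job, task, state, start time, duration,
level, object, resource, amount), which is my reading of the monitoring
output rather than something checked against the source.  The function
names host_reservations and repeated are made up for illustration.  It
adds up the host-level (H) records per job and reports any resource that
was booked more than once on the same host.

```python
from collections import defaultdict


def host_reservations(lines):
    """Sum host-level ('H') records per (job, task, host, resource).

    Returns {(job, task, host, resource): (record_count, total_amount)}.
    Assumes nine colon-separated fields per record (an assumption based
    on the excerpt above); anything else is skipped.
    """
    totals = defaultdict(lambda: [0, 0.0])
    for line in lines:
        fields = line.strip().split(":")
        if len(fields) != 9 or fields[5] != "H":
            continue  # not a host-level record
        job, task, _state, _start, _duration, _level, host, resource, amount = fields
        entry = totals[(job, task, host, resource)]
        entry[0] += 1
        entry[1] += float(amount)
    return {key: (count, total) for key, (count, total) in totals.items()}


def repeated(reservations):
    """Keep only the resources booked more than once on the same host."""
    return {key: value for key, value in reservations.items() if value[0] > 1}


# Three records copied from the excerpt above.
sample = [
    "487297:1:RUNNING:1331037343:87300:H:node-a40.data.legion.ucl.ac.uk:slots:4.000000",
    "487297:1:RUNNING:1331037343:87300:H:node-a40.data.legion.ucl.ac.uk:slots:4.000000",
    "487297:1:RUNNING:1331037343:87300:G:global:penalty:10800.000000",
]

for key, (count, total) in sorted(repeated(host_reservations(sample)).items()):
    print(key, count, total)
```

On the sample this reports two slots records totalling 8 for job 487297
on node-a40.  If the host consumable is 4, that matches the hc:slots=-4
shown by qstat -f.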

_______________________________________________
users mailing list
[email protected]
https://gridengine.org/mailman/listinfo/users
