On 7 March 2012 09:47, William Hay <[email protected]> wrote:
> We have multiple queue instances on each node each with slots equal to
> the number of cpus. To prevent oversubscription I added a slots
> consumable to each host restricting it to a number of slots equal to
> the cpus on the node.
> This has worked up to now but this morning there are a couple of jobs
> that have managed to overconsume host slots using all the slots in two
> queues or twice the number defined for the node as a whole. Using
> qstat -f I see the following:
>
> hc:slots=-4
>
> Checking $SGE_ROOT/$SGE_CELL/spool/exec_hosts/<nodename> shows a
> modification time prior to the job starting and the right number of
> slots configured (as does the output of qconf -se hostname) so it's
> not a case of the value of the host consumable being reduced after
> job start. There are no other jobs running on these nodes.
>
> Any idea what could cause this? Or how to prevent it in future?

A bit of extra information: the schedule file contains the following
lines adjacent to each other (i.e. in the same scheduling cycle). Note
how each job reserves resources twice on the same node (including wex,
which is an exclusive resource).

487297:1:RUNNING:1331037343:87300:Q:[email protected]:slots:4.000000
487297:1:RUNNING:1331037343:87300:Q:[email protected]:xex:4.000000
487297:1:RUNNING:1331037343:87300:H:node-a40.data.legion.ucl.ac.uk:tmpfs:42949672960.000000
487297:1:RUNNING:1331037343:87300:H:node-a40.data.legion.ucl.ac.uk:wex:4.000000
487297:1:RUNNING:1331037343:87300:H:node-a40.data.legion.ucl.ac.uk:memory:4294967296.000000
487297:1:RUNNING:1331037343:87300:H:node-a40.data.legion.ucl.ac.uk:threads:4.000000
487297:1:RUNNING:1331037343:87300:H:node-a40.data.legion.ucl.ac.uk:slots:4.000000
487297:1:RUNNING:1331037343:87300:Q:[email protected]:slots:4.000000
487297:1:RUNNING:1331037343:87300:Q:[email protected]:xex:4.000000
487297:1:RUNNING:1331037343:87300:H:node-a40.data.legion.ucl.ac.uk:tmpfs:42949672960.000000
487297:1:RUNNING:1331037343:87300:H:node-a40.data.legion.ucl.ac.uk:wex:4.000000
487297:1:RUNNING:1331037343:87300:H:node-a40.data.legion.ucl.ac.uk:memory:4294967296.000000
487297:1:RUNNING:1331037343:87300:H:node-a40.data.legion.ucl.ac.uk:threads:4.000000
487297:1:RUNNING:1331037343:87300:H:node-a40.data.legion.ucl.ac.uk:slots:4.000000
487297:1:RUNNING:1331037343:87300:G:global:penalty:10800.000000
487297:1:RUNNING:1331037343:87300:P:qlc-1:slots:8.000000
487295:1:RUNNING:1331037343:87300:Q:[email protected]:slots:4.000000
487295:1:RUNNING:1331037343:87300:Q:[email protected]:xex:4.000000
487295:1:RUNNING:1331037343:87300:H:node-a36.data.legion.ucl.ac.uk:tmpfs:42949672960.000000
487295:1:RUNNING:1331037343:87300:H:node-a36.data.legion.ucl.ac.uk:wex:4.000000
487295:1:RUNNING:1331037343:87300:H:node-a36.data.legion.ucl.ac.uk:memory:4294967296.000000
487295:1:RUNNING:1331037343:87300:H:node-a36.data.legion.ucl.ac.uk:threads:4.000000
487295:1:RUNNING:1331037343:87300:H:node-a36.data.legion.ucl.ac.uk:slots:4.000000
487295:1:RUNNING:1331037343:87300:Q:[email protected]:slots:4.000000
487295:1:RUNNING:1331037343:87300:Q:[email protected]:xex:4.000000
487295:1:RUNNING:1331037343:87300:H:node-a36.data.legion.ucl.ac.uk:tmpfs:42949672960.000000
487295:1:RUNNING:1331037343:87300:H:node-a36.data.legion.ucl.ac.uk:wex:4.000000
487295:1:RUNNING:1331037343:87300:H:node-a36.data.legion.ucl.ac.uk:memory:4294967296.000000
487295:1:RUNNING:1331037343:87300:H:node-a36.data.legion.ucl.ac.uk:threads:4.000000
487295:1:RUNNING:1331037343:87300:H:node-a36.data.legion.ucl.ac.uk:slots:4.000000
487295:1:RUNNING:1331037343:87300:G:global:penalty:10800.000000
487295:1:RUNNING:1331037343:87300:P:qlc-1:slots:8.000000

_______________________________________________
users mailing list
[email protected]
https://gridengine.org/mailman/listinfo/users
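
For anyone wanting to check their own schedule file for the same symptom, here is a rough sketch that flags records booked more than once for the same job, task and resource within one scheduling cycle. It is not part of Grid Engine. It assumes the nine colon-separated fields seen in the lines above (job, task, state, start time, duration, level, object, resource, amount) and assumes that a line consisting only of colons separates scheduling cycles. The function name `duplicate_reservations` is made up for illustration.

```python
from collections import Counter


def duplicate_reservations(lines):
    """Return [(key, count), ...] for records repeated within one cycle.

    key is (job, task, state, level, object, resource).
    """
    duplicates = []
    counts = Counter()

    def flush():
        # Report everything seen more than once in the cycle just ended.
        duplicates.extend((key, n) for key, n in counts.items() if n > 1)
        counts.clear()

    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        if set(line) == {":"}:
            # Assumed cycle separator: a line made up only of colons.
            flush()
            continue
        fields = line.split(":")
        if len(fields) != 9:
            # Not a record in the assumed nine-field layout; skip it.
            continue
        job, task, state, _start, _duration, level, obj, resource, _amount = fields
        counts[(job, task, state, level, obj, resource)] += 1

    flush()  # the last cycle may not be followed by a separator
    return duplicates
```

Passing an open file handle for the schedule file (for example `duplicate_reservations(open(path))`) would list every doubled booking; on the excerpt above it should report each Q and H record for jobs 487297 and 487295 with a count of 2, while the G and P records appear only once.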
