I remember Reuti proposed a solution using an RQS (resource quota set):

 {
   name         noverload
   description  Make sure host will not take more than 1 process per processor
   enabled      TRUE
   limit        hosts {*} to slots=$num_proc
}
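
As I understand the rule, it caps the slots used on each host, summed over all queue instances on that host, at the host's processor count. Here is a minimal Python sketch of that arithmetic. It is only a model, not how Grid Engine implements quotas; the function names and the queue labels `queue_a` and `queue_b` are made up for illustration.

```python
def host_slots_free(num_proc, used_per_queue):
    """Slots left on a host when usage is summed over all its queue instances."""
    return num_proc - sum(used_per_queue.values())


def admits(num_proc, used_per_queue, request):
    """True if a request for `request` slots fits under the per-host cap."""
    return request <= host_slots_free(num_proc, used_per_queue)


# Hypothetical 4-processor host with one queue instance already full:
# a further 4-slot request in a second queue instance should be refused.
print(admits(4, {"queue_a": 4}, 4))

# The oversubscribed state from the quoted report: 4 slots used in each of
# two queue instances on a 4-processor host leaves a negative balance.
print(host_slots_free(4, {"queue_a": 4, "queue_b": 4}))
```

The point of the per-host sum is that no single queue instance is over its own slot count, so only a host-wide limit catches the overload.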

Regards,
On Wed, Mar 7, 2012 at 11:00 AM, William Hay <[email protected]> wrote:

> On 7 March 2012 09:47, William Hay <[email protected]> wrote:
> > We have multiple queue instances on each node each with slots equal to
> > the number of cpus.  To prevent oversubscription I added a slots
> > consumable to each host restricting it to a number of slots equal to
> > the cpus on the node.
> > This has worked up to now but this morning there are a couple of jobs
> > that have managed to overconsume host slots using all the slots in two
> > queues or twice the number defined for the node as a whole.  Using
> > qstat -f I see the following:
> >
> >  hc:slots=-4
> >
> > Checking $SGE_ROOT/$SGE_CELL/spool/exec_hosts/<nodename> shows a
> > modification time prior to the job starting and the right number of
> > slots configured (as does the output of qconf -se hostname), so it's
> > not a case of the value of the host consumable being reduced after
> > job start.  There are no other jobs running on these nodes.
> >
> > Any idea what could cause this?  Or how to prevent it in future?
>
> A bit of extra information: the schedule file contains the following
> lines adjacent to each other (i.e. in the same scheduling cycle).
> Note how each job reserves resources twice on the same node (including
> wex, which is an exclusive resource).
>
> 487297:1:RUNNING:1331037343:87300:Q:[email protected]:slots:4.000000
> 487297:1:RUNNING:1331037343:87300:Q:[email protected]:xex:4.000000
> 487297:1:RUNNING:1331037343:87300:H:node-a40.data.legion.ucl.ac.uk:tmpfs:42949672960.000000
> 487297:1:RUNNING:1331037343:87300:H:node-a40.data.legion.ucl.ac.uk:wex:4.000000
> 487297:1:RUNNING:1331037343:87300:H:node-a40.data.legion.ucl.ac.uk:memory:4294967296.000000
> 487297:1:RUNNING:1331037343:87300:H:node-a40.data.legion.ucl.ac.uk:threads:4.000000
> 487297:1:RUNNING:1331037343:87300:H:node-a40.data.legion.ucl.ac.uk:slots:4.000000
> 487297:1:RUNNING:1331037343:87300:Q:[email protected]:slots:4.000000
> 487297:1:RUNNING:1331037343:87300:Q:[email protected]:xex:4.000000
> 487297:1:RUNNING:1331037343:87300:H:node-a40.data.legion.ucl.ac.uk:tmpfs:42949672960.000000
> 487297:1:RUNNING:1331037343:87300:H:node-a40.data.legion.ucl.ac.uk:wex:4.000000
> 487297:1:RUNNING:1331037343:87300:H:node-a40.data.legion.ucl.ac.uk:memory:4294967296.000000
> 487297:1:RUNNING:1331037343:87300:H:node-a40.data.legion.ucl.ac.uk:threads:4.000000
> 487297:1:RUNNING:1331037343:87300:H:node-a40.data.legion.ucl.ac.uk:slots:4.000000
> 487297:1:RUNNING:1331037343:87300:G:global:penalty:10800.000000
> 487297:1:RUNNING:1331037343:87300:P:qlc-1:slots:8.000000
> 487295:1:RUNNING:1331037343:87300:Q:[email protected]:slots:4.000000
> 487295:1:RUNNING:1331037343:87300:Q:[email protected]:xex:4.000000
> 487295:1:RUNNING:1331037343:87300:H:node-a36.data.legion.ucl.ac.uk:tmpfs:42949672960.000000
> 487295:1:RUNNING:1331037343:87300:H:node-a36.data.legion.ucl.ac.uk:wex:4.000000
> 487295:1:RUNNING:1331037343:87300:H:node-a36.data.legion.ucl.ac.uk:memory:4294967296.000000
> 487295:1:RUNNING:1331037343:87300:H:node-a36.data.legion.ucl.ac.uk:threads:4.000000
> 487295:1:RUNNING:1331037343:87300:H:node-a36.data.legion.ucl.ac.uk:slots:4.000000
> 487295:1:RUNNING:1331037343:87300:Q:[email protected]:slots:4.000000
> 487295:1:RUNNING:1331037343:87300:Q:[email protected]:xex:4.000000
> 487295:1:RUNNING:1331037343:87300:H:node-a36.data.legion.ucl.ac.uk:tmpfs:42949672960.000000
> 487295:1:RUNNING:1331037343:87300:H:node-a36.data.legion.ucl.ac.uk:wex:4.000000
> 487295:1:RUNNING:1331037343:87300:H:node-a36.data.legion.ucl.ac.uk:memory:4294967296.000000
> 487295:1:RUNNING:1331037343:87300:H:node-a36.data.legion.ucl.ac.uk:threads:4.000000
> 487295:1:RUNNING:1331037343:87300:H:node-a36.data.legion.ucl.ac.uk:slots:4.000000
> 487295:1:RUNNING:1331037343:87300:G:global:penalty:10800.000000
> 487295:1:RUNNING:1331037343:87300:P:qlc-1:slots:8.000000
>
> _______________________________________________
> users mailing list
> [email protected]
> https://gridengine.org/mailman/listinfo/users
>
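
The repeated host-level bookings in the schedule lines quoted above can also be spotted mechanically. Below is a rough Python sketch, assuming each record is nine colon-separated fields in the order job, task, state, start, duration, level, object, resource, amount, which is how I read the lines above. The function names and the sample records (job 1001, node-x.example.org, pe-x) are invented for illustration, and it expects raw lines from the schedule file rather than the ">"-quoted mail text.

```python
from collections import defaultdict


def parse_schedule_line(line):
    """Return (job, task, level, obj, resource, amount), or None if not a record."""
    parts = line.strip().split(":")
    if len(parts) != 9 or not parts[0].isdigit():
        return None  # separator line or something we do not recognise
    job, task, _state, _start, _duration, level, obj, resource, amount = parts
    try:
        return job, task, level, obj, resource, float(amount)
    except ValueError:
        return None


def repeated_bookings(lines):
    """Map (job, task, level, object, resource) to its amounts when booked more than once."""
    seen = defaultdict(list)
    for line in lines:
        rec = parse_schedule_line(line)
        if rec is not None:
            seen[rec[:5]].append(rec[5])
    return {key: amounts for key, amounts in seen.items() if len(amounts) > 1}


# Invented sample: one job booking host slots twice, plus a PE line and a separator.
SAMPLE = [
    "1001:1:RUNNING:1331037343:87300:H:node-x.example.org:slots:4.000000",
    "1001:1:RUNNING:1331037343:87300:H:node-x.example.org:slots:4.000000",
    "1001:1:RUNNING:1331037343:87300:P:pe-x:slots:8.000000",
    "::::::::",
]

for key, amounts in repeated_bookings(SAMPLE).items():
    print(key, "booked", len(amounts), "times, total", sum(amounts))
```

A repeated booking is not necessarily a bug in itself, since a parallel job spread over two queue instances on one host may legitimately book the host twice. What matters is whether the total exceeds the host's configured capacity.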
