Re: [gridengine users] Job overconsuming slots

Reuti Wed, 07 Mar 2012 03:37:44 -0800

On 07.03.2012 at 11:18, William Hay wrote:

> On 7 March 2012 10:11, Mazouzi <[email protected]> wrote:
>> I remember Reuti proposed a solution using RQS:
>>
>> {
>>    name         noverload
>>    description  Make sure host will not take more than 1 process per processor
>>    enabled      TRUE
>>    limit        hosts {*} to slots=$num_proc
>> }
>> 
> While I could switch to using RQS rather than host consumables to
> control slot usage, I'd rather understand why a solution that, AFAICT,
> worked perfectly up to now has stopped working for these two
> jobs/hosts.  Without that understanding I have no guarantee that the
> RQS solution won't have the same issue.  Is there any reason to
> believe the RQS solution will be more reliable than the host
> consumable solution (which has worked pretty well up to now)?


Yes, it could be set up in an RQS too; it's mainly a matter of taste. Attaching
the limit to each node keeps the output of `qquota` shorter, so that it shows
only the real limits, but it must be set for each machine by hand or by script.
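
For the "by script" route, a minimal sketch in Python, not a tested recipe. It only builds the `qconf -mattr exechost complex_values slots=N <host>` calls and prints them; nothing is executed. The host names and the slot count of 4 are assumptions taken from the hosts mentioned further down in this thread, and the helper name `build_qconf_commands` is made up for illustration. Check qconf(1) for your Grid Engine version before running any of the printed commands.

```python
def build_qconf_commands(host_slots):
    """Return one qconf argument list per host, setting its slots consumable."""
    commands = []
    for host, slots in sorted(host_slots.items()):
        if slots < 1:
            raise ValueError(f"slot count for {host} must be positive, got {slots}")
        # Assumed form: qconf -mattr exechost complex_values slots=N <host>
        commands.append(
            ["qconf", "-mattr", "exechost", "complex_values", f"slots={slots}", host]
        )
    return commands


# Hypothetical inventory: host name -> number of CPUs on that host.
HOSTS = {
    "node-a36.data.legion.ucl.ac.uk": 4,
    "node-a40.data.legion.ucl.ac.uk": 4,
}

for command in build_qconf_commands(HOSTS):
    # Print instead of executing, so the commands can be reviewed first.
    print(" ".join(command))
```

Printing rather than executing keeps the sketch safe to try on a machine without Grid Engine installed.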

To the real issue:

So nothing was changed, and it happened out of the blue?

Do you also request a load value at submission time?

https://arc.liv.ac.uk/trac/SGE/ticket/1316

-- Reuti


>> Regards,
>> On Wed, Mar 7, 2012 at 11:00 AM, William Hay <[email protected]> wrote:
>>> 
>>> On 7 March 2012 09:47, William Hay <[email protected]> wrote:
>>>> We have multiple queue instances on each node, each with slots equal to
>>>> the number of CPUs.  To prevent oversubscription I added a slots
>>>> consumable to each host, restricting it to a number of slots equal to
>>>> the CPUs on the node.
>>>> This has worked up to now, but this morning there are a couple of jobs
>>>> that have managed to overconsume host slots, using all the slots in two
>>>> queues, i.e. twice the number defined for the node as a whole.  Using
>>>> qstat -F I see the following:
>>>> 
>>>>  hc:slots=-4
>>>> 
>>>> Checking $SGE_ROOT/$SGE_CELL/spool/exec_hosts/<nodename> shows a
>>>> modification time prior to the job starting and the right number of
>>>> slots configured (as does the output of qconf -se hostname) so it's
>>>> not a case of the value of the host consumable being reduced after
>>>> job start.  There are no other jobs running on these nodes.
>>>> 
>>>> Any idea what could cause this?  Or how to prevent it in future?
>>> 
>>> A bit of extra information: the schedule file contains the following
>>> lines adjacent to each other (i.e. in the same scheduling cycle).
>>> Note how each job reserves resources twice on the same node (including
>>> wex, which is an exclusive resource).
>>> 
>>> 
>>> 487297:1:RUNNING:1331037343:87300:Q:[email protected]:slots:4.000000
>>> 487297:1:RUNNING:1331037343:87300:Q:[email protected]:xex:4.000000
>>> 487297:1:RUNNING:1331037343:87300:H:node-a40.data.legion.ucl.ac.uk:tmpfs:42949672960.000000
>>> 487297:1:RUNNING:1331037343:87300:H:node-a40.data.legion.ucl.ac.uk:wex:4.000000
>>> 487297:1:RUNNING:1331037343:87300:H:node-a40.data.legion.ucl.ac.uk:memory:4294967296.000000
>>> 487297:1:RUNNING:1331037343:87300:H:node-a40.data.legion.ucl.ac.uk:threads:4.000000
>>> 487297:1:RUNNING:1331037343:87300:H:node-a40.data.legion.ucl.ac.uk:slots:4.000000
>>> 487297:1:RUNNING:1331037343:87300:Q:[email protected]:slots:4.000000
>>> 487297:1:RUNNING:1331037343:87300:Q:[email protected]:xex:4.000000
>>> 487297:1:RUNNING:1331037343:87300:H:node-a40.data.legion.ucl.ac.uk:tmpfs:42949672960.000000
>>> 487297:1:RUNNING:1331037343:87300:H:node-a40.data.legion.ucl.ac.uk:wex:4.000000
>>> 487297:1:RUNNING:1331037343:87300:H:node-a40.data.legion.ucl.ac.uk:memory:4294967296.000000
>>> 487297:1:RUNNING:1331037343:87300:H:node-a40.data.legion.ucl.ac.uk:threads:4.000000
>>> 487297:1:RUNNING:1331037343:87300:H:node-a40.data.legion.ucl.ac.uk:slots:4.000000
>>> 487297:1:RUNNING:1331037343:87300:G:global:penalty:10800.000000
>>> 487297:1:RUNNING:1331037343:87300:P:qlc-1:slots:8.000000
>>>
>>> 487295:1:RUNNING:1331037343:87300:Q:[email protected]:slots:4.000000
>>> 487295:1:RUNNING:1331037343:87300:Q:[email protected]:xex:4.000000
>>> 487295:1:RUNNING:1331037343:87300:H:node-a36.data.legion.ucl.ac.uk:tmpfs:42949672960.000000
>>> 487295:1:RUNNING:1331037343:87300:H:node-a36.data.legion.ucl.ac.uk:wex:4.000000
>>> 487295:1:RUNNING:1331037343:87300:H:node-a36.data.legion.ucl.ac.uk:memory:4294967296.000000
>>> 487295:1:RUNNING:1331037343:87300:H:node-a36.data.legion.ucl.ac.uk:threads:4.000000
>>> 487295:1:RUNNING:1331037343:87300:H:node-a36.data.legion.ucl.ac.uk:slots:4.000000
>>> 487295:1:RUNNING:1331037343:87300:Q:[email protected]:slots:4.000000
>>> 487295:1:RUNNING:1331037343:87300:Q:[email protected]:xex:4.000000
>>> 487295:1:RUNNING:1331037343:87300:H:node-a36.data.legion.ucl.ac.uk:tmpfs:42949672960.000000
>>> 487295:1:RUNNING:1331037343:87300:H:node-a36.data.legion.ucl.ac.uk:wex:4.000000
>>> 487295:1:RUNNING:1331037343:87300:H:node-a36.data.legion.ucl.ac.uk:memory:4294967296.000000
>>> 487295:1:RUNNING:1331037343:87300:H:node-a36.data.legion.ucl.ac.uk:threads:4.000000
>>> 487295:1:RUNNING:1331037343:87300:H:node-a36.data.legion.ucl.ac.uk:slots:4.000000
>>> 487295:1:RUNNING:1331037343:87300:G:global:penalty:10800.000000
>>> 487295:1:RUNNING:1331037343:87300:P:qlc-1:slots:8.000000
>>> 
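To check a schedule file for this kind of double booking mechanically, here is a rough Python sketch. It assumes each record has the colon-separated layout visible in the lines above (job, task, state, start time, duration, level, object, resource, amount). The function names and the capacity of 4 slots per host are illustrative assumptions; the capacity is inferred from the reported hc:slots=-4 and is not stated anywhere in the thread.

```python
from collections import Counter, defaultdict


def parse_schedule_line(line):
    """Parse one schedule record; return None for lines that are not records."""
    text = line.lstrip("> ").strip()  # drop mail quoting such as ">>> "
    head = text.split(":", 6)
    if len(head) != 7:
        return None
    # Split the tail from the right so only resource and amount are peeled off.
    tail = head[6].rsplit(":", 2)
    if len(tail) != 3:
        return None
    job, task, state, start, duration, level, _ = head
    obj, resource, amount = tail
    try:
        amount = float(amount)
    except ValueError:
        return None
    return {"job": job, "task": task, "level": level,
            "object": obj, "resource": resource, "amount": amount}


def find_duplicates(lines):
    """Return records booked more than once for the same job, object and resource."""
    counts = Counter()
    for line in lines:
        rec = parse_schedule_line(line)
        if rec is not None:
            counts[(rec["job"], rec["task"], rec["level"],
                    rec["object"], rec["resource"])] += 1
    return {key: n for key, n in counts.items() if n > 1}


def host_slots_debited(lines):
    """Sum the host-level (H) slots amounts per (job, host)."""
    totals = defaultdict(float)
    for line in lines:
        rec = parse_schedule_line(line)
        if rec and rec["level"] == "H" and rec["resource"] == "slots":
            totals[(rec["job"], rec["object"])] += rec["amount"]
    return dict(totals)


# A few of the host-level records quoted above, plus the parallel environment line.
SAMPLE = """\
487297:1:RUNNING:1331037343:87300:H:node-a40.data.legion.ucl.ac.uk:wex:4.000000
487297:1:RUNNING:1331037343:87300:H:node-a40.data.legion.ucl.ac.uk:slots:4.000000
487297:1:RUNNING:1331037343:87300:H:node-a40.data.legion.ucl.ac.uk:wex:4.000000
487297:1:RUNNING:1331037343:87300:H:node-a40.data.legion.ucl.ac.uk:slots:4.000000
487297:1:RUNNING:1331037343:87300:P:qlc-1:slots:8.000000
"""

ASSUMED_CAPACITY = 4.0  # assumption: slots configured on the host, inferred from hc:slots=-4

for key, n in find_duplicates(SAMPLE.splitlines()).items():
    print("booked", n, "times:", key)
for (job, host), used in host_slots_debited(SAMPLE.splitlines()).items():
    print(job, host, "remaining slots:", ASSUMED_CAPACITY - used)
```

Under that assumed capacity, two host-level debits of 4 slots each would leave 4 - 8 = -4, which matches the hc:slots=-4 reported earlier.
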
>>> _______________________________________________
>>> users mailing list
>>> [email protected]
>>> https://gridengine.org/mailman/listinfo/users