On 07.03.2012 at 11:18, William Hay wrote:

> On 7 March 2012 10:11, Mazouzi <[email protected]> wrote:
>> I remember Reuti proposed a solution using RQS:
>>
>> {
>>    name         noverload
>>    description  Make sure host will not take more than 1 process per processor
>>    enabled      TRUE
>>    limit        hosts {*} to slots=$num_proc
>> }
>>
> While I could switch to using RQS rather than host consumables to
> control slot usage I'd rather understand why a solution that AFAICT
> worked perfectly up to now has stopped working for these two
> jobs/hosts.  Without that understanding I have no guarantee that the
> RQS solution won't have the same issue.  Is there any reason to
> believe the RQS solution will be more reliable than the host
> consumable solution (which has worked pretty well up to now)?

Yes, it could be set up in an RQS too; it's mainly a matter of taste. Attaching the limit to each node keeps the output of `qquota` shorter, so that it shows only the real limits, but it must be done for each machine by hand or by script.

To the real issue: there was no change and it happened out of the blue? Do you request a load value in addition during submission?

https://arc.liv.ac.uk/trac/SGE/ticket/1316

-- Reuti

>> Regards,
>> On Wed, Mar 7, 2012 at 11:00 AM, William Hay <[email protected]> wrote:
>>>
>>> On 7 March 2012 09:47, William Hay <[email protected]> wrote:
>>>> We have multiple queue instances on each node each with slots equal to
>>>> the number of cpus.  To prevent oversubscription I added a slots
>>>> consumable to each host restricting it to a number of slots equal to
>>>> the cpus on the node.
>>>> This has worked up to now but this morning there are a couple of jobs
>>>> that have managed to overconsume host slots using all the slots in two
>>>> queues or twice the number defined for the node as a whole.  Using
>>>> qstat -f I see the following:
>>>>
>>>> hc:slots=-4
>>>>
>>>> Checking $SGE_ROOT/$SGE_CELL/spool/exec_hosts/<nodename> shows a
>>>> modification time prior to the job starting and the right number of
>>>> slots configured (as does the output of qconf -se hostname) so it's
>>>> not a case of the value of the host consumable being reduced after
>>>> job start.  There are no other jobs running on these nodes.
>>>>
>>>> Any idea what could cause this?  Or how to prevent it in future?
>>>
>>> A bit of extra information: the schedule file contains the following
>>> lines adjacent to each other (i.e. same scheduling cycle).
>>> Note how each job reserves resources twice on the same node (including
>>> wex which is an exclusive resource).
>>>
>>> 487297:1:RUNNING:1331037343:87300:Q:[email protected]:slots:4.000000
>>> 487297:1:RUNNING:1331037343:87300:Q:[email protected]:xex:4.000000
>>> 487297:1:RUNNING:1331037343:87300:H:node-a40.data.legion.ucl.ac.uk:tmpfs:42949672960.000000
>>> 487297:1:RUNNING:1331037343:87300:H:node-a40.data.legion.ucl.ac.uk:wex:4.000000
>>> 487297:1:RUNNING:1331037343:87300:H:node-a40.data.legion.ucl.ac.uk:memory:4294967296.000000
>>> 487297:1:RUNNING:1331037343:87300:H:node-a40.data.legion.ucl.ac.uk:threads:4.000000
>>> 487297:1:RUNNING:1331037343:87300:H:node-a40.data.legion.ucl.ac.uk:slots:4.000000
>>> 487297:1:RUNNING:1331037343:87300:Q:[email protected]:slots:4.000000
>>> 487297:1:RUNNING:1331037343:87300:Q:[email protected]:xex:4.000000
>>> 487297:1:RUNNING:1331037343:87300:H:node-a40.data.legion.ucl.ac.uk:tmpfs:42949672960.000000
>>> 487297:1:RUNNING:1331037343:87300:H:node-a40.data.legion.ucl.ac.uk:wex:4.000000
>>> 487297:1:RUNNING:1331037343:87300:H:node-a40.data.legion.ucl.ac.uk:memory:4294967296.000000
>>> 487297:1:RUNNING:1331037343:87300:H:node-a40.data.legion.ucl.ac.uk:threads:4.000000
>>> 487297:1:RUNNING:1331037343:87300:H:node-a40.data.legion.ucl.ac.uk:slots:4.000000
>>> 487297:1:RUNNING:1331037343:87300:G:global:penalty:10800.000000
>>> 487297:1:RUNNING:1331037343:87300:P:qlc-1:slots:8.000000
>>>
>>> 487295:1:RUNNING:1331037343:87300:Q:[email protected]:slots:4.000000
>>> 487295:1:RUNNING:1331037343:87300:Q:[email protected]:xex:4.000000
>>> 487295:1:RUNNING:1331037343:87300:H:node-a36.data.legion.ucl.ac.uk:tmpfs:42949672960.000000
>>> 487295:1:RUNNING:1331037343:87300:H:node-a36.data.legion.ucl.ac.uk:wex:4.000000
>>> 487295:1:RUNNING:1331037343:87300:H:node-a36.data.legion.ucl.ac.uk:memory:4294967296.000000
>>> 487295:1:RUNNING:1331037343:87300:H:node-a36.data.legion.ucl.ac.uk:threads:4.000000
>>> 487295:1:RUNNING:1331037343:87300:H:node-a36.data.legion.ucl.ac.uk:slots:4.000000
>>> 487295:1:RUNNING:1331037343:87300:Q:[email protected]:slots:4.000000
>>> 487295:1:RUNNING:1331037343:87300:Q:[email protected]:xex:4.000000
>>> 487295:1:RUNNING:1331037343:87300:H:node-a36.data.legion.ucl.ac.uk:tmpfs:42949672960.000000
>>> 487295:1:RUNNING:1331037343:87300:H:node-a36.data.legion.ucl.ac.uk:wex:4.000000
>>> 487295:1:RUNNING:1331037343:87300:H:node-a36.data.legion.ucl.ac.uk:memory:4294967296.000000
>>> 487295:1:RUNNING:1331037343:87300:H:node-a36.data.legion.ucl.ac.uk:threads:4.000000
>>> 487295:1:RUNNING:1331037343:87300:H:node-a36.data.legion.ucl.ac.uk:slots:4.000000
>>> 487295:1:RUNNING:1331037343:87300:G:global:penalty:10800.000000
>>> 487295:1:RUNNING:1331037343:87300:P:qlc-1:slots:8.000000
>>>
>>> _______________________________________________
>>> users mailing list
>>> [email protected]
>>> https://gridengine.org/mailman/listinfo/users
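
For anyone who wants to check their own schedule file for the same pattern, here is a minimal Python sketch that sums the host-level bookings per job and flags any resource booked more than once in the records it is given. It assumes the nine colon-separated fields are job:task:state:start:duration:level:object:resource:amount, which is my reading of the records quoted above. It also assumes the host is configured with slots=4; that figure is inferred from hc:slots=-4 and is not stated anywhere in the thread. The function names are made up for this example and are not part of Grid Engine.

```python
from collections import defaultdict

# Records copied from the thread (job 487297, one scheduling cycle).
SAMPLE = """\
487297:1:RUNNING:1331037343:87300:H:node-a40.data.legion.ucl.ac.uk:wex:4.000000
487297:1:RUNNING:1331037343:87300:H:node-a40.data.legion.ucl.ac.uk:slots:4.000000
487297:1:RUNNING:1331037343:87300:H:node-a40.data.legion.ucl.ac.uk:wex:4.000000
487297:1:RUNNING:1331037343:87300:H:node-a40.data.legion.ucl.ac.uk:slots:4.000000
487297:1:RUNNING:1331037343:87300:G:global:penalty:10800.000000
487297:1:RUNNING:1331037343:87300:P:qlc-1:slots:8.000000
"""

FIELDS = ("job", "task", "state", "start", "duration",
          "level", "object", "resource", "amount")

# Assumption: configured host slots, inferred from hc:slots=-4 with 8 booked.
HOST_SLOTS = 4


def parse_record(line):
    """Return one record as a dict, or None if the line does not fit the layout."""
    # Drop surrounding whitespace and any leading mail quoting such as ">>> ".
    parts = line.strip().lstrip("> ").split(":")
    if len(parts) != len(FIELDS):
        return None
    rec = dict(zip(FIELDS, parts))
    try:
        rec["amount"] = float(rec["amount"])
    except ValueError:
        return None
    return rec


def sum_bookings(text, level="H"):
    """Map (job, task, object, resource) to [record count, summed amount]."""
    totals = defaultdict(lambda: [0, 0.0])
    for line in text.splitlines():
        rec = parse_record(line)
        if rec is None or rec["level"] != level:
            continue
        entry = totals[(rec["job"], rec["task"], rec["object"], rec["resource"])]
        entry[0] += 1
        entry[1] += rec["amount"]
    return dict(totals)


def repeated(totals):
    """Keep only the bookings that appear more than once."""
    return {key: val for key, val in totals.items() if val[0] > 1}


for (job, task, host, res), (count, amount) in sorted(repeated(sum_bookings(SAMPLE)).items()):
    print(f"job {job}.{task} books {res} on {host} {count} times, total {amount:g}")
    if res == "slots":
        print(f"  remaining with {HOST_SLOTS} configured: {HOST_SLOTS - amount:g}")
```

Run against the records above, the slots booking on node-a40 adds up to 8 against the assumed 4 configured, which matches both the hc:slots=-4 reported by qstat -f and the P:qlc-1:slots:8 record. Note that the script only groups by job, so it should be fed the records of one scheduling cycle at a time; it does not try to split a full schedule file into cycles.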
