On 7 March 2012 10:11, Mazouzi <[email protected]> wrote:
> I remember Reuti proposed a solution using RQS:
>
> {
>    name         noverload
>    description  Make sure host will not take more than 1 process per processor
>    enabled      TRUE
>    limit        hosts {*} to slots=$num_proc
> }
>

While I could switch to using RQS rather than host consumables to
control slot usage I'd rather understand why a solution that AFAICT
worked perfectly up to now has stopped working for these two
jobs/hosts. Without that understanding I have no guarantee that the
RQS solution won't have the same issue. Is there any reason to believe
the RQS solution will be more reliable than the host consumable
solution (which has worked pretty well up to now)?

> Regards,
> On Wed, Mar 7, 2012 at 11:00 AM, William Hay <[email protected]> wrote:
>>
>> On 7 March 2012 09:47, William Hay <[email protected]> wrote:
>> > We have multiple queue instances on each node each with slots equal to
>> > the number of cpus. To prevent oversubscription I added a slots
>> > consumable to each host restricting it to a number of slots equal to
>> > the cpus on the node.
>> > This has worked up to now but this morning there are a couple of jobs
>> > that have managed to overconsume host slots using all the slots in two
>> > queues or twice the number defined for the node as a whole. Using
>> > qstat -f I see the following:
>> >
>> > hc:slots=-4
>> >
>> > Checking $SGE_ROOT/$SGE_CELL/spool/exec_hosts/<nodename> shows a
>> > modification time prior to the job starting and the right number of
>> > slots configured (as does the output of qconf -se hostname) so it's
>> > not a case of the value of the host consumable being reduced after
>> > job start. There are no other jobs running on these nodes.
>> >
>> > Any idea what could cause this? Or how to prevent it in future?
>>
>> A bit of extra information the schedule file contains the following
>> lines adjacent to each other (ie same scheduling cycle):
>> Note how each job reserves resources twice on the same node (including
>> wex which is an exclusive resource).
>>
>> 487297:1:RUNNING:1331037343:87300:Q:[email protected]:slots:4.000000
>> 487297:1:RUNNING:1331037343:87300:Q:[email protected]:xex:4.000000
>> 487297:1:RUNNING:1331037343:87300:H:node-a40.data.legion.ucl.ac.uk:tmpfs:42949672960.000000
>> 487297:1:RUNNING:1331037343:87300:H:node-a40.data.legion.ucl.ac.uk:wex:4.000000
>> 487297:1:RUNNING:1331037343:87300:H:node-a40.data.legion.ucl.ac.uk:memory:4294967296.000000
>> 487297:1:RUNNING:1331037343:87300:H:node-a40.data.legion.ucl.ac.uk:threads:4.000000
>> 487297:1:RUNNING:1331037343:87300:H:node-a40.data.legion.ucl.ac.uk:slots:4.000000
>> 487297:1:RUNNING:1331037343:87300:Q:[email protected]:slots:4.000000
>> 487297:1:RUNNING:1331037343:87300:Q:[email protected]:xex:4.000000
>> 487297:1:RUNNING:1331037343:87300:H:node-a40.data.legion.ucl.ac.uk:tmpfs:42949672960.000000
>> 487297:1:RUNNING:1331037343:87300:H:node-a40.data.legion.ucl.ac.uk:wex:4.000000
>> 487297:1:RUNNING:1331037343:87300:H:node-a40.data.legion.ucl.ac.uk:memory:4294967296.000000
>> 487297:1:RUNNING:1331037343:87300:H:node-a40.data.legion.ucl.ac.uk:threads:4.000000
>> 487297:1:RUNNING:1331037343:87300:H:node-a40.data.legion.ucl.ac.uk:slots:4.000000
>> 487297:1:RUNNING:1331037343:87300:G:global:penalty:10800.000000
>> 487297:1:RUNNING:1331037343:87300:P:qlc-1:slots:8.000000
>> 487295:1:RUNNING:1331037343:87300:Q:[email protected]:slots:4.000000
>> 487295:1:RUNNING:1331037343:87300:Q:[email protected]:xex:4.000000
>> 487295:1:RUNNING:1331037343:87300:H:node-a36.data.legion.ucl.ac.uk:tmpfs:42949672960.000000
>> 487295:1:RUNNING:1331037343:87300:H:node-a36.data.legion.ucl.ac.uk:wex:4.000000
>> 487295:1:RUNNING:1331037343:87300:H:node-a36.data.legion.ucl.ac.uk:memory:4294967296.000000
>> 487295:1:RUNNING:1331037343:87300:H:node-a36.data.legion.ucl.ac.uk:threads:4.000000
>> 487295:1:RUNNING:1331037343:87300:H:node-a36.data.legion.ucl.ac.uk:slots:4.000000
>> 487295:1:RUNNING:1331037343:87300:Q:[email protected]:slots:4.000000
>> 487295:1:RUNNING:1331037343:87300:Q:[email protected]:xex:4.000000
>> 487295:1:RUNNING:1331037343:87300:H:node-a36.data.legion.ucl.ac.uk:tmpfs:42949672960.000000
>> 487295:1:RUNNING:1331037343:87300:H:node-a36.data.legion.ucl.ac.uk:wex:4.000000
>> 487295:1:RUNNING:1331037343:87300:H:node-a36.data.legion.ucl.ac.uk:memory:4294967296.000000
>> 487295:1:RUNNING:1331037343:87300:H:node-a36.data.legion.ucl.ac.uk:threads:4.000000
>> 487295:1:RUNNING:1331037343:87300:H:node-a36.data.legion.ucl.ac.uk:slots:4.000000
>> 487295:1:RUNNING:1331037343:87300:G:global:penalty:10800.000000
>> 487295:1:RUNNING:1331037343:87300:P:qlc-1:slots:8.000000
>>
>> _______________________________________________
>> users mailing list
>> [email protected]
>> https://gridengine.org/mailman/listinfo/users
