I've got a workaround for my problem. First, to summarize for anyone with a similar issue:
My setup has slots, h_vmem and s_vmem set as consumables. Jobs that requested multiple slots plus h_vmem/s_vmem (RAM) above the default value set in the complex configuration almost always ended up stuck in the queue for an unpredictable amount of time with the message "PE offers 0 slots", even though cluster resources were available and the user had not reached any quota limits. This was happening for just one queue.

My workaround is to set the 'slots' param for the queue to just the default per-node value. I had been assigning different slot values to some of the nodes to balance usage. That's it. (There's a rough qconf sketch of the change at the bottom of this message.)

I don't know what's going on, but this works, and hopefully will keep working! Thanks for all the help and suggestions! If the above workaround sheds any light on the issue that might yield a more complete solution, please let me know.

-M

On Thu, Aug 17, 2017 at 11:50 AM, Michael Stauffer <mgsta...@gmail.com> wrote:
> On Thu, Aug 17, 2017 at 7:49 AM, Reuti <re...@staff.uni-marburg.de> wrote:
>>
>> > Am 13.08.2017 um 18:11 schrieb Michael Stauffer <mgsta...@gmail.com>:
>> >
>> > Thanks for the reply Reuti, see below
>> >
>> > On Fri, Aug 11, 2017 at 7:18 PM, Reuti <re...@staff.uni-marburg.de> wrote:
>> >
>> > What I notice below: defining h_vmem/s_vmem on a queue level means per job. Defining it on an exechost level means across all jobs. What is different between:
>> >
>> > > ---------------------------------------------------------------------------------
>> > > all.q@compute-0-13.local       BP    0/10/16        9.14     lx-amd64
>> > >         qf:h_vmem=40.000G
>> > >         qf:s_vmem=40.000G
>> > >         hc:slots=6
>> > > ---------------------------------------------------------------------------------
>> > > all.q@compute-0-14.local       BP    0/10/16        9.66     lx-amd64
>> > >         hc:h_vmem=28.890G
>> > >         hc:s_vmem=30.990G
>> > >         hc:slots=6
>> >
>> > qf = queue fixed
>> > hc = host consumable
>> >
>> > What is the definition of h_vmem/s_vmem in `qconf -sc` and their default consumptions?
>> >
>> > I thought this means that when it's showing qf, it's the per-job queue limit, i.e. the queue has h_vmem and s_vmem limits of 40G per job (which it does). And then hc is shown when the host resources are less than the per-job queue limit.
>>
>> Yes, the lower limit should be shown. So it's defined on both sides: exechost and queue?
>
> Yes, the queue has a 40GB per-job limit, and h_vmem and s_vmem are consumables on the exechosts.
>
> -M
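P.S. In case it helps anyone try the same change, below is roughly what the adjustment looks like with qconf. This is only a sketch of my setup, not a general recipe; the queue name (all.q), the hostnames, the 16-slot value and the exact form of the per-host overrides are examples, so adapt them to your own config:

    # Show the current queue definition; my 'slots' line had per-host
    # overrides, something along the lines of:
    #   slots   16,[compute-0-13.local=6],[compute-0-14.local=8]
    qconf -sq all.q | grep slots

    # Workaround: drop the per-host overrides and use one uniform per-node
    # value. Either edit interactively (change the "slots" line to "slots 16"):
    qconf -mq all.q
    # ...or set it non-interactively:
    qconf -mattr queue slots 16 all.q

    # For context: h_vmem/s_vmem are consumables on the exechosts, and the
    # queue also carries a 40G per-job h_vmem/s_vmem limit. That's why
    # 'qstat -F' shows qf: (queue fixed) on hosts with plenty of memory left
    # and hc: (host consumable) on hosts where the remaining amount is below 40G.
    qconf -sc | egrep 'h_vmem|s_vmem|slots'   # consumable flags and default requests
    qconf -se compute-0-14.local              # complex_values on one exechost

Obviously the single uniform value gives up whatever balancing the per-host overrides were providing, so keep that in mind before copying this.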