I've got a workaround for my problem. First, to summarize for anyone with a
similar issue:

My setup has slots, h_vmem, and s_vmem set as consumables. Jobs that
requested multiple slots plus h_vmem/s_vmem (RAM) above the default value
set in the complex configuration almost always ended up stuck in the queue
for an unpredictable amount of time with the message "PE offers 0 slots",
even though cluster resources were available and the user had not reached
any quota limits.
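
For context, the consumables were defined roughly as follows in `qconf -sc`
(the default values here are illustrative, not copied from my cluster):

    #name    shortcut  type    relop  requestable  consumable  default  urgency
    slots    s         INT     <=     YES          YES         1        1000
    h_vmem   h_vmem    MEMORY  <=     YES          YES         4G       0
    s_vmem   s_vmem    MEMORY  <=     YES          YES         4G       0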

This was happening for just one queue.

My workaround is to set the 'slots' parameter for the queue to just the
default per-node value, as sketched below. I had previously assigned
different slot values to some of the nodes to balance usage between them.
That's it. I don't know what's going on, but this works, and hopefully
will keep working!
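
Concretely (the hostnames and slot counts here are made up for
illustration), the queue's slots entry in `qconf -mq all.q` used to carry
per-host overrides along these lines:

    slots                 16,[compute-0-13.local=12],[compute-0-14.local=10]

and the workaround was to collapse it back to the plain per-node default:

    slots                 16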

Thanks for all the help and suggestions! If the above workaround sheds any
light on the issue and points toward a more complete solution, please let
me know.

-M

On Thu, Aug 17, 2017 at 11:50 AM, Michael Stauffer <mgsta...@gmail.com>
wrote:

> On Thu, Aug 17, 2017 at 7:49 AM, Reuti <re...@staff.uni-marburg.de> wrote:
>
>>
>> > On 13.08.2017 at 18:11, Michael Stauffer <mgsta...@gmail.com> wrote:
>> >
>> > Thanks for the reply Reuti, see below
>> >
>> > On Fri, Aug 11, 2017 at 7:18 PM, Reuti <re...@staff.uni-marburg.de> wrote:
>> >
>> > What I notice below: defining h_vmem/s_vmem at the queue level means per
>> > job. Defining it at the exechost level means across all jobs. What is
>> > different between:
>> >
>> > > ---------------------------------------------------------------------------------
>> > > all.q@compute-0-13.local       BP    0/10/16        9.14     lx-amd64
>> > >         qf:h_vmem=40.000G
>> > >         qf:s_vmem=40.000G
>> > >         hc:slots=6
>> > > ---------------------------------------------------------------------------------
>> > > all.q@compute-0-14.local       BP    0/10/16        9.66     lx-amd64
>> > >         hc:h_vmem=28.890G
>> > >         hc:s_vmem=30.990G
>> > >         hc:slots=6
>> >
>> >
>> > qf = queue fixed
>> > hc = host consumable
>> >
>> > What is the definition of h_vmem/s_vmem in `qconf -sc` and their default
>> > consumptions?
>> >
>> > I thought this means that when it's showing qf, it's the per-job queue
>> > limit, i.e. the queue has h_vmem and s_vmem limits of 40G per job (which
>> > it does). And then hc is shown when the host resources are less than the
>> > per-job queue limit.
>>
>> Yes, the lower limit should be shown. So it's defined on both sides:
>> exechost and queue?
>
>
> Yes, the queue has a 40GB per-job limit, and h_vmem and s_vmem are
> consumables on the exechosts.
>
> -M
>
