Thanks for the reply Reuti!

Sounds like some of the suggestions involve moving limits out of RQS and
into complexes and consumable resources. How do we make that happen
without requiring users to add -l requests to their qsub commands?
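
(One thing I've been reading up on, in case it matters for the answer: a
consumable whose "default" column in qconf -mc is non-zero is debited for
every job with no -l at all, and cluster-wide submit options can go in the
default request file, e.g.:

# $SGE_ROOT/$SGE_CELL/common/sge_request -- anything here is applied to
# every qsub as if typed on the command line; the -l is just an example
-l h_vmem=8G

Neither of those is queue-aware though, which I suspect is the hard part.)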

On 3/8/16, 7:32 AM, "Reuti" <re...@staff.uni-marburg.de> wrote:

>I saw cases where RQS blocks further scheduling and shows up in `qstat -j`
>with a cryptic message. That was in 6.2u5, though, and I don't know
>whether there was any work in this area to fix it since.
>
>Often you can spot it in the scheduling output: an RQS is reported as
>violated although the rule is not actually violated. For me it kicked in
>when I requested a complex with a load value in the submission command.
>
>cannot run because it exceeds limit "////node20/" in rule "general/slots"

I've also seen cryptic qstat -j messages about slots not being available
that ended up requiring changing values in a PE, but that was settled a
while ago.
Within queue definitions we have:
slots                 1,[@16core=16],[@20core=20],[@28core=28]

Those are per-node limits; I'm unsure how to change this to a total slot
count for a queue across all of its eligible nodes.
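
(The only totals I know how to check today are the per-host values summed
over all instances, e.g.:

# cluster-queue summary: total/used/available slot counts per queue,
# summed over all of its queue instances
qstat -g c

so the configured limit itself still lives per host.)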

We have 20+ queues and use RQS, host groups and disabling/enabling queue
instances to manage balancing nodes and load between queues.


When I turn schedd_job_info back on and look at qstat -j, I sometimes (but
not often) see those "exceeds limit" entries, but before that there are
MANY entries like:

queue instance "de...@pnode073.nygenome.org" dropped because it is temporarily not available
queue instance "cus...@pnode077.nygenome.org" dropped because it is disabled

And these are for queues other than the one specified in -q
(hard_queue_list). I am wondering if qmaster is giving up on checking
eligible matching queue instances after working through all of these
disabled instances for other queues. Perhaps using hostgroups in the queue
definitions would be more efficient than disabling queue instances.
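
If we went that route, I imagine the change would look roughly like this
(hostgroup name hypothetical):

# qconf -mq dev.q -- point the queue at a hostgroup rather than
# enumerating hosts and disabling instances with qmod -d/-e:
hostlist              @devNodes

# and then rebalance by editing group membership instead:
# qconf -mhgrp @devNodes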

>AFAICS:
>
>>
>> Some config snippets showing non-default and potentially-relevant
>> values; I can put full output to a pastebin if it is useful:
>> qconf -srqs:
>> {
>>   name         slots_per_host
>>   description  Limit slots per host
>>   enabled      TRUE
>>   limit        hosts {@16core} to slots=16
>>   limit        hosts {@20core} to slots=20
>>   limit        hosts {@28core} to slots=28
>>   limit        hosts {!@physicalNodes} to slots=2
>> }
>
>The above RQS could be put in individual complex_values per exechost. Yes
>- the above is handy, I know.

Is the idea to define a new complex (qconf -mc) and then set it per
exechost (qconf -me), or re-use "slots" somehow?

Current qconf -mc |grep slot:
slots               s          INT       <=      YES         YES        1       1000

Putting slots per exechost wouldn't be too awful to script (qconf -mattr),
but we'd need it to work without asking users to change their qsub
commands. Would this be more efficient than having it in the RQS stanza
above?
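
For anyone following along: as I understand it (corrections welcome),
slots in an exechost's complex_values is a built-in consumable that the
scheduler debits on its own for every occupied slot, so no qsub changes
should be needed. The scripting I had in mind, using our existing
hostgroups, would be roughly:

# add a host-level slots cap to every member of a hostgroup
# (use -mattr instead of -aattr if the entry already exists)
for h in $(qconf -shgrp_resolved @16core); do
    qconf -aattr exechost complex_values slots=16 "$h"
done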

>> {
>>   name         io
>>   description  Limit max concurrent io.q slots
>>   enabled      TRUE
>>   limit        queues io.q to slots=300
>> }
>> {
>>   name         dev
>>   description  Limit max concurrent dev.q slots
>>   enabled      TRUE
>>   limit        queues dev.q to slots=250
>> }
>> {
>>   name         pipeline
>>   description  Limit max concurrent pipeline.q slots
>>   enabled      TRUE
>>   limit        queues pipeline.q to slots=4000
>> }
>> ...other queues..
>
>Here one could use a global complex for each type of queue, as long as
>the users specify the particular queue. One loses the flexibility of a
>job potentially being scheduled to different types of queues whenever its
>resource requests are met.

Sounds like we could switch from RQS queue core limits to a complex per
queue, but I'm not sure how to make all slots for a qsub -q foo.q
automatically debit against a consumable resource like foo_slots. What
would we need to do beyond defining the global complex, setting its value,
and marking it consumable?
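
From what I've read (untested sketch, consumable names hypothetical), the
missing piece would be a server-side JSV that injects the request whenever
-q names one of the queues, enabled via jsv_url in qconf -mconf:

#!/bin/sh
# sketch: debit a per-queue consumable for every job submitted with -q,
# so users never have to add -l themselves
# assumes the consumables were created first, e.g. in qconf -mc:
#   io_slots   iosl   INT   <=   YES   YES   0   0
# and capped on the global host (qconf -me global):
#   complex_values io_slots=300

jsv_on_verify()
{
   # simplified: assumes -q names exactly one queue
   queue=`jsv_get_param q_hard`
   case "$queue" in
      io.q)       jsv_sub_add_param l_hard io_slots 1 ;;
      dev.q)      jsv_sub_add_param l_hard dev_slots 1 ;;
      pipeline.q) jsv_sub_add_param l_hard pipeline_slots 1 ;;
   esac
   # with consumable=YES the request is multiplied by the job's slot
   # count, so parallel jobs debit one unit per slot
   jsv_correct "per-queue slot consumable added"
   return
}

. "${SGE_ROOT}/util/resources/jsv/jsv_include.sh"
jsv_main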

This would be a big change to the way we currently limit load across
queues and nodes, but we are willing to try it to get past our issues.

>I can't predict whether this would improve anything in the situation you
>face.

Understood!

Thanks!
Chris

>> qconf -sc|grep mem (note default mem per job is 8GB and this is
>> consumable):
>> h_vmem              mem        MEMORY    <=      YES         JOB       8G      0
>>
>> A typical exechost qconf -se:
>> complex_values        h_vmem=240G,exclusive=true
>>
>> qconf -sconf:
>> shell_start_mode             unix_behavior
>> reporting_params             accounting=true reporting=false \
>>                              flush_time=00:00:15 joblog=true sharelog=00:00:00
>> finished_jobs                100
>> gid_range                    20000-20100
>> max_aj_instances             3000
>> max_aj_tasks                 75000
>> max_u_jobs                   0
>> max_jobs                     0
>> max_advance_reservations     50
>>
>> qconf -msconf:
>> schedule_interval                 0:0:45
>> maxujobs                          0
>> queue_sort_method                 load
>>
>> schedd_job_info                   false  (this used to be true, as qstat
>> -j on a stuck job can be useful)
>> params                            monitor=false
>> max_functional_jobs_to_schedule   1000
>> max_pending_tasks_per_job         50
>> max_reservation                   0  (used to be 50 to allow large jobs
>> with -R y to have a better chance to run)
>> default_duration                  4320:0:0

