Greetings!
We are running SoGE (mix of 8.1.6 and 8.1.8, soon 8.1.8 everywhere) on a
~300 node cluster.
We use RQS and memory reservation via a consumable complex to let most nodes
be shared among multiple queues and run a mix of single-core and multi-core
jobs.
Recently, once we hit 10k+ jobs in qw, the job dispatch rate stops keeping
up with how quickly jobs finish, which leaves cores idle.
Our jobs aren't particularly short (avg ~2h).
We sometimes have thousands of jobs that are ineligible for execution
because they hit a per-queue RQS rule, but we still want other jobs to start
on the idle cores.

We have tried tuning some parameters but could use some advice, as we are
now having trouble keeping all the cores busy despite there being many
eligible jobs in qw.

We have tried tuning max_advance_reservations,
max_functional_jobs_to_schedule, max_pending_tasks_per_job, and
max_reservation, as well as disabling schedd_job_info. We have also applied
some of the scaling best practices, such as using local spools. I saw a
mention of MAX_DYN_EC but have not tried it yet; is it fairly safe to do so?
Any other changes we should consider?
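
For reference, my reading of sge_conf(5) is that MAX_DYN_EC would go into
qmaster_params, something like the sketch below (the value is only an
illustration; we have not tried it):

qconf -mconf
# then, in the editor, add or extend the qmaster_params line:
qmaster_params               MAX_DYN_EC=2000
# (default is 1000; it caps the number of dynamic event clients,
# e.g. qsub -sync y and DRMAA sessions)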

One thing I am not clear on is whether a low
max_functional_jobs_to_schedule means that only the first n jobs with the
highest calculated priority are evaluated for starting. If so, high-priority
jobs that are ineligible for execution due to RQS or other reasons would
prevent lower-priority jobs from starting.
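
To try to answer this myself, I was thinking of temporarily turning on
scheduler monitoring, roughly as sketched below (this is my reading of
sched_conf(5), so treat it as an untested sketch; the monitoring output adds
some overhead):

qconf -msconf
# set:   params    monitor=true
# each scheduling cycle then appends its start/reservation decisions to:
less $SGE_ROOT/$SGE_CELL/common/schedule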

Any thoughts or suggestions?

Also, we sometimes see the following in spool/qmaster/messages:
03/07/2016 18:00:09|worker|pqmaster|E|resources no longer available for
start of job 7499766.1
03/07/2016 18:00:09|worker|pqmaster|E|debiting 31539000000.000000 of
h_vmem on host pnode141.nygenome.org for 1 slots would exceed remaining
capacity of 2986364800.000000
03/07/2016 18:00:09|worker|pqmaster|E|resources no longer available for
start of job 7499767.1
03/07/2016 18:00:09|worker|pqmaster|E|debiting 18022000000.000000 of
h_vmem on host pnode176.nygenome.org for 1 slots would exceed remaining
capacity of 9887037760.000000

I expect this is due to the memory reservation, but I'm not sure of the
exact cause, whether it is a problem, or whether a parameter change might
improve things. One theory is that when the scheduler is working through
reservations for hundreds of jobs, by the time it gets partway through the
list, the memory it would have reserved in the consumable has already been
allocated to another job; but I'm not sure, as I don't see many hits on that
log message.
(Update: I just found
http://arc.liv.ac.uk/pipermail/sge-bugs/2016-February.txt)
I don't know whether this is a root cause of the idle cores, as we see some
of these messages even when everything is running fine.
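
In case it helps with diagnosis, this is how we have been spot-checking the
remaining h_vmem consumable on a host (qhost/qstat -F as I understand them;
the hc: lines show the host-level consumable value):

qhost -h pnode141.nygenome.org -F h_vmem
qstat -F h_vmem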

Thanks,
Chris

Some config snippets showing non-default and potentially relevant values;
I can put the full output on a pastebin if it would be useful:
qconf -srqs:
{
   name         slots_per_host
   description  Limit slots per host
   enabled      TRUE
   limit        hosts {@16core} to slots=16
   limit        hosts {@20core} to slots=20
   limit        hosts {@28core} to slots=28
   limit        hosts {!@physicalNodes} to slots=2
}
{
   name         io
   description  Limit max concurrent io.q slots
   enabled      TRUE
   limit        queues io.q to slots=300
}
{
   name         dev
   description  Limit max concurrent dev.q slots
   enabled      TRUE
   limit        queues dev.q to slots=250
}
{
   name         pipeline
   description  Limit max concurrent pipeline.q slots
   enabled      TRUE
   limit        queues pipeline.q to slots=4000
}
...other queues...

qconf -sc | grep mem (note: the default mem per job is 8GB and h_vmem is
consumable):
h_vmem              mem        MEMORY    <=      YES         JOB        8G         0

A typical exechost qconf -se:
complex_values        h_vmem=240G,exclusive=true
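
For context, a typical submission against this setup looks something like
the sketch below (the PE name and values are only illustrative). With
h_vmem defined as a JOB consumable, my understanding is that the request is
debited from the host's 240G once per job rather than once per slot:

qsub -pe smp 4 -l h_vmem=32G job.sh
# debits 32G (not 4 x 32G) from the host's h_vmem=240G;
# jobs that omit -l h_vmem fall back to the 8G default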

qconf -sconf:
shell_start_mode             unix_behavior
reporting_params             accounting=true reporting=false \
                             flush_time=00:00:15 joblog=true \
                             sharelog=00:00:00
finished_jobs                100
gid_range                    20000-20100
max_aj_instances             3000
max_aj_tasks                 75000
max_u_jobs                   0
max_jobs                     0
max_advance_reservations     50

qconf -ssconf:
schedule_interval                 0:0:45
maxujobs                          0
queue_sort_method                 load

schedd_job_info                   false  (this used to be true, as qstat -j on a stuck job can be useful)
params                            monitor=false
max_functional_jobs_to_schedule   1000
max_pending_tasks_per_job         50
max_reservation                   0  (used to be 50 to allow large jobs with -R y to have a better chance to run)
default_duration                  4320:0:0

