On Mon, Mar 07, 2016 at 11:20:04PM +0000, Christopher Black wrote:
> Greetings!
> We are running SoGE (mix of 8.1.6 and 8.1.8, soon 8.1.8 everywhere) on a
> ~300 node cluster.
> We utilize RQS and memory reservation via a complex to allow most nodes to
> be shared among multiple queues and run a mix of single core and multi
> core jobs.
> Recently, when we hit 10k+ jobs in qw, we are seeing the job dispatch rate
> fail to keep up with how quickly jobs are finishing, leaving cores idle.
> Our jobs aren't particularly short (avg ~2h).
> We sometimes have a case where there are thousands of jobs not suitable
> for execution due to hitting a per-queue RQS rule, but we still want other
> jobs to get started on idle cores.
> 
> We have tried tuning some parameters but could use some advice as we are
> now having trouble keeping all the cores busy despite there being many
> eligible jobs in qw.

If you have a complex controlling memory and users are requesting lots of memory
then a node can be "full" while still having idle cores.  Have you tried running
qalter -w p on the highest priority job you think should run?  What does it say?
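
For example (the job id is just a placeholder), something along these lines
should report either a possible assignment or the reason the job cannot be
dispatched, such as an RQS limit being hit or no host with enough h_vmem left:

    qalter -w p 1234567

Since -w p validates against the cluster state as it is now, it is also a cheap
way to check a single job even with schedd_job_info disabled.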

> 
> We have tried tuning max_advance_reservations,
> max_functional_jobs_to_schedule, max_pending_tasks_per_job,
> max_reservation as well as disabling schedd_job_info. We have applied some
> of the scaling best practices, such as using local spools. I saw mention of
> MAX_DYN_EC but have not tried it yet; is it fairly safe to do so?
> Any other changes we should consider?

Is this just a delay in the scheduler running?  What values do you have
for schedule_interval, flush_submit_sec and flush_finish_sec in the scheduler
config?
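
To see the current values (qconf -ssconf prints the whole scheduler
configuration; the egrep is just for convenience):

    qconf -ssconf | egrep 'schedule_interval|flush_submit_sec|flush_finish_sec'

With the flush_* values at 0 the scheduler only wakes up once per
schedule_interval; setting them to a small value such as 1 triggers a run
shortly after a submit or a job finish, which helps keep cores from sitting
idle between intervals.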

Is the scheduler taking a long time to do its work?  Set PROFILE to
1 in the scheduler's params and look for entries in the messages file
(or wherever you send such messages if logging via syslog) containing
the string "schedd run took:" to see how long it is taking.  This 
number can vary a lot depending on cluster config and the sort
of requests submitted to it.  On one of our clusters a scheduling run takes
3-4 seconds (could be shorter still if we disabled reservations).  Another 
one was taking 10-15 minutes or more for a scheduling run before I tweaked it.  
A lot of jobs can finish in 10-15 minutes.
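
Roughly, assuming the default qmaster spool location (adjust the path if you
spool elsewhere or log via syslog):

    # add PROFILE=1 to the params line of the scheduler configuration
    qconf -msconf

    # then watch how long each run takes
    grep 'schedd run took' $SGE_ROOT/$SGE_CELL/spool/qmaster/messages | tail

If runs are taking minutes rather than seconds, that alone can explain cores
sitting idle while eligible jobs wait in qw.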

> Any thoughts or suggestions?
> 
> Also, we sometimes see the following in spool/qmaster/messages:
> 03/07/2016 18:00:09|worker|pqmaster|E|resources no longer available for
> start of job 7499766.1
> 03/07/2016 18:00:09|worker|pqmaster|E|debiting 31539000000.000000 of
> h_vmem on host pnode141.nygenome.org for 1 slots would exceed remaining
> capacity of 2986364800.000000
> 03/07/2016 18:00:09|worker|pqmaster|E|resources no longer available for
> start of job 7499767.1
> 03/07/2016 18:00:09|worker|pqmaster|E|debiting 18022000000.000000 of
> h_vmem on host pnode176.nygenome.org for 1 slots would exceed remaining
> capacity of 9887037760.000000
> 
> I expect this is due to the memory reservation, but I'm not sure of the exact
> cause, whether it is a problem, or whether a parameter change might improve
> operations. One theory is that when looking to do reservations on hundreds
> of jobs, by the time it gets part way through the list the memory that would
> have been reserved in the consumable resource has already been allocated to
> another job, but I'm not sure as I don't see many hits on that log message.
> (update: just found
> http://arc.liv.ac.uk/pipermail/sge-bugs/2016-February.txt)
> I don't know if this is a root cause of our problems leaving cores idle as
> we see some of these even when everything is running fine.

That message was from me.  In that case the numbers were off only by a fraction,
AFAICT because the qmaster and scheduler rounded differently.  The examples you
quote look very different, so I don't think it is the same problem.
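
If you want to see how much of the h_vmem consumable is actually left on one of
those hosts when the messages appear, something like this should show it (hc: is
the host-level consumable value, hl: the measured load):

    qhost -F h_vmem -h pnode141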

William

