On Mon, Mar 07, 2016 at 11:20:04PM +0000, Christopher Black wrote:
> Greetings!
> We are running SoGE (mix of 8.1.6 and 8.1.8, soon 8.1.8 everywhere) on a
> ~300 node cluster. We utilize RQS and memory reservation via a complex to
> allow most nodes to be shared among multiple queues and to run a mix of
> single core and multi core jobs.
> Recently, when we hit 10k+ jobs in qw, we are seeing the job dispatch rate
> not keep up with how quickly jobs are finishing, leaving cores idle.
> Our jobs aren't particularly short (avg ~2h).
> We sometimes have a case where there are thousands of jobs not suitable
> for execution due to hitting a per-queue RQS rule, but we still want other
> jobs to get started on idle cores.
>
> We have tried tuning some parameters but could use some advice, as we are
> now having trouble keeping all the cores busy despite there being many
> eligible jobs in qw.
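(For context, I am picturing a setup roughly like the following - a memory
consumable plus a per-queue RQS rule. The names and limits below are only
illustrative, not your actual config:

    # qconf -sc: h_vmem defined as a consumable, so reservations debit it
    #name   shortcut  type    relop  requestable  consumable  default  urgency
    h_vmem  h_vmem    MEMORY  <=     YES          YES         0        0

    # qconf -srqs: a rule that caps slots used per queue
    {
       name         per_queue_slots
       description  cap slots used by each queue
       enabled      TRUE
       limit        queues batch.q to slots=2000
       limit        queues highmem.q to slots=400
    }

Correct me if that is not close.)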
If you have a complex controlling memory and users are requesting lots of
memory, then a node can be "full" while still having idle cores. Have you
tried running qalter -w p on the highest priority job you think should run?
What does it say?

> We have tried tuning max_advance_reservations,
> max_functional_jobs_to_schedule, max_pending_tasks_per_job,
> max_reservation as well as disabling schedd_job_info. We have applied some
> of the scaling best practices such as using local spools. I saw mention of
> MAX_DYN_EC but have not tried that yet, is it fairly safe to do so?
> Any other changes we should consider?

Is this just a delay in the scheduler running? What values do you have for
scheduler_interval, flush_submit_sec and flush_finish_sec in the scheduler
config?

Is the scheduler taking a long time to do its work? Set PROFILE to 1 in the
scheduler's params and look for entries in the messages file (or wherever
you send such messages if logging via syslog) containing the string
"schedd run took:" to see how long each run is taking. This number can vary
a lot depending on the cluster config and the sort of requests submitted to
it. On one of our clusters a scheduling run takes 3-4 seconds (it could be
shorter still if we disabled reservations). Another one was taking 10-15
minutes or more per scheduling run before I tweaked it. A lot of jobs can
finish in 10-15 minutes.

> Any thoughts or suggestions?
>
> Also, we sometimes see the following in spool/qmaster/messages:
> 03/07/2016 18:00:09|worker|pqmaster|E|resources no longer available for
> start of job 7499766.1
> 03/07/2016 18:00:09|worker|pqmaster|E|debiting 31539000000.000000 of
> h_vmem on host pnode141.nygenome.org for 1 slots would exceed remaining
> capacity of 2986364800.000000
> 03/07/2016 18:00:09|worker|pqmaster|E|resources no longer available for
> start of job 7499767.1
> 03/07/2016 18:00:09|worker|pqmaster|E|debiting 18022000000.000000 of
> h_vmem on host pnode176.nygenome.org for 1 slots would exceed remaining
> capacity of 9887037760.000000
>
> I expect this is due to the memory reservation, but I'm not sure of the
> exact cause, whether it is a problem, or whether a parameter change might
> improve operations. One theory is that when looking to do reservations on
> hundreds of jobs, by the time it gets part way through the list the memory
> that would have been reserved in the consumable resource has been
> allocated to another job, but I'm not sure, as I don't see many hits on
> that log message.
> (update: just found
> http://arc.liv.ac.uk/pipermail/sge-bugs/2016-February.txt)
> I don't know if this is a root cause of our problems leaving cores idle,
> as we see some of these even when everything is running fine.

That message was from me. In that case the numbers were off by a fraction
due, AFAICT, to the qmaster and the scheduler rounding differently. The
examples you quote are very different - not the same problem, I think.

William
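P.S. In case it is useful, this is roughly what I would run to check the
above. The job id is a placeholder and the spool path depends on where your
cell keeps its qmaster spool:

    # Ask why a specific pending job is not being dispatched
    qalter -w p <job_id>

    # Show the current scheduler configuration (scheduler_interval,
    # flush_submit_sec, flush_finish_sec, params, ...)
    qconf -ssconf

    # Edit the scheduler configuration and add PROFILE=1 to "params"
    qconf -msconf

    # Then watch how long each scheduling run takes
    grep "schedd run took:" $SGE_ROOT/$SGE_CELL/spool/qmaster/messages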