Hi,

We've run into similar problems with backfill (though apparently not at the scale you've got). We have a number of users who will drop 5,000+ jobs at once, and as you've indicated, this can play havoc with backfill.
Two of the newer* backfill scheduler parameters that have been a real help for us are "bf_max_job_assoc" and "bf_max_job_user". These limit the number of jobs the scheduler considers per association and per user. Our current settings:

SchedulerParameters=bf_continue,bf_interval=120,bf_job_part_count_reserve=6,bf_window=43200,bf_resolution=1800,bf_max_job_user=200,bf_max_job_assoc=200,bf_max_job_part=500,bf_max_job_test=2000,bf_yield_interval=1000000,default_queue_depth=500,defer,partition_job_depth=300,max_rpc_cnt=200,preempt_youngest_first

- Michael

*I think these are newer; I don't actually know when they were added (I'm currently on 17.11.5).

On Wed, Oct 10, 2018 at 6:08 PM Richard Feltstykket <rafeltstyk...@ucdavis.edu> wrote:
> Hello list,
>
> My cluster usually has a pretty heterogeneous job load and spends a lot of
> the time memory bound. Occasionally I have users that submit 100k+ short,
> low-resource jobs. Despite having several thousand free cores and enough
> RAM to run the jobs, the backfill scheduler would never backfill them. It
> turns out that there were a number of factors: they were deep down in the
> queue, from an account with low priority, and there were a lot of them for
> the scheduler to consider. After a bunch of tuning, the backfill scheduler
> parameters I finally settled on are:
>
> SchedulerParameters=defer,bf_continue,bf_interval=20,bf_resolution=600,bf_yield_interval=1000000,sched_min_interval=2000000,bf_max_time=600,bf_max_job_test=1000000
>
> After implementing these changes the backfill scheduler began to
> successfully schedule these jobs on the cluster. While the cluster has a
> deep queue, the load on the slurmctld host can get pretty high. However,
> no users have reported issues with the responsiveness of the various Slurm
> commands, and the backup controller has never taken over either. The
> changes have been in place for a month or so with no ill effects that I
> have observed.
>
> While I was troubleshooting I was definitely combing the list archives for
> other people's tuning suggestions, so I figured I would post a message here
> for posterity, as well as see if anyone has similar settings or feedback
> :-).
>
> Cheers,
> Richard
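
P.S. For anyone finding this thread in the archives later: these options all live on the SchedulerParameters line in slurm.conf on the slurmctld host. A rough sketch of how we stage and check a change like the per-user/per-association caps is below; the parameter values here are only placeholders for illustration, not a recommendation for any particular workload.

    # slurm.conf (excerpt) - example values only, tune for your own queue depth
    SchedulerType=sched/backfill
    SchedulerParameters=bf_continue,bf_interval=30,bf_max_job_user=200,bf_max_job_assoc=200,bf_max_job_test=10000

    # apply the change without restarting the daemons, then confirm what
    # slurmctld actually picked up
    scontrol reconfigure
    scontrol show config | grep -i SchedulerParameters

    # watch the backfill statistics (total backfilled jobs, cycle times,
    # queue depth considered) to see whether backfill is getting through
    # the deep queue
    sdiag | grep -A 15 "Backfilling stats"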