Hi!
We have what seems to be a similar type of load, and have at times experienced the same problem. There are some parameters that can be used to tune the backfiller. We have had good results with setting bf_max_job_user to a small value (between 5 and 10) and bf_resolution to a large value (around 3600). bf_max_job_user is similar to Maui's MAXIJOB limit: the backfiller will only try this many jobs for each user. This is especially useful if some users have many identical or nearly identical jobs in the queue.
I have tried tuning with bf_max_job_user and, as you say, it is especially useful when users have many identical jobs in the queue, but I think it is somewhat bad for the backfiller not to look at the whole queue.
Many of our users with many jobs do have more or less identical jobs, but not all of them. Not looking at the complete queue would then be bad for the user, especially one who submits small jobs for testing purposes.
bf_resolution is the time resolution (in seconds) of the time slots used for estimating when a job can start. The default, 60 seconds, was way too low for us.
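To get a feeling for why the resolution matters, here is a rough back-of-the-envelope sketch in Python. It is only a simplified model, not how the backfiller is actually implemented: it assumes a one-day planning window (the WINDOW value and the slot_count helper are made up for illustration) and counts how many time slots of a given resolution are needed to cover it.

```python
import math


def slot_count(window_seconds: int, resolution_seconds: int) -> int:
    """Number of time slots of the given resolution needed to cover the window."""
    return math.ceil(window_seconds / resolution_seconds)


# Assumed one-day planning window, in seconds (illustrative value only).
WINDOW = 24 * 60 * 60

for resolution in (60, 3600):
    # Prints the resolution followed by the number of slots it implies.
    print(resolution, slot_count(WINDOW, resolution))
```

Under these assumptions, going from 60 to 3600 seconds cuts the number of slots per day from 1440 to 24, at the price of coarser start-time estimates.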
I will try increasing the resolution value and see if the backfiller picks up speed with that.
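For reference, I assume the two parameters discussed above would be set together in slurm.conf roughly as in the fragment below. The values are just the ones suggested in this thread (a small bf_max_job_user, bf_resolution around 3600), not tested recommendations, and the rest of the file is left out.

```
# slurm.conf (fragment, illustrative values from this thread)
SchedulerType=sched/backfill
SchedulerParameters=bf_max_job_user=10,bf_resolution=3600
```

The controller has to reread its configuration before a change like this takes effect.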
Best regards, Magnus -- Magnus Jonsson, Developer, HPC2N, Umeå Universitet