Not answering every question below, but for (1) we run default_queue_depth=200 on a cluster with 
a few dozen nodes and around 1k cores, per 
https://lists.schedmd.com/pipermail/slurm-users/2021-June/007463.html -- there 
may be other settings in that email that could be beneficial. We had a lot of 
idle resources that could have been backfilled with short, lower-priority jobs, 
and raising the depth basically resolved that.
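For reference, that amounts to a one-line slurm.conf change (200 is our value; merge it with whatever other SchedulerParameters options you already set):

```
# slurm.conf -- raise how many queued jobs the main scheduler
# examines per cycle (200 is our site's value, not a recommendation;
# keep any existing SchedulerParameters options on the same line)
SchedulerParameters=default_queue_depth=200
```

After editing, `scontrol reconfigure` should pick the change up.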

For (3), I think https://slurm.schedmd.com/sprio.html would be my first stop.
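Concretely for (3), something like the following shows the per-factor priority breakdown and an approximation of the order the scheduler will consider pending jobs in (flags per the sprio and squeue man pages; adjust the format options to taste):

```shell
# Per-job priority broken down by factor (age, fairshare, etc.)
sprio -l

# Pending jobs sorted by descending priority -- roughly the order
# the scheduler walks the queue in
squeue --state=PENDING --sort=-p
```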

For (4), as far as I know, that's a setting for all partitions.
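One thing possibly worth checking for (4): if I remember right, SchedulerParameters also has a partition_job_depth option that caps how many jobs are considered per partition. Treat this as a sketch to verify against the slurm.conf man page for your version:

```
# slurm.conf (sketch -- partition_job_depth is from memory, verify
# it exists in your Slurm version's slurm.conf man page)
SchedulerParameters=default_queue_depth=200,partition_job_depth=100
```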

From: slurm-users <slurm-users-boun...@lists.schedmd.com> on behalf of David 
Henkemeyer <david.henkeme...@gmail.com>
Date: Wednesday, January 12, 2022 at 11:27 AM
To: Slurm User Community List <slurm-users@lists.schedmd.com>
Subject: [slurm-users] Questions about default_queue_depth

Hello,

A few weeks ago, we tested Slurm with about 50K jobs and observed at least 
one instance where a node sat idle while there were jobs in the queue that 
could have run on it.  Our best guess at this point is that 
default_queue_depth was at its default value of 100, and that the eligible 
jobs were likely not among the first 100 jobs in the queue.  Based on this, 
I have a few questions:
1) What is a reasonable value for default_queue_depth?  Would 1000 be ok, in 
terms of performance?
2) How can we better debug why queued jobs are not being selected?
3) Is there a way to see the order of the jobs in the queue?  Perhaps squeue 
lists the jobs in order?
4) If we had several partitions, would the default_queue_depth apply to all 
partitions?

Thank you
David
