Hi Chris, hi Sean, thanks also (and thanks again) for chiming in.
Quick follow-up question: Would `squeue` be a better fall-back command
than `scontrol` from the perspective of keeping `slurmctld` responsive?
From what I can see in the general overview of how Slurm works
(https://slurm.schedmd.com/overview.html), both query `slurmctld`. But
would one be "better" than the other, as in generating less work for
`slurmctld`? Or is it roughly an equivalent amount of work, so that we
can instead choose based on which set of command-line arguments better
suits our needs?

Also, just as a quick heads-up: I am documenting your input by linking
to the mailing list archives. I hope that's alright with you?
https://github.com/snakemake/snakemake/pull/2136#issuecomment-1446170467

cheers,
david

On Sat, 2023-02-25 at 10:51 -0800, Chris Samuel wrote:
> On 23/2/23 2:55 am, David Laehnemann wrote:
>
> > And consequently, would using `scontrol` thus be the better default
> > option (as opposed to `sacct`) for repeated job status checks by a
> > workflow management system?
>
> Many others have commented on this, but use of scontrol in this way is
> really really bad because of the impact it has on slurmctld. This is
> because responding to the RPC (IIRC) requires taking read locks on
> internal data structures and on a large, busy system (like ours, we
> recently rolled over slurm job IDs back to 1 after ~6 years of
> operation and run at over 90% occupancy most of the time) this can
> really damage scheduling performance.
>
> We've had numerous occasions where we've had to track down users
> abusing scontrol in this way and redirect them to use sacct instead.
>
> We already use the cli filter abilities in Slurm to impose a form of
> rate limiting on RPCs from other commands, but unfortunately scontrol
> is not covered by that.
>
> All the best,
> Chris