Hi David,

scontrol - interacts with slurmctld using RPC, so it is faster, but its
requests put load on the scheduler itself.

sacct - interacts with slurmdbd, so it doesn't place additional load on
the scheduler.
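For illustration, here is a minimal sketch of the two queries for a
single job (the job id is a placeholder, and both calls assume the
standard Slurm client tools are on PATH):

# Minimal sketch: the two ways to ask Slurm for a job's state.
import subprocess

JOB_ID = "12345"  # hypothetical job id

# scontrol: answered by slurmctld over RPC (fast, but loads the scheduler)
ctl = subprocess.run(
    ["scontrol", "-o", "show", "job", JOB_ID],
    capture_output=True, text=True,
)

# sacct: answered by slurmdbd (no extra load on the scheduler)
acct = subprocess.run(
    ["sacct", "-j", JOB_ID, "--format=State", "--noheader"],
    capture_output=True, text=True,
)

print(ctl.stdout)   # e.g. "JobId=12345 JobName=... JobState=RUNNING ..."
print(acct.stdout)  # e.g. "RUNNING"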
There is a balance to reach, but the scontrol approach is riskier and can
start to interfere with cluster operation if used incorrectly.

Best,
-Sean

On Thu, Feb 23, 2023 at 5:59 AM David Laehnemann <david.laehnem...@hhu.de> wrote:

> Dear Slurm users and developers,
>
> TL;DR:
> Do any of you know if `scontrol` status checks of jobs are always
> expected to be quicker than `sacct` job status checks? Do you have any
> comparative timings between the two commands?
> And consequently, would using `scontrol` thus be the better default
> option (as opposed to `sacct`) for repeated job status checks by a
> workflow management system?
>
> And here's the long version, with background info and links:
>
> I have recently started using a Slurm cluster and am a regular user of
> the workflow management system snakemake
> (https://snakemake.readthedocs.io/en/latest/). This workflow manager
> recently integrated support for running analysis workflows pretty
> seamlessly on Slurm clusters. It takes care of managing all job
> dependencies and handles the submission of jobs according to your
> global (and job-specific) resource configurations.
>
> One little hiccup when starting to use the snakemake-Slurm combination
> was a snakemake-internal rate limitation for checking job statuses. You
> can find the full story here:
> https://github.com/snakemake/snakemake/pull/2136
>
> For debugging this, I obtained timings on `sacct` and `scontrol`, with
> `scontrol` consistently about 2.5x quicker in returning the job status
> than `sacct`. Timings are recorded here:
> https://github.com/snakemake/snakemake/blob/b91651d5ea2314b954a3b4b096d7f327ce743b94/snakemake/scheduler.py#L199-L210
>
> However, `sacct` is currently used by default for regularly checking
> the status of submitted jobs, and `scontrol` is only a fallback
> whenever `sacct` doesn't find the job (for example because it is not
> yet running). Now, I was wondering if switching the default to
> `scontrol` would make sense. Thus, I would like to ask:
>
> 1) Slurm users: do you see similar timings on other Slurm clusters,
> and do they confirm that `scontrol` is consistently quicker?
>
> 2) Slurm developers: is `scontrol` expected to be quicker given its
> implementation, and would using `scontrol` also be the option that
> puts less strain on the scheduler in general?
>
> Many thanks and best regards,
> David
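A rough sketch of the kind of timing comparison described in the thread
(the job id is a placeholder, absolute numbers will vary by site, and
check=True assumes the job id is known to both slurmctld and slurmdbd):

# Time repeated status checks via scontrol and sacct.
import subprocess
import time

JOB_ID = "12345"  # placeholder

def time_cmd(cmd, repeats=10):
    """Run cmd `repeats` times and return the mean wall-clock seconds."""
    start = time.perf_counter()
    for _ in range(repeats):
        subprocess.run(cmd, capture_output=True, check=True)
    return (time.perf_counter() - start) / repeats

t_scontrol = time_cmd(["scontrol", "-o", "show", "job", JOB_ID])
t_sacct = time_cmd(["sacct", "-j", JOB_ID, "--format=State", "--noheader"])

print(f"scontrol: {t_scontrol:.3f}s  sacct: {t_sacct:.3f}s")

Only the ratio between the two commands is meaningful here; the absolute
latencies depend on cluster load and the slurmdbd configuration.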