Hi David,

David Laehnemann <david.laehnem...@hhu.de> writes:

> Dear Slurm users and developers,
>
> TL;DR:
> Do any of you know if `scontrol` status checks of jobs are always
> expected to be quicker than `sacct` job status checks? Do you have any
> comparative timings between the two commands?
> And consequently, would using `scontrol` thus be the better default
> option (as opposed to `sacct`) for repeated job status checks by a
> workflow management system?

I am probably being a bit naive, but I would have thought that the
batch system should just be able to start your jobs when resources
become available. Why do you need to check the status of jobs? I would
tend to think that it is not something users should be doing.

Cheers,

Loris

> And here's the long version with background info and links:
>
> I have recently started using a Slurm cluster and am a regular user
> of the workflow management system snakemake
> (https://snakemake.readthedocs.io/en/latest/). This workflow manager
> recently integrated support for running analysis workflows pretty
> seamlessly on Slurm clusters. It takes care of managing all job
> dependencies and handles the submission of jobs according to your
> global (and job-specific) resource configurations.
>
> One little hiccup when starting to use the snakemake-Slurm
> combination was a snakemake-internal rate limitation for checking job
> statuses. You can find the full story here:
> https://github.com/snakemake/snakemake/pull/2136
>
> For debugging this, I obtained timings for `sacct` and `scontrol`,
> with `scontrol` consistently about 2.5x quicker in returning the job
> status than `sacct`. The timings are recorded here:
> https://github.com/snakemake/snakemake/blob/b91651d5ea2314b954a3b4b096d7f327ce743b94/snakemake/scheduler.py#L199-L210
>
> However, `sacct` is currently used by default for regularly checking
> the status of submitted jobs, and `scontrol` is only a fallback
> whenever `sacct` doesn't find the job (for example because it is not
> yet running). Now I was wondering whether switching the default to
> `scontrol` would make sense. Thus, I would like to ask:
>
> 1) Slurm users: do you also see similar timings on different Slurm
> clusters, and do they confirm that `scontrol` is consistently
> quicker?
>
> 2) Slurm developers: is `scontrol` expected to be quicker, given its
> implementation, and is using `scontrol` also the option that puts
> less strain on the scheduler in general?
>
> Many thanks and best regards,
> David

--
Dr. Loris Bennett (Herr/Mr)
ZEDAT, Freie Universität Berlin
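
For concreteness, here is a minimal sketch of the kind of status check
being discussed, trying `scontrol` first and falling back to `sacct`.
This is not snakemake's actual implementation; the helper name
`job_status` and the exact set of flags are illustrative choices only.

import re
import subprocess


def job_status(job_id: str) -> str:
    """Return the Slurm state string (e.g. PENDING, RUNNING, COMPLETED)."""
    # `scontrol show job` asks the controller (slurmctld) directly; it only
    # knows about jobs that are still in its memory, i.e. pending, running,
    # or only recently finished.
    try:
        out = subprocess.run(
            ["scontrol", "show", "job", job_id],
            capture_output=True, text=True, check=True,
        ).stdout
        match = re.search(r"JobState=(\S+)", out)
        if match:
            return match.group(1)
    except subprocess.CalledProcessError:
        # scontrol exits non-zero once the job is no longer known to the
        # controller; fall through to the accounting database.
        pass

    # `sacct` queries the accounting database (slurmdbd), which also covers
    # jobs that have aged out of the controller's memory. -X restricts the
    # output to the job allocation itself rather than individual steps.
    out = subprocess.run(
        ["sacct", "-j", job_id, "--format=State", "--noheader", "-X"],
        capture_output=True, text=True, check=True,
    ).stdout
    fields = out.split()
    return fields[0] if fields else "UNKNOWN"

Timing both branches on your own cluster, e.g. with
`time scontrol show job <jobid>` versus `time sacct -j <jobid>`, should
show whether the roughly 2.5x difference David reports holds up
elsewhere.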