On 27/2/23 03:34, David Laehnemann wrote:
Hi Chris, hi Sean,
Hiya!
thanks also (and thanks again) for chiming in.
No worries.
Quick follow-up question:
Would `squeue` be a better fall-back command than `scontrol` from the
perspective of keeping `slurmctld` responsive?
Sadly not, whilst
On 27/2/23 06:53, Brian Andrus wrote:
Sorry, I had to share that this is very much like "Are we there yet?" on
a road trip with kids 😄
Slurm is trying to drive.
Oh I love this analogy!
Whereas sacct is like talking to the navigator. The navigator
does talk to the driver to give directions
Hi Brian,
thanks for your ideas. Follow-up questions, because further digging
through the docs didn't get me anywhere definitive on this:
> IMHO, the true solution is that if a job's info NEEDS to be updated
> that often, have the job itself report what it is doing (but NOT via
> slurm commands). The
We have many jupyterhub jobs on our cluster that also do a lot of job
queries. We could adjust the query interval, but what I did is have one
process query all the jobs with `squeue --json`, and the jupyterhub query
script then looks up its job in that output, instead of every jupyterhub
job querying the batch system itself. I
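(To make the caching approach above concrete, here is a minimal Python sketch. The cache path and refresh interval are made up for illustration, and it assumes the JSON layout that `squeue --json` emits on recent Slurm releases, i.e. a top-level "jobs" list with "job_id" and "job_state" fields.)

# cache_squeue.py -- sketch only: one process refreshes a JSON cache of all
# jobs via `squeue --json`; everything else reads the cache instead of
# hitting slurmctld. Cache path and interval are illustrative.
import json
import subprocess
import time

CACHE = "/tmp/squeue_cache.json"   # hypothetical location
INTERVAL = 30                      # seconds between refreshes

def refresh_forever():
    while True:
        out = subprocess.run(["squeue", "--json"],
                             capture_output=True, text=True, check=True)
        with open(CACHE, "w") as fh:
            fh.write(out.stdout)
        time.sleep(INTERVAL)

def lookup_state(job_id):
    # Read the cached output instead of querying Slurm per job.
    with open(CACHE) as fh:
        data = json.load(fh)
    for job in data.get("jobs", []):
        if job.get("job_id") == job_id:
            # Note: newer Slurm versions report job_state as a list of flags.
            return job.get("job_state")
    return None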
As a side note:
In Slurm 23.x a new rate limiting feature for client RPC calls was added:
(see this commit:
https://github.com/SchedMD/slurm/commit/674f118140e171d10c2501444a0040e1492f4eab#diff-b4e84d09d9b1d817a964fb78baba0a2ea6316bfc10c1405329a95ad0353ca33e
)
This would give operators the ability
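(For anyone looking for the switch: on 23.02+ this appears to be enabled through SlurmctldParameters, roughly as below; the exact tunables that go with it should be checked against the slurm.conf(5) man page for your version.)

# slurm.conf -- sketch only; verify the rate-limiting options in
# slurm.conf(5) on 23.02+ (rl_enable plus its bucket/refill tunables)
SlurmctldParameters=rl_enable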
> > And if you are seeing a workflow management system causing trouble on
> > your system, probably the most sustainable way of getting this resolved
> > is to file issues or pull requests with the respective project, with
> > suggestions like the ones you made. For snakemake, a second good point
>
Hi David,
David Laehnemann writes:
> Dear Ward,
>
> if used correctly (and that is a big caveat for any method for
> interacting with a cluster system), snakemake will only submit as many
> jobs as can fit within the resources of the cluster at one point of
> time (or however much resources you
Sorry, I had to share that this is very much like "Are we there yet?" on
a road trip with kids :)
Slurm is trying to drive. Any communication to slurmctld will involve an
RPC call (sinfo, squeue, scontrol, etc). You can see how many with sdiag.
Too many RPC calls will cause failures. Asking slu
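(The RPC counters referred to here come from `sdiag`; a small sketch for pulling out that section of its output, with the caveat that the header wording and layout can differ between Slurm versions.)

# show_rpc_counts.py -- illustrative only: print the per-message-type RPC
# statistics section of `sdiag` to see what is hammering slurmctld.
import subprocess

out = subprocess.run(["sdiag"], capture_output=True, text=True,
                     check=True).stdout

printing = False
for line in out.splitlines():
    # Header wording may vary slightly between Slurm versions.
    if "Remote Procedure Call statistics by message type" in line:
        printing = True
    elif printing and not line.strip():
        break
    if printing:
        print(line)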
Dear Ward,
if used correctly (and that is a big caveat for any method for
interacting with a cluster system), snakemake will only submit as many
jobs as can fit within the resources of the cluster at one point of
time (or however much resources you tell snakemake that it can use). So
unless there
On 24/02/2023 18:34, David Laehnemann wrote:
Those queries then should not have to happen too often, although do you
have any indication of a range for when you say "you still wouldn't
want to query the status too frequently." Because I don't really, and
would probably opt for some compromise of
Hi Chris, hi Sean,
thanks also (and thanks again) for chiming in.
Quick follow-up question:
Would `squeue` be a better fall-back command than `scontrol` from the
perspective of keeping `slurmctld` responsive? From what I can see in
the general overview of how slurm works (
https://slurm.schedmd.
On 23/2/23 2:55 am, David Laehnemann wrote:
And consequently, would using `scontrol` thus be the better default
option (as opposed to `sacct`) for repeated job status checks by a
workflow management system?
Many others have commented on this, but use of scontrol in this way is
really really bad
Hi David,
> Those queries then should not have to happen too often, although do you
> have any indication of a range for when you say "you still wouldn't
> want to query the status too frequently." Because I don't really, and
> would probably opt for some compromise of every 30 seconds or so.
>
Eve
Hi Sean,
Thanks again for all the feedback!
I'll definitely try to implement batch queries, then. Both for the
default `sacct` query and for the fallback `scontrol` query. Also see
here:
https://github.com/snakemake/snakemake/pull/2136#issuecomment-1443295051
Those queries then should not have t
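(A batched `sacct` call along these lines is roughly what that looks like in practice; the format fields and parsing are illustrative, not snakemake's actual code.)

# batched_status.py -- sketch: query many job IDs with a single sacct call
# instead of one call per job.
import subprocess

def sacct_states(job_ids):
    # One sacct invocation for the whole batch; -X keeps it to one row per
    # job allocation, --parsable2 gives clean pipe-separated output.
    out = subprocess.run(
        ["sacct", "-X", "--parsable2", "--noheader",
         "--format=JobID,State",
         "--jobs", ",".join(str(j) for j in job_ids)],
        capture_output=True, text=True, check=True,
    ).stdout
    states = {}
    for line in out.splitlines():
        jobid, state = line.split("|", 1)
        states[jobid] = state
    return states

# e.g. sacct_states([123456, 123457, 123458])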
Hi David,
On Thu, Feb 23, 2023 at 10:50 AM David Laehnemann
wrote:
> But from your comment I understand that handling these queries in
> batches would be less work for slurmdbd, right? So instead of querying
> each jobid with a separate database query, it would do one database
> query for the wh
Hi Sean,
yes, this is exactly what snakemake currently does. I didn't write that
code, but from my previous debugging, I think handling one job at a
time was simply the logic of the general executor for cluster systems,
and makes things like querying via scontrol as a fallback easier to
handle. Bu
Hi David,
On Thu, Feb 23, 2023 at 8:51 AM David Laehnemann
wrote:
> Quick follow-up question: do you have any indication of the rate of job
> status checks via sacct that slurmdbd will gracefully handle (per
> second)? Or any suggestions how to roughly determine such a rate for a
> given cluster
Hi David,
David Laehnemann writes:
[snip (16 lines)]
> P.S.: @Loris and @Noam: Exactly, snakemake is a software distinct from
> slurm that you can use to orchestrate large analysis workflows---on
> anything from a desktop or laptop computer to all kinds of cluster /
> cloud systems. In the case
Hi Sean, hi everybody,
thanks a lot for the quick insights!
My takeaway is: sacct is the better default for putting in lots of job
status checks after all, as it will not impact the slurmctld scheduler.
Quick follow-up question: do you have any indication of the rate of job
status checks via sac
Hi David,
scontrol - interacts with slurmctld using RPC, so it is faster, but
requests put load on the scheduler itself.
sacct - interacts with slurmdbd, so it doesn't place additional load on the
scheduler.
There is a balance to reach, but the scontrol approach is riskier and can
start to interf
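(In code, the "sacct by default, scontrol only as a fallback" pattern discussed in this thread might look roughly like this; flags and parsing are illustrative.)

# status_with_fallback.py -- sketch of preferring slurmdbd (sacct) and only
# falling back to a slurmctld RPC (scontrol) when sacct has nothing yet.
import subprocess

def job_state(job_id):
    # Preferred path: ask slurmdbd via sacct (no load on the scheduler).
    out = subprocess.run(
        ["sacct", "-X", "--noheader", "--parsable2",
         "--format=State", "-j", str(job_id)],
        capture_output=True, text=True,
    ).stdout.strip()
    if out:
        return out.splitlines()[0]
    # Fallback: ask slurmctld directly -- this is an RPC, so use sparingly.
    out = subprocess.run(
        ["scontrol", "-o", "show", "job", str(job_id)],
        capture_output=True, text=True,
    ).stdout
    for token in out.split():
        if token.startswith("JobState="):
            return token.split("=", 1)[1]
    return None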
On Feb 23, 2023, at 7:40 AM, Loris Bennett wrote:
Hi David,
David Laehnemann writes:
by a
workflow management system?
I am probably being a bit naive, but I would have thought that the batch
system should just be able to start
Hi David,
David Laehnemann writes:
> Dear Slurm users and developers,
>
> TL;DR:
> Do any of you know if `scontrol` status checks of jobs are always
> expected to be quicker than `sacct` job status checks? Do you have any
> comparative timings between the two commands?
> And consequently, would
Dear Slurm users and developers,
TL;DR:
Do any of you know if `scontrol` status checks of jobs are always
expected to be quicker than `sacct` job status checks? Do you have any
comparative timings between the two commands?
And consequently, would using `scontrol` thus be the better default
option
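(For the "comparative timings" question, a quick probe like the one below can give site-specific numbers; keep the repeat count small so the probe itself does not hammer slurmctld, and treat the result as indicative only.)

# time_status_checks.py -- rough probe, not a benchmark: average wall time
# of repeated status checks via sacct vs scontrol for one job ID.
import subprocess
import sys
import time

def avg_seconds(cmd, n=10):
    start = time.perf_counter()
    for _ in range(n):
        subprocess.run(cmd, capture_output=True, text=True)
    return (time.perf_counter() - start) / n

if __name__ == "__main__":
    jobid = sys.argv[1]
    print("sacct   :", avg_seconds(["sacct", "-X", "-n", "-o", "State",
                                    "-j", jobid]))
    print("scontrol:", avg_seconds(["scontrol", "-o", "show", "job",
                                    jobid]))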