Dear Ward,

if used correctly (and that is a big caveat for any method of interacting with a cluster system), snakemake will only submit as many jobs as fit within the resources of the cluster at any one time (or within however many resources you tell snakemake it can use). So unless there are thousands of cores available (or you "lie" to snakemake, telling it that there are many more cores than actually exist), it will only ever have hundreds of jobs submitted at once (or far fewer, if the jobs each require multiple cores). Accordingly, any status queries will also only cover the number of jobs that snakemake currently has submitted. And snakemake will only submit new jobs once it registers previously submitted jobs as finished.
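For illustration, a minimal sketch of what such a rate-limited invocation could look like with Snakemake 7 command-line flags (the numbers here are arbitrary examples, not recommendations for your site):

```shell
# Cap the number of jobs snakemake keeps in the Slurm queue at any one
# time with --jobs; new jobs are only submitted as earlier ones finish.
# --max-status-checks-per-second additionally throttles status queries.
snakemake --cluster "sbatch --cpus-per-task={threads}" \
    --jobs 100 \
    --max-status-checks-per-second 1
```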
So workflow managers can actually help reduce the strain on the scheduler, by only ever submitting jobs within the general limits of the system (as opposed to, for example, using some bash loop to submit all of your analysis steps or samples at once).

In addition, snakemake has a mechanism to batch a number of smaller jobs into larger jobs for submission on the cluster (especially via `--group-components`), so this might be something to suggest to those of your users who cause trouble while using snakemake:
https://snakemake.readthedocs.io/en/latest/executing/grouping.html

The query mechanism for job status is a different story. I'm specifically here on this mailing list to get as much input as possible to improve it -- and I welcome anybody who wants to chime in on my respective work-in-progress pull request right here:
https://github.com/snakemake/snakemake/pull/2136

And if you see a workflow management system causing trouble on your system, probably the most sustainable way of getting this resolved is to file issues or pull requests with the respective project, with suggestions like the ones you made. For snakemake, a second good place to chime in right now would be the issue discussing Slurm job array support:
https://github.com/snakemake/snakemake/issues/301

For Nextflow, another commonly used workflow manager in my field (bioinformatics), there's also an issue discussing Slurm job array support:
https://github.com/nextflow-io/nextflow/issues/1477

cheers,
david

On Mon, 2023-02-27 at 13:24 +0100, Ward Poelmans wrote:
> On 24/02/2023 18:34, David Laehnemann wrote:
> > Those queries then should not have to happen too often, although do
> > you have any indication of a range for when you say "you still
> > wouldn't want to query the status too frequently." Because I don't
> > really, and would probably opt for some compromise of every 30
> > seconds or so.
>
> I think this is exactly why hpc sys admins are sometimes not very
> happy about these tools. You're talking about 10000 of jobs on one
> hand yet you want fetch the status every 30 seconds? What is the
> point of that other then overloading the scheduler?
>
> We're telling your users not to query the slurm too often and usually
> give 5 minutes as a good interval. You have to let slurm do it's job.
> There is no point in querying in a loop every 30 seconds when we're
> talking about large numbers of jobs.
>
> Ward