We have many jupyterhub jobs on our cluster that also do a lot of job queries. We could have adjusted the query interval, but what I did instead is have a single process query all jobs with `squeue --json`, and the jupyterhub query script then looks up its job in that output.

So instead of every jupyterhub job querying the batch system, only one process does. This is specific to the hub environment, but if a lot of users run
snakemake you hit the same problem.
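
For illustration, a minimal sketch of that single-poller setup; the cache file location, refresh interval and the exact JSON layout of `squeue --json` all depend on your site and Slurm version, so treat the names here as assumptions:

```python
#!/usr/bin/env python3
"""Single poller: dump `squeue --json` to a shared cache file so that
per-job status checks read the file instead of hitting slurmctld."""

import json
import os
import subprocess
import tempfile
import time

CACHE_FILE = "/var/cache/slurm/squeue.json"   # assumed shared location
INTERVAL = 30                                 # assumed refresh interval (s)

def refresh_cache():
    # One squeue call for *all* jobs instead of one call per job.
    out = subprocess.run(["squeue", "--json"], check=True,
                         capture_output=True, text=True).stdout
    # Write atomically so readers never see a half-written file.
    fd, tmp = tempfile.mkstemp(dir=os.path.dirname(CACHE_FILE))
    with os.fdopen(fd, "w") as f:
        f.write(out)
    os.replace(tmp, CACHE_FILE)

def job_state(jobid):
    """What the per-job query script does: look up one job in the cache.
    The "jobs"/"job_id"/"job_state" keys follow the squeue JSON output,
    but the exact layout differs between Slurm versions."""
    with open(CACHE_FILE) as f:
        data = json.load(f)
    for job in data.get("jobs", []):
        if job.get("job_id") == jobid:
            return job.get("job_state")
    return None  # not in the queue (finished, or cache is stale)

if __name__ == "__main__":
    while True:
        refresh_cache()
        time.sleep(INTERVAL)
```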


As an admin I can understand the queries, and it is not only snakemake; there are plenty of other tools, like the hub, that also do a lot of queries. Some kind
of caching mechanism would be nice. Most sites solve it with a wrapper script.
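
Such a caching wrapper could look roughly like this; the TTL, cache path and location of the real squeue binary are just assumptions, and a real wrapper would also key the cache on the arguments:

```python
#!/usr/bin/env python3
"""Illustrative caching wrapper: put this earlier in PATH than the real
squeue so repeated status checks within TTL seconds reuse one result."""

import os
import subprocess
import sys
import time

REAL_SQUEUE = "/usr/bin/squeue"                 # assumed location
CACHE = f"/tmp/squeue-cache-{os.getuid()}.out"  # per-user cache file
TTL = 60                                        # seconds to reuse the output

def fresh(path, ttl):
    """True if the cache file exists and is younger than ttl seconds."""
    try:
        return (time.time() - os.path.getmtime(path)) < ttl
    except FileNotFoundError:
        return False

if __name__ == "__main__":
    if not fresh(CACHE, TTL):
        # Simplification: the cache ignores the arguments; a real wrapper
        # would include them in the cache key.
        out = subprocess.run([REAL_SQUEUE] + sys.argv[1:], check=True,
                             capture_output=True, text=True).stdout
        with open(CACHE, "w") as f:
            f.write(out)
    with open(CACHE) as f:
        sys.stdout.write(f.read())
```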

Just my 2 cents


On 27/02/2023 15:53, Brian Andrus wrote:
Sorry, I had to share that this is very much like "Are we there yet?" on a road trip with kids :)

Slurm is trying to drive. Any communication with slurmctld involves an RPC call (sinfo, squeue, scontrol, etc.); you can see how many with sdiag. Too many RPC calls will cause failures. Asking slurmdbd will not do that to you. In fact, you could run a separate slurmdbd just for queries if you wanted. This is why that was suggested as the better option.
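
As a rough sketch of that route, something like the following asks sacct (which talks to slurmdbd) for a whole batch of job states in one call instead of hitting slurmctld once per job; the field list and parsing are only illustrative:

```python
import subprocess

def job_states(jobids):
    """Fetch the state of many jobs at once through sacct/slurmdbd,
    avoiding RPC load on slurmctld."""
    out = subprocess.run(
        ["sacct", "--noheader", "--parsable2",
         "--format=JobID,State",
         "-j", ",".join(str(j) for j in jobids)],
        capture_output=True, text=True, check=True).stdout
    states = {}
    for line in out.splitlines():
        jobid, state = line.split("|", 1)
        if "." not in jobid:          # skip job steps like 1234.batch
            states[jobid] = state
    return states

# e.g. job_states([1234, 1235]) -> {"1234": "RUNNING", "1235": "COMPLETED"}
```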

So even running 'squeue' once every few seconds will impact the system, and more so the larger the system is. We have had that issue with users running 'watch squeue' and had to address it.

IMHO, the true solution is: if a job's info NEEDS to be updated that often, have the job itself report what it is doing (but NOT via Slurm commands). There are numerous ways to do that for most jobs.

Perhaps a few additional lines could be added to the job that call a snakemake API and report its status? Or maybe such an API could be created/expanded.
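
Purely as an illustration of the idea (no such snakemake endpoint exists as far as I know; the URL, environment variable and payload here are hypothetical), the job itself could push its status like this:

```python
import json
import os
import urllib.request

# Hypothetical endpoint: the point is only that the job pushes its own
# status instead of the workflow engine polling Slurm for it.
STATUS_URL = os.environ.get("WORKFLOW_STATUS_URL",
                            "http://workflow-host:8080/api/job-status")

def report(status):
    """Called from inside the batch job, e.g. at start, finish, failure."""
    payload = json.dumps({
        "jobid": os.environ.get("SLURM_JOB_ID"),  # set by Slurm in the job
        "status": status,
    }).encode()
    req = urllib.request.Request(STATUS_URL, data=payload,
                                 headers={"Content-Type": "application/json"})
    urllib.request.urlopen(req, timeout=5)

# In the job script:
#   report("started"); ... do the work ...; report("finished")
```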

Just a quick 2 cents (We may be up to a few dollars with all of those so far).

Brian Andrus


On 2/27/2023 4:24 AM, Ward Poelmans wrote:
On 24/02/2023 18:34, David Laehnemann wrote:
Those queries then should not have to happen too often, although do you
have any indication of a range for when you say "you still wouldn't
want to query the status too frequently"? Because I don't really, and
would probably opt for some compromise of every 30 seconds or so.

I think this is exactly why HPC sysadmins are sometimes not very happy about these tools. You're talking about 10000s of jobs on the one hand, yet you want to fetch the status every 30 seconds? What is the point of that other than overloading the scheduler?

We're telling our users not to query Slurm too often and usually give 5 minutes as a good interval. You have to let Slurm do its job. There is no point in querying in a loop every 30 seconds when we're talking about large numbers of jobs.
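
If polling is unavoidable, it could at least be one batched query at that kind of interval rather than a per-job loop; a rough sketch, with the interval and format string only as examples:

```python
import subprocess
import time

POLL_INTERVAL = 300   # the ~5 minute interval suggested above

def poll(outstanding):
    """One squeue call for all outstanding job IDs (strings) instead of
    one call per job."""
    out = subprocess.run(
        ["squeue", "--noheader", "--format=%i %T",
         "--jobs", ",".join(outstanding)],
        capture_output=True, text=True).stdout
    seen = {}
    for line in out.splitlines():
        jobid, state = line.split(maxsplit=1)
        seen[jobid] = state
    # squeue may complain about IDs it no longer knows; jobs missing from
    # the output have left the queue and their final state could be
    # fetched once from sacct.
    return seen

# while outstanding: states = poll(outstanding); ...; time.sleep(POLL_INTERVAL)
```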


Ward


--
Bas van der Vlies
| High Performance Computing & Visualization | SURF | Science Park 140 | 1098 XG Amsterdam
| T +31 (0) 20 800 1300  | bas.vandervl...@surf.nl | www.surf.nl |
