Re: [slurm-users] Providing users with info on wait time vs. run time

Ole Holm Nielsen Thu, 15 Sep 2022 04:55:14 -0700

On 9/15/22 12:02, Loris Bennett wrote:

Today I spotted a job which requested an entire node, then had to wait
for around 16 hours and finally ran, apparently successfully, for less
than 4 minutes.


As it currently seems in general fashionable for users round here to
request the maximum number of cores available on a node without doing
any scaling experiments or considering backfill, it seems like it would
be a good idea to provide them with some feedback on wait/run times.

One option would be to write the information into the Slurm 'out' file
(currently we insert the output of 'seff').  Another option would be to
aggregate the times over, say, a month and provide the absolute totals
and maybe a run-to-wait ratio.

Has anyone already done anything like this?

Perhaps marginally relevant: The slurmacct script reports an "Average queue hours" column which is the waiting time:

https://github.com/OleHolmNielsen/Slurm_tools/tree/master/slurmacct

It would be possible to generate a job summary with waiting time divided by run time by changing the script.
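
As a rough illustration of such a summary, independent of slurmacct, here is a minimal Python sketch that totals wait and run times from pipe-separated accounting records. It assumes the input was produced by something like 'sacct -X -n -P -o JobID,Submit,Start,End' with the default ISO-style timestamps; the function name summarize_wait_run and the sample records are hypothetical, not part of any existing tool.

```python
from datetime import datetime

# Assumed timestamp layout of sacct's Submit/Start/End fields.
TIME_FORMAT = "%Y-%m-%dT%H:%M:%S"


def summarize_wait_run(sacct_text):
    """Return (total_wait_s, total_run_s, run_to_wait_ratio).

    Expects one job per line as JobID|Submit|Start|End.
    Lines with unparsable timestamps (e.g. jobs that have not
    started or finished yet) are skipped.
    """
    total_wait = 0.0
    total_run = 0.0
    for line in sacct_text.splitlines():
        fields = line.strip().split("|")
        if len(fields) != 4:
            continue
        _job_id, submit, start, end = fields
        try:
            t_submit = datetime.strptime(submit, TIME_FORMAT)
            t_start = datetime.strptime(start, TIME_FORMAT)
            t_end = datetime.strptime(end, TIME_FORMAT)
        except ValueError:
            continue  # e.g. "Unknown" for pending or running jobs
        total_wait += (t_start - t_submit).total_seconds()
        total_run += (t_end - t_start).total_seconds()
    # The ratio is undefined if no job had to wait at all.
    ratio = total_run / total_wait if total_wait > 0 else None
    return total_wait, total_run, ratio


if __name__ == "__main__":
    # Hypothetical records: one job that waited 16 hours and ran
    # 4 minutes, and one job that is still pending.
    sample = (
        "123|2022-09-14T10:00:00|2022-09-15T02:00:00|2022-09-15T02:04:00\n"
        "124|2022-09-15T03:00:00|Unknown|Unknown\n"
    )
    wait_s, run_s, ratio = summarize_wait_run(sample)
    print(f"wait: {wait_s / 3600:.1f} h, run: {run_s / 60:.1f} min")
    if ratio is not None:
        print(f"run-to-wait ratio: {ratio:.4f}")
```

The parsing is kept separate from the sacct call so that the same function could be fed a month's worth of records per user, or a single job's record from an epilog script.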


/Ole
