Hi Simon,
Maybe you could print the user's limits using this tool:
https://github.com/OleHolmNielsen/Slurm_tools/tree/master/showuserlimits
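If I remember the script's options correctly (do check its -h output first, the flags may differ between versions), something like this prints the limits and the current usage that slurmctld is actually enforcing for the user:

$ showuserlimits -u andrewss
# lists the user's associations with each limit (e.g. GrpTRES cpu)
# alongside the usage currently counted against it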
Which version of Slurm do you run?
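(For reference, any node with the Slurm client commands installed can report it:)

$ scontrol version    # prints the installed Slurm version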
/Ole
On 3/14/24 17:47, Simon Andrews via slurm-users wrote:
Our cluster has developed a strange intermittent behaviour where jobs are
being put into a pending state because they fail the AssocGrpCpuLimit check,
even though the submitting user has enough CPUs left under their limit for
the job to run.
For example:
$ squeue -o "%.6i %.9P %.8j %.8u %.2t %.10M %.7m %.7c %.20R"
 JOBID PARTITION     NAME     USER ST       TIME MIN_MEM MIN_CPU     NODELIST(REASON)
   799    normal hostname andrewss PD       0:00      2G       5   (AssocGrpCpuLimit)
...so the job isn't running, and it's the only job in the queue, but:
$ sacctmgr list associations part=normal user=andrewss format=Account,User,Partition,Share,GrpTRES
   Account       User  Partition     Share       GrpTRES
---------- ---------- ---------- --------- -------------
  andrewss   andrewss     normal         1         cpu=5
That user has a limit of 5 CPUs so the job should run.
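A useful cross-check here (assuming a Slurm version recent enough to have the assoc_mgr view in scontrol) is what slurmctld itself holds in memory for that association, since that, rather than the database, is what the limit test uses:

$ scontrol show assoc_mgr users=andrewss flags=assoc
# the GrpTRES field is typically printed as cpu=<limit>(<usage>), so this
# shows both the limit and the usage slurmctld is currently counting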
The weird thing is that this effect is intermittent. A job can hang and the
queue will stall for ages, then suddenly start working again; you can submit
several jobs and they all run, until one fails again.
The cluster has active nodes and plenty of resource:
$ sinfo
PARTITION    AVAIL  TIMELIMIT  NODES  STATE NODELIST
normal*         up   infinite      2   idle compute-0-[6-7]
interactive     up 1-12:00:00      3   idle compute-1-[0-1,3]
The slurmctld log just says:
[2024-03-14T16:21:41.275] _slurm_rpc_submit_batch_job: JobId=799 InitPrio=4294901720 usec=259
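(If the version in use logs the association-limit decisions at debug2, as I believe recent ones do, temporarily raising the slurmctld log level should show why the check fails; these scontrol subcommands are standard:)

$ scontrol setdebug debug2   # more verbose slurmctld logging
# ...reproduce the pending job, then inspect slurmctld.log...
$ scontrol setdebug info     # restore the normal level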
Whilst it’s in this state I can run other jobs with core requests of up to
4 and they work, but not 5. It’s as if Slurm is adding one CPU to the
request and then denying it.
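One way to see the CPU count Slurm has actually recorded for the pending job (for example, whether the request is being rounded up to whole cores on nodes with more than one thread per core) is the job record itself; the grep pattern below is just illustrative:

$ scontrol show job 799 | grep -E 'NumNodes|NumCPUs|NumTasks|CPUs/Task|TRES'
# a NumCPUs or TRES cpu value of 6 here, despite the 5 that was requested,
# would explain tripping a cpu=5 association limit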
I’m sure I’m missing something fundamental but would appreciate it if
someone could point out what it is!