Our cluster has developed a strange intermittent behaviour where jobs are being 
held in a pending state with the reason AssocGrpCpuLimit, even though the 
submitting user's CPU limit is large enough for the job to run.

For example:

$ squeue -o "%.6i %.9P %.8j %.8u %.2t %.10M %.7m %.7c %.20R"
JOBID PARTITION     NAME     USER ST       TIME MIN_MEM MIN_CPU     NODELIST(REASON)
   799    normal hostname andrewss PD       0:00      2G       5    (AssocGrpCpuLimit)

...so the job isn't running, and it's the only job in the queue, but:

$ sacctmgr list associations part=normal user=andrewss format=Account,User,Partition,Share,GrpTRES
   Account       User  Partition     Share       GrpTRES
---------- ---------- ---------- --------- -------------
  andrewss   andrewss     normal         1         cpu=5

That user has a GrpTRES limit of cpu=5, so the 5-CPU job should be allowed to run.
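
I assume the way to see what the controller thinks that association is actually 
consuming at the time is something like this (I may well have the wrong flags 
here, so treat it as a guess):

$ scontrol show assoc_mgr users=andrewss accounts=andrewss flags=assoc

The idea would be to check whether the usage the controller reports against 
that cpu=5 limit matches what squeue shows.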

The weird thing is that the effect is intermittent.  A job can sit pending and 
the queue will stall for ages, then suddenly start working; you can submit 
several jobs and they all run, until one gets held again.

The cluster has active nodes and plenty of free resources:

$ sinfo
PARTITION   AVAIL  TIMELIMIT  NODES  STATE NODELIST
normal*        up   infinite      2   idle compute-0-[6-7]
interactive    up 1-12:00:00      3   idle compute-1-[0-1,3]

The slurmctld log just says:

[2024-03-14T16:21:41.275] _slurm_rpc_submit_batch_job: JobId=799 InitPrio=4294901720 usec=259
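
Presumably turning up the controller logging while a job is stuck would show 
why the limit check is failing; I'm guessing at the debug level and flags, so 
this may not be the right incantation:

$ scontrol setdebug debug2
$ scontrol setdebugflags +TraceJobs

(and then "scontrol setdebug info" and "scontrol setdebugflags -TraceJobs" to 
put things back afterwards).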

Whilst it's in this state I can run other jobs requesting up to 4 cores and 
they work, but not 5.  It's as if Slurm is adding one CPU to the request and 
then denying it.
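
Presumably the thing to check next is what the controller thinks the pending 
job is actually requesting compared with the cpu=5 limit; I'm guessing 
something like this would show it (not sure which fields are the relevant 
ones):

$ scontrol show job 799 | grep -E 'NumCPUs|TRES'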

I'm sure I'm missing something fundamental, but I'd appreciate it if someone 
could point out what it is!

Thanks

Simon.