Hi Simon,
Maybe you could print the user's limits using this tool:
https://github.com/OleHolmNielsen/Slurm_tools/tree/master/showuserlimits
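If I remember the script's options correctly (do check its -h output first, the flags may differ between versions), something like this prints the limits and the current usage that slurmctld is actually enforcing for the user:

$ showuserlimits -u andrewss
# lists the user's associations with each limit (e.g. GrpTRES cpu)
# alongside the usage currently counted against it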
Which version of Slurm do you run?
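(For reference, any node with the Slurm client commands installed can report it:)

$ scontrol version    # prints the installed Slurm version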
/Ole
On 3/14/24 17:47, Simon Andrews via slurm-users wrote:
Our cluster has developed a strange intermittent behaviour where jobs are
being put into a pending state because they fail the AssocGrpCpuLimit check,
even though the submitting user has enough CPUs left under their limit for
the job to run.
For example:
$ squeue -o "%.6i %.9P %.8j %.8u %.2t %.10M %.7m %.7c %.20R"
 JOBID PARTITION     NAME     USER ST       TIME MIN_MEM MIN_CPU     NODELIST(REASON)
   799    normal hostname andrewss PD       0:00      2G       5   (AssocGrpCpuLimit)
...so the job isn't running, and it's the only job in the queue, but:
$ sacctmgr list associations part=normal user=andrewss format=Account,User,Partition,Share,GrpTRES
   Account       User  Partition     Share       GrpTRES
---------- ---------- ---------- --------- -------------
  andrewss   andrewss     normal         1         cpu=5
That user has a limit of 5 CPUs so the job should run.
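A useful cross-check here (assuming a Slurm version recent enough to have the assoc_mgr view in scontrol) is what slurmctld itself holds in memory for that association, since that, rather than the database, is what the limit test uses:

$ scontrol show assoc_mgr users=andrewss flags=assoc
# the GrpTRES field is typically printed as cpu=<limit>(<usage>), so this
# shows both the limit and the usage slurmctld is currently counting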
The weird thing is that this effect is intermittent. A job can hang and the
queue will stall for ages, then suddenly start working again; you can submit
several jobs and they all run, until one fails again.
The cluster has active nodes and plenty of resource:
$ sinfo
PARTITION    AVAIL  TIMELIMIT  NODES  STATE NODELIST
normal*         up   infinite      2   idle compute-0-[6-7]
interactive     up 1-12:00:00      3   idle compute-1-[0-1,3]
The slurmctld log just says:
[2024-03-14T16:21:41.275] _slurm_rpc_submit_batch_job: JobId=799 InitPrio=4294901720 usec=259
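(If the version in use logs the association-limit decisions at debug2, as I believe recent ones do, temporarily raising the slurmctld log level should show why the check fails; these scontrol subcommands are standard:)

$ scontrol setdebug debug2   # more verbose slurmctld logging
# ...reproduce the pending job, then inspect slurmctld.log...
$ scontrol setdebug info     # restore the normal level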
Whilst it’s in this state I can run other jobs with core requests of up to
4 and they work, but not 5. It’s as if Slurm is adding one CPU to the
request and then denying it.
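One way to see the CPU count Slurm has actually recorded for the pending job (for example, whether the request is being rounded up to whole cores on nodes with more than one thread per core) is the job record itself; the grep pattern below is just illustrative:

$ scontrol show job 799 | grep -E 'NumNodes|NumCPUs|NumTasks|CPUs/Task|TRES'
# a NumCPUs or TRES cpu value of 6 here, despite the 5 that was requested,
# would explain tripping a cpu=5 association limit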
I’m sure I’m missing something fundamental but would appreciate it if
someone could point out what it is!