[slurm-users] after upgrade to 23.11.1 nodes stuck in completion state

2024-01-28 Thread Paul Raines
I finally had downtime on our cluster running 20.11.3 and decided to upgrade SLURM. All daemons were stopped on the nodes and the master. The Rocky 8 Linux OS was updated, but its configuration was not changed in any way. On the master, when I first installed 23.11.1 and tried to run slurmdbd -D -vvv at the co

Re: [slurm-users] after upgrade to 23.11.1 nodes stuck in completion state

2024-01-28 Thread Paul Raines
Some more info on what I am seeing after the 23.11.3 upgrade. Here is a case where a job is cancelled but seems permanently stuck in 'CG' state in squeue:

[2024-01-28T17:34:11.002] debug3: sched: JobId=3679903 initiated
[2024-01-28T17:34:11.002] sched: Allocate JobId=3679903 NodeList=rtx-06 #CPUs