I finally had downtime on our cluster running 20.11.3 and decided to
upgrade SLURM. All daemons were stopped on nodes and master.
The Rocky Linux 8 OS was updated, but its configuration was not changed
in any way.
On the master, when I first installed 23.11.1 and tried to run
slurmdbd -D -vvv at the co
Some more info on what I am seeing after the 23.11.3 upgrade.
Here is a case where a job was cancelled but seems permanently
stuck in the 'CG' (completing) state in squeue:
[2024-01-28T17:34:11.002] debug3: sched: JobId=3679903 initiated
[2024-01-28T17:34:11.002] sched: Allocate JobId=3679903 NodeList=rtx-06
#CPUs