Hello,

I am observing some strange scheduling behaviour on my Slurm installation (23.02.2). Some jobs take hours to be scheduled (apparently on one specific node); their pending-state reason is "Resources", even though the resources are free. While testing, I ran into this odd behaviour with salloc:

salloc --ntasks=4 --mem-per-cpu=3500M --gres=gpu:1

waits for resources, while

salloc --ntasks=4 --mem-per-cpu=3700M --gres=gpu:1

is scheduled immediately (even while the first command is still waiting).
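For reference, a quick sketch of the memory arithmetic behind the two requests (values taken from this post; the node's RealMemory is from the configuration below) shows that both should fit comfortably on the node:

```python
# Sanity check: total memory requested by each salloc variant,
# compared with the node's configured RealMemory (254000 MB).
real_memory_mb = 254000  # RealMemory of node6 (from slurm.conf below)
ntasks = 4

for mem_per_cpu_mb in (3500, 3700):
    total_mb = ntasks * mem_per_cpu_mb  # assumes 1 CPU per task
    verdict = "fits" if total_mb <= real_memory_mb else "exceeds node"
    print(f"--mem-per-cpu={mem_per_cpu_mb}M -> {total_mb} MB total ({verdict})")
```

This assumes one CPU per task; with ThreadsPerCore=2 Slurm may allocate whole cores (2 CPUs) per task, doubling the totals to 28000 MB and 29600 MB, which still fit easily.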
I have already restarted slurmd on that node as well as slurmctld; the behaviour did not change.

This is the node configuration:

NodeName=node6 NodeHostname=cluster-node6 Port=17002 CPUs=64 RealMemory=254000 Sockets=2 CoresPerSocket=16 ThreadsPerCore=2 Gres=gpu:a10:3 Weight=2 State=UNKNOWN

gres.conf:

AutoDetect=off
Name=gpu Type=a10 File=/dev/nvidia0
Name=gpu Type=a10 File=/dev/nvidia1
Name=gpu Type=a10 File=/dev/nvidia2

What could be the issue here?

Regards,
Holger