Hello,
I am observing strange behaviour in my SLURM installation (23.02.2). Some jobs 
take hours to be scheduled (probably on one specific node); the pending-state 
reason is "Resources", although resources are free.
I have experimented a bit and can reproduce the behaviour with salloc:
"salloc --ntasks=4 --mem-per-cpu=3500M --gres=gpu:1" waits for resources, 
while
"salloc --ntasks=4 --mem-per-cpu=3700M --gres=gpu:1" is scheduled immediately 
(even while the first command is still waiting).
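For reference, the two requests differ by only 800M in total, and both fit comfortably within the node's RealMemory, which makes the pending state surprising. A quick arithmetic check (assuming one CPU per task, so --mem-per-cpu applies once per task):

```python
# Total memory requested by each salloc call, versus the node's RealMemory.
ntasks = 4
pending_mb = ntasks * 3500    # the request that stays pending
granted_mb = ntasks * 3700    # the request that is scheduled immediately
real_memory_mb = 254000       # RealMemory from the node definition below

print(pending_mb, granted_mb)  # 14000 14800 -- both far below 254000
```

So the pending request actually asks for *less* memory than the one that is granted, which suggests the limit being hit is not the node's total memory.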

I have already restarted the slurmd daemon on that node as well as slurmctld; 
the behaviour did not change.

This is the node configuration:

NodeName=node6 NodeHostname=cluster-node6 Port=17002 CPUs=64 RealMemory=254000 
Sockets=2 CoresPerSocket=16 ThreadsPerCore=2 Gres=gpu:a10:3 Weight=2 
State=UNKNOWN

Gres.conf:
AutoDetect=off
Name=gpu Type=a10       File=/dev/nvidia0
Name=gpu Type=a10       File=/dev/nvidia1
Name=gpu Type=a10       File=/dev/nvidia2

What could be the issue here?

Regards,
Holger
