Hi everyone,

Is there a guide anywhere on how to figure out why jobs aren't being started?
We have a cluster with nodes of mixed sizes and power. Roughly half the cluster is currently idle even though there are ~5k jobs queued. All of the jobs are shown as pending due to priority, while only one job is marked as waiting for resources. That job needs a tiny bit more RAM than Slurm shows as available on any of the idle nodes (62.5G requested vs 62G in sinfo, although 'free -m' reports 62.8G), so it appears to be waiting for the larger nodes, which is fine.

What I don't understand is why the several thousand other jobs with very modest memory requests (4-8G) aren't starting on the small nodes. We're using Slurm 17.11 with the sched/backfill scheduler. My logic says that if that one job is waiting for the larger nodes, the smaller nodes can easily be filled with small jobs without harming its start time. So how/where can I see what Slurm is doing and why it is not starting these jobs?

One other thing: when I submit a tiny job (hostname), Slurm doesn't put it on the idle nodes but instead fits it in on the nodes that are already in use; only if I explicitly request an idle node does the test job go there.

Thanks!
Eli
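P.S. For anyone willing to dig in, this is roughly what I've looked at so far without finding an answer (1234 and somenode are placeholders, not real IDs from our cluster):

```shell
# Placeholder job ID / node name; substitute real ones.
# Expected start time and the scheduler's current reason for a pending job:
squeue --start -j 1234
scontrol show job 1234 | grep -i reason

# All pending jobs with their reason codes (%R):
squeue -t PD -o "%.10i %.9P %.8u %.2t %.10M %.6D %R" | head -n 20

# Backfill statistics: depth, cycle times, when it last ran:
sdiag

# Scheduler and node-selection configuration actually in effect:
scontrol show config | grep -iE 'SchedulerType|SchedulerParameters|SelectType'

# Memory Slurm believes a node has vs. what is allocated on it:
scontrol show node somenode | grep -iE 'RealMemory|AllocMem'
```

None of this makes it obvious to me why the backfill scheduler skips the small jobs, which is why I'm asking.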
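One more note on the memory numbers, in case my unit conversions are the problem. As I understand it, Slurm's RealMemory and --mem are in MiB, and 'free -m' also prints MiB, so in MiB my situation looks roughly like this (the exact numbers are my approximations, not copied from the cluster):

```shell
req_mb=64000    # the big job's request: 62.5G = 62.5 * 1024 MiB
real_mb=63488   # sinfo's 62G = 62 * 1024 MiB, i.e. slurm.conf RealMemory
os_mb=64307     # 'free -m' total, ~62.8G; Slurm never schedules against this

# The one job waiting for resources: its request exceeds every
# idle node's RealMemory, so it can only start on the larger nodes.
if [ "$req_mb" -gt "$real_mb" ]; then
    echo "pending: ${req_mb} MiB requested > ${real_mb} MiB RealMemory"
fi
```

Which would explain that one job staying pending, but not why the 4-8G jobs queue up behind it.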