On 6/2/22 14:02, tluchko wrote:
Hello,
I have recently started to have problems where jobs sit in the queue
waiting for resources to become available, even when the resources are
available. If I stop and restart slurmctld, the pending jobs start running.
This seems to be related to GRES jobs. I have seven nodes with
Gres=bandwidth:ib:no_consume:1G
four nodes with
Gres=gpu:gtx_titan_x:4,bandwidth:ethernet:no_consume:1G
and one node with
Gres=gpu:rtx_2080_ti:4,bandwidth:ethernet:no_consume:1G
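For reference, a rough sketch of how this kind of setup might look in
slurm.conf and gres.conf (the node names here are placeholders I made up
for illustration; only the Gres strings above are from our real config):

    # slurm.conf (hypothetical node names)
    NodeName=ib[01-07]  Gres=bandwidth:ib:no_consume:1G
    NodeName=gpu[01-04] Gres=gpu:gtx_titan_x:4,bandwidth:ethernet:no_consume:1G
    NodeName=gpu05      Gres=gpu:rtx_2080_ti:4,bandwidth:ethernet:no_consume:1G

    # gres.conf on the IB nodes; CountOnly means slurmd does not
    # look for a device file for this GRES
    Name=bandwidth Type=ib Count=1G Flags=CountOnly

    # gres.conf on the GPU/Ethernet nodes
    Name=gpu Type=gtx_titan_x File=/dev/nvidia[0-3]
    Name=bandwidth Type=ethernet Count=1G Flags=CountOnly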
Jobs only sit in the queue with RESOURCES as the REASON when we include
the flag --gres=bandwidth:ib. If we remove the flag, the jobs run fine,
but we need it to make sure a job doesn't land on a mix of IB and
Ethernet nodes, because jobs fail in that case.
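The jobs request the GRES roughly like this (the script below is only
illustrative; the program name is made up):

    #!/bin/bash
    #SBATCH --nodes=2
    #SBATCH --ntasks-per-node=8
    #SBATCH --gres=bandwidth:ib    # restrict the job to the InfiniBand nodes
    srun ./my_mpi_program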
It seems that once a node completes a job with --gres=bandwidth:ib it
won't run another job with this setting until I restart slurmctld.
The only error I can find is in /var/log/slurm/slurmctld.log
[2022-05-31T03:27:49.144] error: gres/bandwidth: _step_dealloc
StepId=140569.0 dealloc, node_in_use is NULL
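When a node gets stuck like this, it may help to compare what the
controller thinks is allocated against reality; something like the
following (node name is hypothetical, and if I remember right the -d
flag is what adds the GresUsed line):

    # per-node GRES allocation as slurmctld sees it
    scontrol -d show node ib01

    # pending jobs and the reason the scheduler gives for each
    squeue --state=PENDING -o "%.10i %.9P %.20j %.6D %.20R"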
These jobs were running consistently but then started giving us trouble
about a month ago. I have tried restarting slurmd on all nodes and
slurmctld. Restarting slurmctld does provide a temporary fix.
I'm using Slurm 21.08.3 and Rocky Linux release 8.5.
Do you have any suggestions as to what is wrong or how to fix it?
Thank you,
Tyler
An alternative way to deal with this is the topology plugin. We use
this to keep jobs from spanning two different InfiniBand fabrics that
are connected via lower bandwidth than the rest of the fabric.
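In case it helps, a rough sketch of what that looks like (the switch
and node names below are placeholders, not our actual fabric):

    # slurm.conf
    TopologyPlugin=topology/tree

    # topology.conf - one leaf switch per fabric, joined by a top-level switch
    SwitchName=ibfabric1 Nodes=ib[01-04]
    SwitchName=ibfabric2 Nodes=ib[05-07]
    SwitchName=top       Switches=ibfabric[1-2]

Jobs that must stay on a single fabric can then also add --switches=1
to their submission.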
--
#BlackLivesMatter
Ryan Novosielski - novos...@rutgers.edu
Sr. Technologist - 973/972.0922 ~*~ RBHS Campus
Office of Advanced Res. Comp. - MSB C630, Newark
Rutgers, the State University of NJ