On 6/2/22 14:02, tluchko wrote:
Hello,

I have recently started to have problems where jobs sit in the queue waiting for resources to become available, even when the resources are available. If I stop and restart slurmctld, the pending jobs start running.

This seems to be related to GRES jobs.  I have seven nodes with

Gres=bandwidth:ib:no_consume:1G

four nodes with

Gres=gpu:gtx_titan_x:4,bandwidth:ethernet:no_consume:1G

and one node with

Gres=gpu:rtx_2080_ti:4,bandwidth:ethernet:no_consume:1G
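For reference, these GRES are declared per node in slurm.conf, with a matching gres.conf entry on each node where needed. A rough sketch of what that looks like (the node names below are placeholders, not our real hostnames):

# slurm.conf (hostnames here are hypothetical)
GresTypes=gpu,bandwidth
NodeName=ib[01-07]  Gres=bandwidth:ib:no_consume:1G
NodeName=gpu[01-04] Gres=gpu:gtx_titan_x:4,bandwidth:ethernet:no_consume:1G
NodeName=gpu05      Gres=gpu:rtx_2080_ti:4,bandwidth:ethernet:no_consume:1G

# gres.conf on the IB nodes; CountOnly because there is no device file to list
Name=bandwidth Type=ib Count=1G Flags=CountOnly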

Jobs only sit in the queue with RESOURCES as the REASON when we include the flag --gres=bandwidth:ib.  If we remove the flag, the jobs run fine.  But we need the flag to ensure that we don't get a mix of IB and Ethernet nodes, because jobs fail when the nodes are mixed.
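For completeness, a typical submission on our end looks roughly like this (the script name and sizes are just placeholders):

sbatch --gres=bandwidth:ib --nodes=2 --ntasks-per-node=32 job.sh

or, equivalently, inside the batch script itself:

#SBATCH --gres=bandwidth:ib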

It seems that once a node completes a job with --gres=bandwidth:ib it won't run another job with this setting until I restart slurmctld.

The only error I can find is in /var/log/slurm/slurmctld.log:

[2022-05-31T03:27:49.144] error: gres/bandwidth: _step_dealloc StepId=140569.0 dealloc, node_in_use is NULL

These jobs had been running consistently but started giving us trouble about a month ago. I have tried restarting slurmd on all nodes as well as slurmctld; restarting slurmctld only provides a temporary fix.

I'm using Slurm 21.08.3 and Rocky Linux release 8.5.

Do you have any suggestions as to what is wrong or how to fix it?

Thank you,

Tyler

An alternative way to deal with this is the topology plugin. We use it to keep jobs from spanning two different InfiniBand fabrics that are connected to each other at lower bandwidth than the rest of the fabric.
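A minimal sketch of that setup, with made-up switch and node names (a real topology.conf is of course site-specific): in slurm.conf we enable the tree topology plugin,

TopologyPlugin=topology/tree

and then describe each fabric as its own leaf switch in topology.conf:

# topology.conf -- hypothetical names; one leaf switch per InfiniBand fabric
SwitchName=fabric_a Nodes=node[001-032]
SwitchName=fabric_b Nodes=node[033-064]
# the top-level switch represents the slower link joining the two fabrics
SwitchName=core Switches=fabric_a,fabric_b

With that in place, the scheduler prefers to place each job within a single leaf switch, and users can add --switches=1 to sbatch/srun to ask for a single leaf switch (i.e. a single fabric) for the job.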

--
#BlackLivesMatter
____
 || \\UTGERS,     |----------------------*O*------------------------
 ||_// the State  |    Ryan Novosielski - novos...@rutgers.edu
 || \\ University | Sr. Technologist - 973/972.0922 ~*~ RBHS Campus
 ||  \\    of NJ  | Office of Advanced Res. Comp. - MSB C630, Newark
      `'
