On 6/2/22 14:02, tluchko wrote:
Hello,

I have recently started to have problems where jobs sit in the queue waiting for resources to become available, even when the resources are available. If I stop and restart slurmctld, the pending jobs start running.

This seems to be related to GRES jobs.  I have seven nodes with

Gres=bandwidth:ib:no_consume:1G

four nodes with

Gres=gpu:gtx_titan_x:4,bandwidth:ethernet:no_consume:1G

and one node with

Gres=gpu:rtx_2080_ti:4,bandwidth:ethernet:no_consume:1G
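For reference, these GRES are declared per node in slurm.conf, with a matching gres.conf entry on each node where needed. A rough sketch of what that looks like (the node names below are placeholders, not our real hostnames):

# slurm.conf (hostnames here are hypothetical)
GresTypes=gpu,bandwidth
NodeName=ib[01-07]  Gres=bandwidth:ib:no_consume:1G
NodeName=gpu[01-04] Gres=gpu:gtx_titan_x:4,bandwidth:ethernet:no_consume:1G
NodeName=gpu05      Gres=gpu:rtx_2080_ti:4,bandwidth:ethernet:no_consume:1G

# gres.conf on the IB nodes; CountOnly because there is no device file to list
Name=bandwidth Type=ib Count=1G Flags=CountOnly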

Jobs only sit in the queue with RESOURCES as the REASON when we include the flag --gres=bandwidth:ib.  If we remove the flag, the jobs run fine.  But we need the flag to ensure that we don't get a mix of IB and Ethernet nodes, because jobs fail when the nodes are mixed.
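For completeness, a typical submission on our end looks roughly like this (the script name and sizes are just placeholders):

sbatch --gres=bandwidth:ib --nodes=2 --ntasks-per-node=32 job.sh

or, equivalently, inside the batch script itself:

#SBATCH --gres=bandwidth:ib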

It seems that once a node completes a job with --gres=bandwidth:ib it won't run another job with this setting until I restart slurmctld.

The only error I can find is in /var/log/slurm/slurmctld.log:

[2022-05-31T03:27:49.144] error: gres/bandwidth: _step_dealloc StepId=140569.0 dealloc, node_in_use is NULL

These jobs had been running consistently but started giving us trouble about a month ago. I have tried restarting slurmd on all nodes as well as slurmctld; restarting slurmctld only provides a temporary fix.

I'm using Slurm 21.08.3 and Rocky Linux release 8.5.

Do you have any suggestions as to what is wrong or how to fix it?

Thank you,

Tyler

An alternative way to deal with this is the topology plugin. We use it to keep jobs from spanning two different InfiniBand fabrics that are connected to each other at lower bandwidth than the rest of the fabric.
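A minimal sketch of that setup, with made-up switch and node names (a real topology.conf is of course site-specific): in slurm.conf we enable the tree topology plugin,

TopologyPlugin=topology/tree

and then describe each fabric as its own leaf switch in topology.conf:

# topology.conf -- hypothetical names; one leaf switch per InfiniBand fabric
SwitchName=fabric_a Nodes=node[001-032]
SwitchName=fabric_b Nodes=node[033-064]
# the top-level switch represents the slower link joining the two fabrics
SwitchName=core Switches=fabric_a,fabric_b

With that in place, the scheduler prefers to place each job within a single leaf switch, and users can add --switches=1 to sbatch/srun to ask for a single leaf switch (i.e. a single fabric) for the job.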

--
#BlackLivesMatter
____
 || \\UTGERS,     |----------------------*O*------------------------
 ||_// the State  |    Ryan Novosielski - novos...@rutgers.edu
 || \\ University | Sr. Technologist - 973/972.0922 ~*~ RBHS Campus
 ||  \\    of NJ  | Office of Advanced Res. Comp. - MSB C630, Newark
      `'
