Hi Janna,
If you're running an old Slurm version, you may be hitting bugs that are
already fixed in later versions. You can search for bugs with ReqNodeNotAvail in
the title:
https://bugs.schedmd.com/buglist.cgi?quicksearch=ReqNodeNotAvail
For example, this one might be relevant:
https://bugs.schedm
In case your ARP cache is the problem, there is some advice on this Wiki
page:
https://wiki.fysik.dtu.dk/niflheim/Slurm_configuration#configure-arp-cache-for-large-networks
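To check whether the ARP cache is actually close to its limits before changing anything, something along these lines might help. It is only a rough sketch, and it makes some assumptions: a Linux head node, an IPv4 neighbour table exposed in /proc/net/arp, and limits given by the kernel's gc_thresh2 (soft) and gc_thresh3 (hard) settings under /proc/sys/net/ipv4/neigh/default. The function names are my own and do not come from Slurm or the Wiki page.

```python
from pathlib import Path

# Assumed Linux locations for the IPv4 neighbour (ARP) table and its limits.
ARP_TABLE = Path("/proc/net/arp")
THRESH_DIR = Path("/proc/sys/net/ipv4/neigh/default")


def count_arp_entries(table_text):
    """Count entries in /proc/net/arp-style text (the first line is a header)."""
    lines = [ln for ln in table_text.splitlines() if ln.strip()]
    return max(len(lines) - 1, 0)


def classify(count, soft_max, hard_max):
    """Describe where the entry count sits relative to the two limits."""
    if count >= hard_max:
        return "at or above gc_thresh3 (hard limit)"
    if count >= soft_max:
        return "at or above gc_thresh2 (soft limit)"
    return "below gc_thresh2"


def main():
    try:
        count = count_arp_entries(ARP_TABLE.read_text())
        soft = int((THRESH_DIR / "gc_thresh2").read_text())
        hard = int((THRESH_DIR / "gc_thresh3").read_text())
    except (OSError, ValueError):
        # Not Linux, or the files are not readable in this environment.
        print("Could not read the ARP table or the gc_thresh settings.")
        return
    print(f"{count} ARP entries: {classify(count, soft, hard)}")


if __name__ == "__main__":
    main()
```

If the count on the head node sits near or above the soft limit, that would be consistent with the ARP cache explanation. A count well below it suggests looking at the other causes instead.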
I think there are other causes for ReqNodeNotAvail, for example, the
node being allocated for other jobs. The "scontrol sh
Hi Janna;
It sounds like an ARP cache table problem to me. If your Slurm head node
can reach ~1000 or more network devices (all connected network
cards, switches, etc., even if they are reachable through different ports of the
server), you need to increase some network settings on the head node and
serve
On Friday, 10 July 2020 3:34:44 PM PDT Janna Ore Nugent wrote:
> I’ve got an intermittent situation with gpu nodes that sinfo says are
> available and idle, but squeue reports as “ReqNodeNotAvail”. We’ve cycled
> the nodes to restart services but it hasn’t helped. Any suggestions for
> resolving
Hi All,
I’ve got an intermittent situation with gpu nodes that sinfo says are available
and idle, but squeue reports as “ReqNodeNotAvail”. We’ve cycled the nodes to
restart services but it hasn’t helped. Any suggestions for resolving this or
digging into it more deeply?
Thanks,
Janna