Hi,

Do not read "the number of devices" as "the number of servers". Any device that has a MAC address and is connected to one of your node's local networks counts as a device. For example, if your BMC ports (iLO, iDRAC, etc.) are connected to one of your nodes' networks, that doubles the number of devices.

To test the ARP issue, you can keep pinging from the slurmctld server to the problematic node, and from the problematic node to the slurmctld server. Continuous pinging will keep the node in the ARP table.
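While the pings run, you can also inspect the neighbour cache directly to see whether entries are being evicted. A minimal sketch with iproute2 (the 10.17.6.66 address is just the one from Bill's example further down; substitute your own node):

```shell
# Show the cached neighbour (ARP) entry for one peer, including its
# state (REACHABLE, STALE, FAILED, ...).
ip neigh show 10.17.6.66

# Count all IPv4 neighbour entries; compare this against the kernel's
# gc_thresh limits (gc_thresh3 caps the table at 1024 by default).
ip -4 neigh show | wc -l
```

If the count sits near a threshold, or the entry for a "down" peer shows FAILED while the host is actually reachable, the cache is a likely suspect.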

Ahmet M.


On 15.05.2019 11:54, Bill Broadley wrote:
On 5/15/19 12:34 AM, Barbara Krašovec wrote:
It could be a problem with ARP cache.

If the number of devices approaches 512, there is a kernel limitation in dynamic
ARP-cache size and it can result in the loss of connectivity between nodes.
We have 162 compute nodes, a dozen or so file servers, head node, transfer node,
and not much else.  Despite significant tinkering I never got a DNS lookup
(forward or reverse), ping, nslookup, dig, ssh, or telnet to the
slurmd/slurmctld ports to fail.

All while /var/log/syslog was complaining that DNS wasn't working to the same 20
nodes.

From what I can tell slurm builds a tree of hosts to check, and a compute node checks on 20 or so hosts.  From what I can tell slurmd caches the DNS results (which is good), but also caches the DNS non-results.  So even while I'm logged into a node verifying that both DNS servers can look up all the down hosts, forward and backward, syslog is still complaining often about failures in DNS lookups.

What's worse, this still caused problems even when that node was put in drain mode.  All 20+ hosts (of 160) would bounce between online (alloc/idle) and offline (alloc*/idle*).  If one got unlucky with a few failures in a row, the node would time out, be marked down, and all its jobs killed.

This is with slurm 18.08.7 that I compiled for Ubuntu LTS 18.04.

The garbage collector will not run if the number of entries in the cache is less than 128 (the default gc_thresh1):
I checked the problematic host (the one that frequently complained that 20 hosts
had no DNS) and it had 116 arp entries.

[ snipped much useful sysctl info ]

Or just insert in /etc/sysctl.conf
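The relevant knobs are the neighbour-table GC thresholds; a sketch of what such entries might look like (the raised values below are illustrative, not the ones from the snipped mail):

```
# /etc/sysctl.conf -- raise the kernel's ARP/neighbour cache limits.
# Kernel defaults are 128 / 512 / 1024; the values below are
# illustrative examples for a larger network.
net.ipv4.neigh.default.gc_thresh1 = 1024
net.ipv4.neigh.default.gc_thresh2 = 2048
net.ipv4.neigh.default.gc_thresh3 = 4096
```

Apply with sysctl -p.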
Many thanks, useful stuff that I'll keep in my notes.  In this case though I
think the slurm "tree" is improperly caching the absence of DNS records.

I checked for a single host and:
bigmem1# cat /var/log/syslog | grep c6-66 | grep "May 14" | wc -l
51
root@bigmem1:/var/log/slurm-llnl# cat /var/log/syslog | grep c6-66 | grep "May 14" | tail -1
May 14 23:30:22 bigmem1 slurmd[46951]: error: forward_thread: can't find address for host c6-66, check slurm.conf
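To see how widespread this is across all hosts rather than just c6-66, a quick sketch that tallies the failing hosts (the LOG variable is mine, not from the original commands):

```shell
# Tally "can't find address" errors per host in a syslog-style log.
# LOG defaults to the path from the thread; point it wherever your
# node logs slurmd errors.
LOG="${LOG:-/var/log/syslog}"
grep "can't find address" "$LOG" 2>/dev/null \
  | grep -o "host [^,]*" \
  | sort | uniq -c | sort -rn
```

A cluster of hosts all provisioned at the same time showing similar counts would support the negative-caching theory below.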

So despite having /etc/resolv.conf point directly to two name servers that could look up c6-66 -> 10.17.6.66 and 10.17.6.66 -> c6-66, it kept telling the slurm controller that c6-66 didn't exist.  During that time bigmem1 could ssh, telnet, dig, and nslookup to c6-66.

I suspect bigmem1 was assigned the slurm node-check tree last Wednesday when we provisioned those nodes.  The entries might well have been put into slurm before they were put into DNS (managed by cobbler).  bigmem1 then cached those negative records and has kept informing the slurm controller ever since that the nodes didn't exist.

A reboot of bigmem1 fixed the problem.


