On 5/15/19 12:34 AM, Barbara Krašovec wrote:
> It could be a problem with ARP cache.
>
> If the number of devices approaches 512, there is a kernel limitation in
> dynamic ARP-cache size and it can result in the loss of connectivity
> between nodes.
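(In case it helps anyone searching the archives later: I believe the limits
Barbara is describing are the kernel neighbour-table GC thresholds, which
default to 128/512/1024. A rough way to check where a node stands, assuming
those defaults, is:

  ip -4 neigh | wc -l
  sysctl net.ipv4.neigh.default.gc_thresh1 \
         net.ipv4.neigh.default.gc_thresh2 \
         net.ipv4.neigh.default.gc_thresh3

and raising them is just a matter of putting larger values for those three
keys in /etc/sysctl.conf and running sysctl -p.)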
We have 162 compute nodes, a dozen or so file servers, a head node, a
transfer node, and not much else. Despite significant tinkering I never got
a DNS lookup (forward or reverse), ping, nslookup, dig, ssh, or telnet to
the slurmd/slurmctld ports to fail, all while /var/log/syslog was
complaining that DNS wasn't working for the same 20 nodes.

From what I can tell, slurm builds a tree of hosts to check, and a compute
node ends up checking on 20 or so hosts. It looks like slurmd caches the
DNS results (which is good), but it also caches the DNS non-results. So
even while I'm logged into the node, verifying against both DNS servers
that all the "down" hosts resolve forward and backwards, syslog is still
complaining regularly about failed DNS lookups. What's worse, this still
caused problems even when that node was put in drain mode. So all 20+ hosts
(of 160) would bounce between online (alloc/idle) and offline
(alloc*/idle*). If a node got unlucky and had a few failures in a row it
would time out, be marked down, and all its jobs would be killed.

This is with slurm 18.08.7 that I compiled for Ubuntu LTS 18.04.

> The garbage collector will run if the number of entries in the cache is
> less than 128, by default:

I checked the problematic host (the one that frequently complained that 20
hosts had no DNS) and it had 116 arp entries.

[ snipped much useful sysctl info ]

> Or just insert in /etc/sysctl.conf

Many thanks, useful stuff that I'll keep in my notes. In this case though I
think the slurm "tree" is improperly caching the absence of DNS records. I
checked for a single host:

bigmem1# cat /var/log/syslog | grep c6-66 | grep "May 14" | wc -l
51
root@bigmem1:/var/log/slurm-llnl# cat /var/log/syslog | grep c6-66 | grep "May 14" | tail -1
May 14 23:30:22 bigmem1 slurmd[46951]: error: forward_thread: can't find address for host c6-66, check slurm.conf

So despite /etc/resolv.conf pointing directly at two name servers that
could resolve c6-66 -> 10.17.6.66 and 10.17.6.66 -> c6-66, it kept telling
the slurm controller that c6-66 didn't exist. During that time bigmem1
could ssh, telnet, dig, and nslookup to c6-66.

I suspect bigmem1 was assigned its slurm node-check tree last Wednesday
when we provisioned those nodes. The entries may well have been put into
slurm before they were put into DNS (managed by cobbler). bigmem1 then kept
those negative records cached since Wednesday and kept informing the slurm
controller that the nodes didn't exist. A reboot of bigmem1 fixed the
problem.
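(In hindsight, since the stale entries appear to live in the slurmd process
itself, a plain restart of the daemon would presumably have been enough,
e.g.:

  systemctl restart slurmd    # or however slurmd is launched on your nodes

but I only tried the full reboot.)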