/proc/net/stat/ndisc_cache appears to show 0 unresolved_discards:
entries,allocs,destroys,hash_grows,lookups,hits,res_failed,rcv_probes_mcast,rcv_probes_ucast,periodic_gc_runs,forced_gc_runs,unresolved_discards,table_fulls
00000005,00000005,00000000,00000000,00000000,00000000,00000000,00000000,00000000,000021af,00000000,00000000,00000000
00000005,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000

On Thu, May 16, 2019 at 10:48 AM Willem de Bruijn <willemdebruijn.ker...@gmail.com> wrote:
>
> On Wed, May 15, 2019 at 3:57 PM Adam Urban <adam.ur...@appleguru.org> wrote:
> >
> > We have an application where we are using sendmsg() to send (lots of)
> > UDP packets to multiple destinations over a single socket, repeatedly,
> > and at a pretty constant rate using IPv4.
> >
> > In some cases, some of these destinations are no longer present on the
> > network, but we continue sending data to them anyways. The missing
> > devices are usually a temporary situation, but can last for
> > days/weeks/months.
> >
> > We are seeing an issue where packets sent even to destinations that
> > are present on the network are getting dropped while the kernel
> > performs arp updates.
> >
> > We see a -1 EAGAIN (Resource temporarily unavailable) return value
> > from the sendmsg() call when this is happening:
> >
> > sendmsg(72, {msg_name(16)={sa_family=AF_INET, sin_port=htons(1234),
> > sin_addr=inet_addr("10.1.2.3")}, msg_iov(1)=[{"\4\1"..., 96}],
> > msg_controllen=0, msg_flags=0}, MSG_NOSIGNAL) = -1 EAGAIN (Resource
> > temporarily unavailable)
> >
> > Looking at packet captures, during this time you see the kernel arping
> > for the devices that aren't on the network, timing out, arping again,
> > timing out, and then finally arping a 3rd time before setting the
> > INCOMPLETE state again (very briefly being in a FAILED state).
> >
> > "Good" packets don't start going out again until the 3rd timeout
> > happens, and then they go out for about 1s until the 3s delay from ARP
> > happens again.
> >
> > Interestingly, this isn't an all or nothing situation. With only a few
> > (2-3) devices missing, we don't run into this "blocking" situation and
> > data always goes out. But once 4 or more devices are missing, it
> > happens. Setting static ARP entries for the missing supplies, even if
> > they are bogus, resolves the issue, but of course results in packets
> > with a bogus destination going out on the wire instead of getting
> > dropped by the kernel.
> >
> > Can anyone explain why this is happening? I have tried tuning the
> > unres_qlen sysctl without effect and will next try to set the
> > MSG_DONTWAIT socket option to try and see if that helps. But I want to
> > make sure I understand what is going on.
> >
> > Are there any parameters we can tune so that UDP packets sent to
> > INCOMPLETE destinations are immediately dropped? What's the best way
> > to prevent a socket from being unavailable while arp operations are
> > happening (assuming arp is the cause)?
>
> Sounds like hitting SO_SNDBUF limit due to datagrams being held on the
> neighbor queue. Especially since the issue occurs only as the number
> of unreachable destinations exceeds some threshold. Does
> /proc/net/stat/ndisc_cache show unresolved_discards? Increasing
> unres_qlen may make matters only worse if more datagrams can get
> queued. See also the branch on NUD_INCOMPLETE in __neigh_event_send.
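
For anyone who wants to read these counters without decoding hex by eye, here is a minimal sketch of a parser for the output pasted at the top of this message. It assumes one header row followed by one row of hexadecimal values per CPU, with fields separated by commas (as pasted here) or by whitespace. The names parse_neigh_stats and total are made up for this illustration and are not part of any kernel tool.

```python
import re

def parse_neigh_stats(text):
    """Turn neighbour-cache stat text into a list of dicts, one per data row.

    Assumption: the first non-empty line is the header and every following
    line holds hexadecimal counters, separated by commas and/or whitespace.
    """
    lines = [ln for ln in text.splitlines() if ln.strip()]

    def split(ln):
        return [tok for tok in re.split(r"[,\s]+", ln.strip()) if tok]

    header = split(lines[0])
    return [dict(zip(header, (int(tok, 16) for tok in split(ln))))
            for ln in lines[1:]]

def total(rows, field):
    """Sum one counter across all rows (i.e. across CPUs)."""
    return sum(row[field] for row in rows)

# The exact output quoted at the top of this message.
SAMPLE = """\
entries,allocs,destroys,hash_grows,lookups,hits,res_failed,rcv_probes_mcast,rcv_probes_ucast,periodic_gc_runs,forced_gc_runs,unresolved_discards,table_fulls
00000005,00000005,00000000,00000000,00000000,00000000,00000000,00000000,00000000,000021af,00000000,00000000,00000000
00000005,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000
"""

rows = parse_neigh_stats(SAMPLE)
print("unresolved_discards:", total(rows, "unresolved_discards"))
print("periodic_gc_runs:", total(rows, "periodic_gc_runs"))
```

On a live system the same function could be fed the contents of the file directly, e.g. parse_neigh_stats(open("/proc/net/stat/ndisc_cache").read()). Since the traffic here is IPv4, it may also be worth pointing it at /proc/net/stat/arp_cache, which as far as I know uses the same column layout.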