On Wed, May 15, 2019 at 3:57 PM Adam Urban <adam.ur...@appleguru.org> wrote:
>
> We have an application where we are using sendmsg() to send (lots of)
> UDP packets to multiple destinations over a single socket, repeatedly,
> and at a pretty constant rate using IPv4.
>
> In some cases, some of these destinations are no longer present on the
> network, but we continue sending data to them anyways. The missing
> devices are usually a temporary situation, but can last for
> days/weeks/months.
>
> We are seeing an issue where packets sent even to destinations that
> are present on the network are getting dropped while the kernel
> performs arp updates.
>
> We see a -1 EAGAIN (Resource temporarily unavailable) return value
> from the sendmsg() call when this is happening:
>
> sendmsg(72, {msg_name(16)={sa_family=AF_INET, sin_port=htons(1234),
> sin_addr=inet_addr("10.1.2.3")}, msg_iov(1)=[{"\4\1"..., 96}],
> msg_controllen=0, msg_flags=0}, MSG_NOSIGNAL) = -1 EAGAIN (Resource
> temporarily unavailable)
>
> Looking at packet captures, during this time you see the kernel arping
> for the devices that aren't on the network, timing out, arping again,
> timing out, and then finally arping a 3rd time before setting the
> INCOMPLETE state again (very briefly being in a FAILED state).
>
> "Good" packets don't start going out again until the 3rd timeout
> happens, and then they go out for about 1s until the 3s delay from ARP
> happens again.
>
> Interestingly, this isn't an all or nothing situation. With only a few
> (2-3) devices missing, we don't run into this "blocking" situation and
> data always goes out. But once 4 or more devices are missing, it
> happens. Setting static ARP entries for the missing supplies, even if
> they are bogus, resolves the issue, but of course results in packets
> with a bogus destination going out on the wire instead of getting
> dropped by the kernel.
>
> Can anyone explain why this is happening? I have tried tuning the
> unres_qlen sysctl without effect and will next try to set the
> MSG_DONTWAIT socket option to try and see if that helps. But I want to
> make sure I understand what is going on.
>
> Are there any parameters we can tune so that UDP packets sent to
> INCOMPLETE destinations are immediately dropped? What's the best way
> to prevent a socket from being unavailable while arp operations are
> happening (assuming arp is the cause)?

Sounds like you are hitting the SO_SNDBUF limit because datagrams are
being held on the neighbor queue while ARP resolution is pending. Those
queued datagrams still count against the socket's send buffer, which
would also explain why the issue occurs only once the number of
unreachable destinations exceeds some threshold.

Does /proc/net/stat/arp_cache show unresolved_discards? (ndisc_cache is
the IPv6 counterpart; you are sending over IPv4.)

Increasing unres_qlen may only make matters worse if more datagrams can
get queued. See also the branch on NUD_INCOMPLETE in __neigh_event_send.
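
In case it helps with checking that counter, here is a minimal sketch that sums one column of the neighbor statistics table across CPUs. It assumes the usual layout of the /proc/net/stat/*_cache files (one header row of column names, then one row of hexadecimal values per CPU); the function names are just illustrative, not anything from the kernel tree.

```python
def sum_neigh_counter(text, name="unresolved_discards"):
    """Sum the per-CPU counter `name` from a neighbor stats table.

    Assumed layout: first non-empty line holds the column names, each
    following line holds one CPU's values in hexadecimal.
    Intended for per-CPU counters, not for the `entries` column.
    """
    rows = [line.split() for line in text.splitlines() if line.strip()]
    if not rows:
        raise ValueError("empty input")
    header, values = rows[0], rows[1:]
    if name not in header:
        raise KeyError(name)
    col = header.index(name)
    return sum(int(row[col], 16) for row in values)


def read_neigh_counter(path="/proc/net/stat/arp_cache",
                       name="unresolved_discards"):
    """Read the table at `path` (IPv4 ARP by default) and sum `name`."""
    with open(path) as f:
        return sum_neigh_counter(f.read(), name)
```

Calling read_neigh_counter() twice a few seconds apart while the EAGAIN errors are occurring should show whether the counter is increasing.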