On 27/09/2018 16.53, Alex wrote:
> Hi,
>
>>> I reported a few weeks ago that I was experiencing a really high
>>> number of "SERVFAIL" messages in my bind-9.11.4-P1 system running on
>>> fedora28, and I haven't yet found a solution. This is all now running
>>> on a 165/35 cable system.
>>>
>>> I found a program named dropwatch which is showing a significant
>>> number of dropped UDP packets, particularly when there are bursts of
>>> email traffic:
>>>
>>> 12 drops at skb_queue_purge+13 (0xffffffff9f79a0c3)
>>> 1 drops at __udp4_lib_rcv+1e6 (0xffffffff9f83bdf6)
>>> 4 drops at __udp4_lib_rcv+1e6 (0xffffffff9f83bdf6)
>>> 5 drops at nf_hook_slow+a7 (0xffffffff9f7faff7)
>>> 3 drops at sk_stream_kill_queues+48 (0xffffffff9f7a1158)
>>> 3 drops at __udp4_lib_rcv+1e6 (0xffffffff9f83bdf6)
>>> ...
>>>
>>> # netstat -us
>>> ...
>>> Udp:
>>>     23449482 packets received
>>>     1724269 packets to unknown port received
>>>     8248 packet receive errors
>>>     31394909 packets sent
>>>     8243 receive buffer errors
>>>     0 send buffer errors
>>>     InCsumErrors: 5
>>>     IgnoredMulti: 43247
>>>
>>> The SERVFAIL messages don't necessarily correspond to the UDP packet
>>> errors shown by netstat, but the dropwatch output is continuous. The
>>> netstat packet receive errors also don't seem to correspond to
>>> "SERVFAIL" or "Name service" errors:
>>>
>>> 26-Sep-2018 12:42:49.743 query-errors: info: client @0x7fb3c41634d0
>>> 127.0.0.1#44104 (46.36.47.104.wl.mailspike.net): query failed
>>> (SERVFAIL) for 46.36.47.104.wl.mailspike.net/IN/A at
>>> ../../../bin/named/query.c:8580
>>>
>>> Sep 26 12:47:11 mail03 postfix/dnsblog[22821]: warning: dnsblog_query:
>>> lookup error for DNS query 196.91.107.80.bl.spameatingmonkey.net: Host
>>> or domain name not found. Name service error for
>>> name=196.91.107.80.bl.spameatingmonkey.net type=A: Host not found, try
>>> again
>>>
>>> I've been following this thread from some time ago, but nothing I've
>>> done has made a difference.
>>> I really don't know what the buffer sizes should be.
>>> https://urldefense.proofpoint.com/v2/url?u=http-3A__bind-2Dusers-2Dforum.2342410.n4.nabble.com_Tuning-2Dsuggestions-2Dfor-2Dhigh-2Dcore-2Dcount-2DLinux-2Dservers-2Dtd3899.html&d=DwICAg&c=MOptNlVtIETeDALC_lULrw&r=udvvbouEjrWNUMab5xo_vLbUE6LRGu5fmxLhrDvVJS8&m=5XQNuuRQ4kxK03zqoWaJHIdaJvNdsyTKHuFlDKedbpc&s=5Dqhne-5w5V_1coBTBvTITwK2EFeankOegTaofy8S5w&e=
>>>
>>> Are there specific bind tunables you might recommend? edns-udp-size,
>>> perhaps?
>>>
>>> Any ideas on other tunables such as net.core.*mem_default etc?
>> *chuckles to self*
>>
>> I was just referring back to that thread myself to try remember what I did.
>>
>> I ended up tuning the following items:
>>
>> - name: SYSCTL system tuning, basics
>>   sysctl:
>>     name: "{{ item.name }}"
>>     value: "{{ item.value }}"
>>     sysctl_set: yes
>>     state: present
>>   with_items:
>>     - { name: 'vm.swappiness', value: 0 }
>>     - { name: 'net.core.netdev_max_backlog', value: 32768 }
>>     - { name: 'net.core.netdev_budget', value: 2700 }
>>     - { name: 'net.ipv4.tcp_sack', value: 0 }
>>     - { name: 'net.core.somaxconn', value: 2048 }
>>     - { name: 'net.core.rmem_default', value: 16777216 }
>>     - { name: 'net.core.rmem_max', value: 16777216 }
>>     - { name: 'net.core.wmem_default', value: 16777216 }
>>     - { name: 'net.core.wmem_max', value: 16777216 }
> Were you troubleshooting the same problems as I'm experiencing?
>
> Many of these values I've already tweaked and have had no effect on my
> SERVFAIL issues :-(
>
> I've also been following the performance tuning variables in this RH document:
> https://access.redhat.com/sites/default/files/attachments/20150325_network_performance_tuning.pdf
>
> These errors appear to occur in spurts - there is typically ten or
> more in a row at a time, then any number of minutes/seconds before the
> next one.
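
Since the errors come in spurts, one way to check whether they line up with receive-buffer drops might be to sample the kernel's UDP counters every second and compare the timestamps against the named and postfix logs. A minimal Python sketch, assuming a Linux host where /proc/net/snmp carries the same counters that netstat -us prints. The function names, the one-second interval and the sample count are my own choices, and the sample text simply reuses the totals quoted earlier (8243 receive buffer errors out of 23449482 packets received, roughly 0.035%):

```python
import time

# Sample in /proc/net/snmp layout, filled with the netstat -us totals
# quoted earlier in this thread (illustrative, not a fresh measurement).
SAMPLE = """\
Udp: InDatagrams NoPorts InErrors OutDatagrams RcvbufErrors SndbufErrors InCsumErrors IgnoredMulti
Udp: 23449482 1724269 8248 31394909 8243 0 5 43247
"""


def parse_udp_counters(snmp_text):
    """Return the Udp counters from /proc/net/snmp-style text as a dict."""
    rows = [line.split() for line in snmp_text.splitlines()
            if line.startswith("Udp:")]
    if len(rows) < 2:
        raise ValueError("expected a Udp header line and a Udp value line")
    names, values = rows[0][1:], rows[1][1:]
    return {name: int(value) for name, value in zip(names, values)}


def watch_rcvbuf_errors(path="/proc/net/snmp", interval=1.0, samples=60):
    """Print a timestamped line whenever RcvbufErrors grows between samples."""
    with open(path) as f:
        previous = parse_udp_counters(f.read())
    for _ in range(samples):
        time.sleep(interval)
        with open(path) as f:
            current = parse_udp_counters(f.read())
        dropped = current["RcvbufErrors"] - previous["RcvbufErrors"]
        if dropped:
            print(time.strftime("%d-%b-%Y %H:%M:%S"),
                  "RcvbufErrors +%d" % dropped)
        previous = current


counters = parse_udp_counters(SAMPLE)
print("receive buffer errors: %.3f%% of received datagrams"
      % (100.0 * counters["RcvbufErrors"] / counters["InDatagrams"]))
```

Running watch_rcvbuf_errors() on the affected host during a mail burst would show whether the counter jumps at the same moments the SERVFAIL lines appear. If it never moves during a spurt, the socket buffers are probably not where the packets are being lost.
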
>
> It looks like there are periods of as many as 500 queries per second,
> although the usual amount is closer to 200 per second.
>
> I don't believe this is a bind configuration problem, as the "Name
> service error" errors from postfix also occur when testing with
> unbound.
>
> This is also only happening on the two identical systems connected to
> the 165/35mbit cable modem. I've verified with Oponline, and they've
> emphatically asserted there are no problems with the circuit. The
> systems are 8-core Xeon E31240 with 16GB RAM. I've also tried other
> systems, including a 12-core i7 with 32GB.
>
> We have several other systems connected to a 10mbit DIA ethernet
> circuit where these errors don't generally occur. They are also
> similarly configured fedora systems with the same version of bind.
>
> I'm really at a loss as to what the problem(s) are, but feel like it's
> really impacting our ability to query RBLs for processing mail.
>
>> Whilst mentioned in passing on that thread, there was also poking around
>> with TOE, pause, coalesce adaptive and ring size settings (look at ethtool
>> -K, ethtool -A, ethtool -C and ethtool -G), but sadly have lost the specific
>> commands.
> I've also tried configuring the NIC with ethtool according to the
> variables defined in the RH document listed above and have had no
> success.
>
> This really is just a stock system. I can't believe these problems
> would be so elusive or uncommon. Could it have to do with some
> characteristic of the cable circuit itself?

Just a wild thought: It works with a lower speed line (at least I read it
that way) but has problems with higher speeds. Could it be that the line is
so fast that it "overtakes" the host in question?
A faster incoming line will give less time between the packets for processing.

>
> I've also experimented with QoS, using tc to prioritize interactive
> traffic, including tcp and udp port 53, with plenty of bandwidth.
>
> I really hope there is someone with some additional ideas.
> Thanks,
> Alex
> _______________________________________________
> Please visit https://lists.isc.org/mailman/listinfo/bind-users to unsubscribe
> from this list
>
> bind-users mailing list
> bind-users@lists.isc.org
> https://lists.isc.org/mailman/listinfo/bind-users
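
To put rough numbers on the "less time between the packets" thought above, here is a small Python sketch of how long one packet occupies the wire at full line rate on the 10 Mbit and 165 Mbit circuits. The packet sizes are illustrative guesses of mine (small DNS answer, larger answer, full-size frame), not measurements from the affected systems, and framing overhead is ignored:

```python
def wire_time_us(link_mbit, packet_bytes):
    """Microseconds one packet occupies the link at full line rate.

    1 Mbit/s is one bit per microsecond, so dividing the packet's bit
    count by the link speed in Mbit/s gives microseconds directly.
    Link-layer framing overhead is ignored.
    """
    return packet_bytes * 8 / link_mbit


# Link speeds from the thread; packet sizes are assumed for illustration.
for link_mbit in (10, 165):
    for packet_bytes in (100, 512, 1500):
        gap = wire_time_us(link_mbit, packet_bytes)
        print("%3d Mbit/s, %4d-byte packets: %7.1f us apart, up to %7.0f packets/s"
              % (link_mbit, packet_bytes, gap, 1_000_000 / gap))
```

For a 100-byte packet that is 80 microseconds on the 10 Mbit line against just under 5 microseconds on the 165 Mbit line, so a back-to-back burst can arrive about 16 times faster. The average of 200 to 500 queries per second is far below either figure, so this would only matter during short bursts, and whether it explains the drops is still a guess.
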