Re: BIND and UDP tuning
Hello again,

On Mon, 1 Oct 2018, Alex wrote:

> > Are your requests being dropped by the service(s)?
> >
> > (Or: are you inadvertently abusing the said service(s)?)
>
> I don't believe so - oftentimes a follow-up host query succeeds without issue. It's also failing for invaluement and spamhaus, both of which we subscribe to.
> [...]
> It also tends to happen in bulk - there may be 25 SERVFAILs within the same second, then nothing for another few minutes.

Hmmm. If it isn't the modem and it isn't the BLs then it more or less has to be the service, no?

I'd be tempted by Mr. Clegg's suggestion to spin up a VPS somewhere with a decent connection, which will at least offload a lot of retries. Talk to it through OpenVPN, which is very easy to set up, and it can (a) put the VPS on your LAN, (b) take much of the unreliability out of the presumably unreliable connection between you and the VPS and (c) write very verbose logs if you wish. On occasion on unreliable connections I've had to use TCP for the VPN link, but UDP is the norm - OpenVPN has its own ways of dealing with lost packets.

Then you'll probably have a whole new can of worms to investigate, but the worms will definitely tell you something. :)

--
73,
Ged.
___
Please visit https://lists.isc.org/mailman/listinfo/bind-users to unsubscribe from this list

bind-users mailing list
bind-users@lists.isc.org
https://lists.isc.org/mailman/listinfo/bind-users
Re: BIND and UDP tuning
Alex wrote on 9/30/2018 7:27 PM:
> Hi,
>
> On Sun, Sep 30, 2018 at 1:19 PM @lbutlr wrote:
>> On 30 Sep 2018, at 09:59, Alex wrote:
>>> It also tends to happen in bulk - there may be 25 SERVFAILs within the same second, then nothing for another few minutes.
>> That really makes it seem like either your modem or your ISP is interfering somehow, or is simply not able to keep up.
> I'm leaning towards that, too. The problem persists even when using the provider's DNS servers. I thought for sure I'd see some verifiable info from other people having problems with cable, such as from dslreports, etc, but there really hasn't been anything. The comment made about DOCSIS earlier in this thread was helpful.
>
> Do you believe it could be impacting all data, not just bind/DNS/UDP?
>
> Do people not generally use cable as even a fallback link for secondary services? I figured it was because there's no SLA, not because it doesn't work well with many protocols. I'd imagine services like Netflix and youtube don't have problems because they 1) don't require a lot of DNS traffic and 2) http is a really simple protocol and 3) the link is probably engineered to be used for that?

Overall it probably depends on volume and application. Cable works well as a transport, but is not the same as DSL, ethernet, or GPON. If you have the need to send 500+ pps, then cable may not meet your needs.

If you are running a high volume mail server you probably do need to run a local resolver to query services like SpamHaus, SORBS, and others, because their terms of service and the rate limiting they apply would prevent you from using your upstream provider's DNS servers or a public DNS service like Google/Quad9/1.1.1.1. I would, however, recommend that you ensure your system has at least 2 resolvers configured in /etc/resolv.conf. If the first (local resolver) fails to resolve a query, then your system should retry the second server before giving up and returning a failure to Postfix.
Again, if you're using free RBL services, that second resolver may need to be one of your own and not one shared with other folks.

The occasional timeout might delay email, but should not prevent SMTP from functioning because A) DNS timeouts are considered to be a temporary error, and B) the default behavior of SMTP is to queue and retry if there is a timeout or temporary failure.

Another angle to look at the problem from: if you believe the network can't handle more than a query volume of X, reduce your query volume below X and see if this resolves your issue. I operate dozens of email servers and they do not generate the query volume you describe. Perhaps you are querying too many RBLs and it may pay to be more selective. I find SpamHaus and SpamCop to be the best two RBLs. If you want to pick another one or two, that seems reasonable. I would not recommend more RBLs within Postfix.

--Blake
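A minimal sketch of the "try the second resolver before giving up" behavior described above, written against the Python standard library only. The function names (`build_query`, `rcode_of`, `query_with_fallback`) are hypothetical and the resolver addresses in the usage note are placeholders; this is a simplification for illustration, not how the system stub resolver is actually implemented.

```python
import socket
import struct

SERVFAIL = 2  # DNS RCODE for "server failure"


def build_query(name, qtype=1, query_id=0x1234):
    """Build a minimal DNS query: RD flag set, one question, class IN."""
    header = struct.pack("!HHHHHH", query_id, 0x0100, 1, 0, 0, 0)
    labels = b"".join(
        bytes([len(label)]) + label.encode("ascii")
        for label in name.rstrip(".").split(".")
    )
    return header + labels + b"\x00" + struct.pack("!HH", qtype, 1)


def rcode_of(response):
    """Return the RCODE, the low four bits of the header flags word."""
    (flags,) = struct.unpack("!H", response[2:4])
    return flags & 0x000F


def query_with_fallback(name, resolvers, timeout=2.0):
    """Try each resolver in order; skip timeouts and SERVFAIL answers."""
    packet = build_query(name)
    for address in resolvers:
        with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
            sock.settimeout(timeout)
            try:
                sock.sendto(packet, (address, 53))
                response, _ = sock.recvfrom(4096)
            except OSError:  # includes socket timeouts
                continue
        # Accept only a well-formed reply to our query that is not SERVFAIL.
        if (len(response) >= 12 and response[:2] == packet[:2]
                and rcode_of(response) != SERVFAIL):
            return address, response
    return None
```

Calling `query_with_fallback("mail.example.com", ["127.0.0.1", "192.0.2.53"])` (both addresses are placeholders) returns the first resolver that produced a usable answer, or `None` if every resolver timed out or returned SERVFAIL.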
Re: BIND and UDP tuning
On 9/30/18, Alex wrote:
> Hi,
>
> On Sun, Sep 30, 2018 at 1:19 PM @lbutlr wrote:
>>
>> On 30 Sep 2018, at 09:59, Alex wrote:
>> > It also tends to happen in bulk - there may be 25 SERVFAILs within the
>> > same second, then nothing for another few minutes.
>>
>> That really makes it seem like either your modem or your ISP is interfering
>> somehow, or is simply not able to keep up.
>
> I'm leaning towards that, too. The problem persists even when using
> the provider's DNS servers.

Is this a personal project or can you get help from the network staff & open trouble tickets with the various providers?

I'm making a big guess here, but you mentioned dnsbl.sorbs.net earlier so

$ dig dnsbl.sorbs.net.
<.. snip ..>
;; ANSWER SECTION:
dnsbl.sorbs.net.    86400   IN  A   113.52.8.154
dnsbl.sorbs.net.    86400   IN  A   113.52.8.155
dnsbl.sorbs.net.    86400   IN  A   208.43.139.188
dnsbl.sorbs.net.    86400   IN  A   113.52.8.153
dnsbl.sorbs.net.    86400   IN  A   208.43.110.204

go here: https://wq.apnic.net/apnic-bin/whois.pl
and search for 113.52.8.154, which gives me

inetnum:    113.52.8.0 - 113.52.8.255
netname:    DIGITALSENSE
descr:      Digital Sense, Data Centres, Brisbane, Colocation
country:    AU

On the other hand
https://whois.arin.net/rest/net/NET-208-43-0-0-1/pft?s=208.43.139.188
gives me

City            Dallas
State/Province  TX

If this is a packet drop issue as well as a personal project, you might be stuck with figuring out just how fast you can send queries before things start to break and adjusting your setup accordingly.

> I thought for sure I'd see some verifiable
> info from other people having problems with cable, such as from
> dslreports, etc, but there really hasn't been anything. The comment
> made about DOCSIS earlier in this thread was helpful.
>
> Do you believe it could be impacting all data, not just bind/DNS/UDP?
>
> Do people not generally use cable as even a fallback link for
> secondary services? I figured it was because there's no SLA, not
> because it doesn't work well with many protocols.
I think it's more of a you-get-what-you-pay-for thing. "Business class" cable costs more & might even be provisioned better, but at least the first question you get when calling support isn't "have you tried turning it off and on?"

wrt your earlier
> I attempted to search github for query.c line 8580
there's probably a github answer; I went to https://ftp.isc.org/isc/bind9/, found my release and downloaded the BIND-xxx.tar.gz source code file.

It'd be nice if ISC made "no response to a query" a separate error vs. lumping it in with all the other "Something has gone wrong." possibilities.

Lee
Re: BIND and UDP tuning
Hi,

> > It also tends to happen in bulk - there may be 25 SERVFAILs within
> > the same second, then nothing for another few minutes.
>
> Hmmm. If it isn't the modem and it isn't the BLs then it more or less
> has to be the service, no?

Yes, most likely, but I was looking for more definitive proof that the circuit wasn't doing what it should be (or at least, what I expect). I also wasn't sure if it was a tuning issue (network, bind, server itself, etc).

> I'd be tempted by Mr. Clegg's suggestion to spin up a VPS somewhere
> with decent connection, which will at least offload a lot of retries.

I built an encrypted tunnel using socat to a VPS with a decent connection, and the bind SERVFAIL messages almost entirely went away. The remaining ones seem to be actual SERVFAIL problems.

> Then you'll probably have a whole new can of worms to investigate, but
> the worms will definitely tell you something. :)

Yeah, socat isn't a good permanent solution. Looks like I'll get libreswan going. Building a VPN for a specific port/service is a little more difficult, I believe.
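One way to get firmer evidence of the "25 SERVFAILs within the same second" pattern is to bucket the SERVFAIL log lines by second and compare before and after the tunnel. The sketch below assumes named's own log-file timestamp format (`01-Oct-2018 21:25:52.976 ...`, as quoted elsewhere in this archive); the function names and the burst threshold are hypothetical, and syslog-formatted lines would need a different timestamp split.

```python
from collections import Counter


def servfails_per_second(lines):
    """Count 'query failed (SERVFAIL)' log lines per one-second bucket."""
    buckets = Counter()
    for line in lines:
        if "query failed (SERVFAIL)" not in line:
            continue
        parts = line.split()
        if len(parts) < 2:
            continue
        date, time_of_day = parts[0], parts[1]
        # Drop the fractional part so all lines in one second share a key.
        buckets[f"{date} {time_of_day.split('.')[0]}"] += 1
    return buckets


def bursts(buckets, threshold=10):
    """Return the seconds whose SERVFAIL count reaches the threshold."""
    return {second: n for second, n in buckets.items() if n >= threshold}
```

Feeding it the lines of a named log file (for example `servfails_per_second(open(path))`, with `path` being wherever your logging channel writes) gives a per-second count that can be compared across link changes.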
Re: BIND and UDP tuning
Hi,

On Mon, Oct 1, 2018 at 9:58 AM Blake Hudson wrote:
>
> Alex wrote on 9/30/2018 7:27 PM:
> > Hi,
> >
> > On Sun, Sep 30, 2018 at 1:19 PM @lbutlr wrote:
> >> On 30 Sep 2018, at 09:59, Alex wrote:
> >>> It also tends to happen in bulk - there may be 25 SERVFAILs within the
> >>> same second, then nothing for another few minutes.
> >> That really makes it seem like either your modem or your ISP is interfering
> >> somehow, or is simply not able to keep up.
> > I'm leaning towards that, too. The problem persists even when using
> > the provider's DNS servers. I thought for sure I'd see some verifiable
> > info from other people having problems with cable, such as from
> > dslreports, etc, but there really hasn't been anything. The comment
> > made about DOCSIS earlier in this thread was helpful.
> >
> > Do you believe it could be impacting all data, not just bind/DNS/UDP?
> >
> > Do people not generally use cable as even a fallback link for
> > secondary services? I figured it was because there's no SLA, not
> > because it doesn't work well with many protocols. I'd imagine services
> > like Netflix and youtube don't have problems because they 1) don't
> > require a lot of DNS traffic and 2) http is a really simple protocol
> > and 3) the link is probably engineered to be used for that?
>
> Overall it probably depends on volume and application. Cable works well
> as a transport, but is not the same as DSL, ethernet, or GPON. If you
> have the need to send 500+ pps, then Cable may not meet your needs.

I believe I said as many as 500 qps, but I believe that's wrong. It's more like a sustained 200 q/s.

> If you are running a high volume mail server you probably do need to run
> a local resolver to query services like SpamHaus, SORBs, and others due
> to the terms of service of these services and the rate limiting that

Yes, doing all of that. That's why I'm posting to the bind-users list.
For RBLs, I'm using invaluement (amazing), spamhaus, spamcop, sorbs, senderscore and barracuda.

> they apply which would prevent you from using your upstream provider's
> DNS servers or a public DNS service like Google/Quad9/1.1.1.1. I would,
> however, recommend that you ensure your system has at least 2 resolvers
> configured in /etc/resolv.conf. If the first (local resolver) fails to
> resolve a query, then your system should retry the second server before

That turned out to be a key factor in this. I managed to get rid of most of the SERVFAIL bind errors after tunneling them through socat temporarily, but there were still a few others. I thought by using just one entry in /etc/resolv.conf, it would force all to go through there, but apparently some were dropped(?). It wasn't until I added another resolver on a local network (also on that cable connection) that the 'Name service error' postfix errors really stopped.

> The occasional timeout might delay email, but should not prevent SMTP
> from functioning because A) DNS timeouts are considered to be a
> temporary error, and B) the default behavior of SMTP is to queue and

It doesn't prevent the email from being delivered, but the RBL queries time out and consequently don't get consulted, perhaps allowing email to pass that otherwise shouldn't have.

> retry if there is a timeout or temporary failure. Another angle to look
> at the problem from is if you believe the network can't handle more than
> X query volume, reduce your query volume below X to see if this resolves
> your issue. I operate dozens of email servers and they do not generate
> the query volume you describe. Perhaps you are querying too many RBLs

I've also experimented with QoS, prioritizing interactive traffic like DNS, and it appears to help, but I don't believe it's a bandwidth issue. The errors also sometimes happen when processing only a few emails.
For a while I thought it couldn't be a bandwidth issue because it's a 165/35 Mbit link, and we have 10 Mbit ethernet links where it doesn't ever happen with otherwise very similar configurations, but now I know (or am pretty sure) it's because of how the cable (DOCSIS?) link is designed...
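As a rough back-of-the-envelope check on the numbers in this message: a DNSBL lookup is an ordinary DNS query for the client's IPv4 address with its octets reversed, prepended to the list's zone. The sketch below builds such a name and estimates what message rate a sustained 200 q/s would correspond to with the six lists named above. It assumes one lookup per list per message and no cache hits, which is almost certainly too simple for a real filter; both function names are hypothetical.

```python
import ipaddress


def dnsbl_query_name(ip, zone):
    """Reverse the IPv4 octets and append the list's zone."""
    octets = str(ipaddress.IPv4Address(ip)).split(".")
    return ".".join(reversed(octets)) + "." + zone


def messages_per_second(query_rate, lists, lookups_per_list=1):
    """Message rate implied by a DNS query rate, ignoring cache hits."""
    return query_rate / (lists * lookups_per_list)


# 192.0.2.99 is a documentation address; dnsbl.sorbs.net is the zone
# mentioned earlier in the thread.
print(dnsbl_query_name("192.0.2.99", "dnsbl.sorbs.net"))
print(round(messages_per_second(200, 6), 1))
```

Under those assumptions, 200 q/s across six lists works out to roughly 33 messages per second; if a filter issues several lookups per list (URIs, relays in Received headers), the same query rate corresponds to far fewer messages.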
Re: BIND and UDP tuning
Hi Alex,

On Mon, 1 Oct 2018 12:51:46 -0400 Alex wrote:
> I believe I said as many as 500 qps, but I believe that's wrong. It's
> more like a sustained 200 q/s.

One other thing you might double check is whether any consumer equipment (cable modem, router) has a firewall setting that could be interfering.

My newest router came with a built-in DDoS protection feature, which caused me some difficulty with UDP applications until I disabled it. The default threshold for UDP was something like 200 or 300 pps. The manual isn't clear on how the "protection" works, but I assume it starts dropping packets on the floor when the threshold is exceeded. I turned off that feature and the problem went away.

Apologies if you've already looked into this; long thread and I'm jumping in late.

-s
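Since the manual does not say how the "protection" works, the following is purely a hypothetical model: a fixed one-second window in which every UDP packet beyond the threshold is dropped. The function name and the model itself are assumptions, not a description of any real router. It only illustrates why a link whose average rate is well under the threshold can still lose packets in a bursty second.

```python
import math
from collections import Counter


def drops_by_second(timestamps, threshold_pps):
    """Packets over the per-second threshold, keyed by whole second.

    `timestamps` are packet send times in seconds (floats). Seconds that
    stay at or below the threshold are omitted from the result.
    """
    per_second = Counter(math.floor(t) for t in timestamps)
    return {
        second: count - threshold_pps
        for second, count in sorted(per_second.items())
        if count > threshold_pps
    }


# 250 packets in second 0, then 50 packets in second 1: the two-second
# average is 150 pps, yet second 0 exceeds a 200 pps threshold.
burst = [i / 1000 for i in range(250)] + [1 + i / 100 for i in range(50)]
print(drops_by_second(burst, 200))
```

If something like this is in the path, it would be consistent with failures arriving in clumps within a single second rather than spread evenly, though that alone does not prove it.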
Re: Forward type "only" no longer working (possibly a regression)?
You need to get to the root cause of what's causing the SERVFAIL.

Removing "forward only" just masks the problem by telling the resolver algorithm to fail over to iterative resolution when it encounters the SERVFAIL. So, sure, the query resolves, but probably not from the correct server, and it probably returns the wrong result. The SERVFAIL is the proximate cause, and until you solve that, your forwarding to that server isn't going to work.

Why this behavior changed from BIND 9.10.x to 9.13.x eludes me. It's a clue to help you get to the root cause of the SERVFAIL, but only a clue. More diagnosis and testing is required.

- Kevin

On Sat, Sep 29, 2018 at 7:45 PM Karol Babioch wrote:

> Hi,
>
> after upgrading my bind installation from 9.10.0 to 9.13.3 I'm encountering issues with zones that are forwarded. My setup is somewhat complicated, but I was able to simplify it, so hopefully the explanation is easy to follow.
>
> Basically I have a split horizon DNS, so on my local resolver I'm forwarding specific requests to an internal authoritative nameserver.
>
> The named.conf looks somewhat like this:
>
> > options {
> >     listen-on port 53 { 127.0.0.1; 10.24.0.1; };
> >     listen-on-v6 port 53 { ::1; };
> >     directory "/var/named";
> >     pid-file "/run/named/named.pid";
> >     recursion yes;
> >     allow-query { localhost; 10.24.0.0/16; };
> > };
> >
> > include "/etc/named.rfc1912.zones";
> > include "/etc/named.root.zone";
> >
> > zone "babioch.de" IN {
> >     type forward;
> >     forward only;
> >     forwarders { 10.24.0.10; };
> > };
>
> This used to work fine before the upgrade, but it fails now. When using this resolver, I'm running into the following issue:
>
> > dig @127.0.0.1 mail.babioch.de
> >
> > ; <<>> DiG 9.13.3 <<>> @127.0.0.1 mail.babioch.de
> > ; (1 server found)
> > ;; global options: +cmd
> > ;; Got answer:
> > ;; ->>HEADER<<- opcode: QUERY, status: SERVFAIL, id: 28826
> > ;; flags: qr rd ra; QUERY: 1, ANSWER: 0, AUTHORITY: 0, ADDITIONAL: 1
> >
> > ;; OPT PSEUDOSECTION:
> > ; EDNS: version: 0, flags:; udp: 4096
> > ; COOKIE: bfe2507f6ca6291e360aae925bb00ba55dc000437399d823 (good)
> > ;; QUESTION SECTION:
> > ;mail.babioch.de.               IN      A
> >
> > ;; Query time: 1 msec
> > ;; SERVER: 127.0.0.1#53(127.0.0.1)
> > ;; WHEN: So Sep 30 01:32:53 CEST 2018
> > ;; MSG SIZE  rcvd: 72
>
> As you can see the status is "SERVFAIL" and no response is returned. The query log for this contains the following information:
>
> > Sep 30 01:33:31 kvm1.babioch.de named[16298]: client @0x7ff0c40ad670 127.0.0.1#51022 (mail.babioch.de): query: mail.babioch.de IN A +E(0)K (127.0.0.1)
> > Sep 30 01:33:31 kvm1.babioch.de named[16298]: client @0x7ff0c40ad670 127.0.0.1#51022 (mail.babioch.de): query failed (SERVFAIL) for mail.babioch.de/IN/A at query.c:10672
>
> The line in question is handling stale answers [1]. I'm not entirely sure how this applies to my use-case, since nothing should be stale here.
>
> Interestingly enough I can get it working when I remove the "forward only" directive from my configuration. It then looks like this:
>
> > dig @127.0.0.1 mail.babioch.de
> >
> > ; <<>> DiG 9.13.3 <<>> @127.0.0.1 mail.babioch.de
> > ; (1 server found)
> > ;; global options: +cmd
> > ;; Got answer:
> > ;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 35694
> > ;; flags: qr rd ra; QUERY: 1, ANSWER: 1, AUTHORITY: 0, ADDITIONAL: 1
> >
> > ;; OPT PSEUDOSECTION:
> > ; EDNS: version: 0, flags:; udp: 4096
> > ; COOKIE: 85a469a4288546b7b55ea9a65bb00cc92c9846104f77fdab (good)
> > ;; QUESTION SECTION:
> > ;mail.babioch.de.               IN      A
> >
> > ;; ANSWER SECTION:
> > mail.babioch.de.        300     IN      A       10.24.0.20
> >
> > ;; Query time: 129 msec
> > ;; SERVER: 127.0.0.1#53(127.0.0.1)
> > ;; WHEN: So Sep 30 01:37:45 CEST 2018
> > ;; MSG SIZE  rcvd: 88
>
> I definitely want the "forward only" directive to make sure that only nameservers specified in the "forwarders" directive are contacted - in all cases. This no longer seems to be possible. I couldn't find any description of this in the change log, so this seems to be a bug and/or regression to me.
>
> What do you think? Can anyone verify this? Am I missing or misunderstanding something here?
>
> Thank you very much!
>
> Best regards,
> Karol Babioch
>
> [1]: https://gitlab.isc.org/isc-projects/bind9/blob/v9_13_3/lib/ns/query.c#L10672
Re: Forward type "only" no longer working (possibly a regression)?
Hi Kevin,

thanks for your reply.

On 01.10.18 at 20:46, Kevin Darcy wrote:
> Removing "forward only" just masks the problem by telling the resolver
> algorithm to fail over to iterative resolution when it encounters the
> SERVFAIL.

You are probably right, but at least it chooses the correct nameserver / interface to use for the iterative resolution. When I don't specify any forwarders at all, it will resolve using the public nameserver; with the forwarders specified it will use the internal nameserver. So it does work (at least it returns the correct results).

> Why this behavior changed from BIND 9.10.x to 9.13.x eludes me. It's a
> clue to help you get to the root cause of the SERVFAIL, but only a clue.
> More diagnosis and testing is required.

Do you have any suggestion / recommendation what I can do to narrow the problem down? I already tried to increase the tracing and enabled query logging, but I couldn't get to the bottom of things. What else can I do here?

Best regards,
Karol Babioch
Re: Forward type "only" no longer working (possibly a regression)?
Hi,

On 01.10.18 at 21:10, Karol Babioch wrote:
> Do you have any suggestion / recommendation what I can do to narrow the
> problem down? I already tried to increase the tracing and enabled query
> logging, but I couldn't get to the bottom of things. What else can I do
> here?

as an additional data point, this is what I get with debugging (level 9):

> 01-Oct-2018 21:25:52.976 query-errors: debug 1: client @0x7f89401d4c10 10.24.0.1#51206 (mail.babioch.de): view internal: query failed (SERVFAIL) for mail.babioch.de/IN/A at query.c:10672
> 01-Oct-2018 21:25:52.976 query-errors: debug 2: fetch completed at resolver.c:9094 for mail.babioch.de/A in 0.030641: SERVFAIL/success [domain:babioch.de,referral:2,restart:0,qrysent:0,timeout:0,lame:0,quota:0,neterr:0,badresp:0,adberr:0,findfail:0,valfail:0,qminsteps:2]

I really don't get it; the fetch completes just fine according to this (SERVFAIL/success). Also the querylog does not indicate what the issue is:

> Okt 01 21:30:53 kvm1.babioch.de named[17380]: client @0x7f15e8056140 10.24.0.1#58354 (mail.babioch.de): view internal: query: mail.babioch.de IN A +E(0)K (10.24.0.1)
> Okt 01 21:30:53 kvm1.babioch.de named[17380]: client @0x7f15e8056140 10.24.0.1#58354 (mail.babioch.de): view internal: query failed (SERVFAIL) for mail.babioch.de/IN/A at query.c:10672

Can one of you BIND gurus see what's wrong here? What else can/should I try? I'm pretty much out of ideas for now.

Thank you very much in advance!

Best regards,
Karol Babioch
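The bracketed counter list at the end of the "fetch completed" line is easier to read once split into fields. A small sketch follows; the function name is hypothetical and the interpretation of individual counters is an assumption based on their names, not on BIND documentation. Read that way, `qrysent:0` would suggest the fetch ended without a single query being sent to the forwarder, which may be worth confirming with a packet capture on the path to 10.24.0.10.

```python
import re


def parse_fetch_counters(log_line):
    """Split the trailing [key:value,...] list of a fetch log line."""
    match = re.search(r"\[([^\]]*)\]", log_line)
    if match is None:
        return {}
    counters = {}
    for field in match.group(1).split(","):
        key, _, value = field.strip().partition(":")
        # Numeric counters become ints; everything else stays a string.
        counters[key] = int(value) if value.isdigit() else value
    return counters


line = (
    "01-Oct-2018 21:25:52.976 query-errors: debug 2: fetch completed at "
    "resolver.c:9094 for mail.babioch.de/A in 0.030641: SERVFAIL/success "
    "[domain:babioch.de,referral:2,restart:0,qrysent:0,timeout:0,lame:0,"
    "quota:0,neterr:0,badresp:0,adberr:0,findfail:0,valfail:0,qminsteps:2]"
)
counters = parse_fetch_counters(line)
nonzero = {k: v for k, v in counters.items() if isinstance(v, int) and v}
print(nonzero)
```

Listing only the non-zero counters narrows the line down to `referral` and `qminsteps`, which are the fields to look up in the resolver source for this release.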
Re: Forward type "only" no longer working (possibly a regression)?
One of the most useful initial steps in troubleshooting is to establish your ability to reproduce the error. So, I'd look at getting to the command-line of the originating resolver, if possible, and using a command-line tool like "dig" to generate queries towards the intended target and see if you get the same SERVFAIL result.

In order to *exactly* replicate the queries, however, you need to understand what "+E(0)K" in the log means. That's recursion-desired (the default for a query generated by a command-line tool like "dig"), EDNS0 and a DNSCOOKIE requested. Supposedly, modern versions of "dig" will set EDNS0 and DNSCOOKIE by default, so you might be lucky and a straight "dig" with no special options will replicate the error. If not, you may need to get your hands on a more modern version of "dig", or use another tool.

Once you've replicated the error, then start changing things. I'd start with turning EDNS0 and/or DNSCOOKIE on and off. Both of those are relatively "modern" extensions to DNS (at least, compared to the "classic" DNS of RFCs 1034 and 1035) and it's possible that the responding server just doesn't deal with them properly.

With EDNS0, there are different buffer sizes that could, hypothetically, be tried. But, unless you've tuned that specifically in named.conf, it should be the case that the "dig" default is the same as the "named" one, so it's unlikely that changing buffer size will produce any change in behavior. It's possible, I suppose...

If you can't get any change of behavior by twiddling those things, then one would have to delve deeper. But I won't make this post any longer than it already is :-) That should be enough to get you started...

- Kevin

On Mon, Oct 1, 2018 at 3:34 PM Karol Babioch wrote:
> [...]
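The two steps above (read the query-log flags, then replay the query with EDNS0 and the cookie toggled) can be sketched as follows. The decoder covers only the flags explained in this message and reports anything else as unrecognised; `decode_query_flags` and `dig_variants` are hypothetical helper names. The generated commands use dig's `+edns=0`, `+noedns`, `+cookie` and `+nocookie` options and target the forwarder from the posted configuration (10.24.0.10); they are only printed here, not executed.

```python
import re

FLAG_MEANINGS = {
    "+": "recursion desired",
    "-": "recursion not desired",
    "K": "DNS COOKIE present",
}


def decode_query_flags(flags):
    """Translate a query-log flag string such as '+E(0)K' into words."""
    meanings = []
    for token in re.findall(r"E\(\d+\)|.", flags):
        if token.startswith("E("):
            meanings.append(f"EDNS version {token[2:-1]}")
        else:
            meanings.append(
                FLAG_MEANINGS.get(token, f"unrecognised flag {token!r}")
            )
    return meanings


def dig_variants(server, name):
    """Build dig command lines with EDNS0 and COOKIE switched on and off."""
    base = ["dig", f"@{server}", name]
    return [
        base + ["+edns=0", "+cookie"],     # what the log line shows
        base + ["+edns=0", "+nocookie"],   # EDNS0 without a cookie
        base + ["+noedns", "+nocookie"],   # classic DNS, no OPT record
    ]


print(decode_query_flags("+E(0)K"))
for command in dig_variants("10.24.0.10", "mail.babioch.de"):
    print(" ".join(command))
```

Running the three printed commands from the resolver host and noting which ones return SERVFAIL shows whether the forwarder itself mishandles either extension, independently of named.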