Re: file descriptor exceeds limit
On 17.06.15 22:39, Shawn Zhou wrote:
> BIND on my resolvers reaches the max open file limit and I am getting lots
> of SERVFAILs
> http://pastebin.com/SxRsHLff
>
> After I increased the max-socks (-s 8192) to 8192, I no longer saw the file
> limit error from the log anymore; however, I am still seeing many SERVFAILs.

no other errors?

> Our resolvers were doing about 15k queries per second when this was
> happening and those were legit traffic. I am aware that I am setting
> recursive clients to a very high number. Those resolvers are running on
> 12-core CPU and 24G RAM hardware. CPU utilization was at about 20% and
> plenty of RAM left.
>
> I am wondering if I've reached the limit of BIND for the amount of
> recursive queries it can serve. Any other tunings I should try?

maybe changing number of recursive-clients, max-clients-per-query.

Does EDNS work for you? EDNS problems often result in an increased number of
TCP queries, which slows down resolution ...

> By the way, the resolvers are running RHEL 6.x.

precise BIND version would help a bit more... seems RH6.6 contains 9.8.2 but
that may be different for older RH6 versions.

--
Matus UHLAR - fantomas, uh...@fantomas.sk ; http://www.fantomas.sk/
Warning: I wish NOT to receive e-mail advertising to this address.
LSD will make your ECS screen display 16.7 million colors
___
Please visit https://lists.isc.org/mailman/listinfo/bind-users to unsubscribe from this list

bind-users mailing list
bind-users@lists.isc.org
https://lists.isc.org/mailman/listinfo/bind-users
Re: file descriptor exceeds limit
On 18/06/2015 12:00, Matus UHLAR - fantomas wrote:
> On 17.06.15 22:39, Shawn Zhou wrote:
>> BIND on my resolvers reaches the max open file limit and I am getting lots
>> of SERVFAILs
>> http://pastebin.com/SxRsHLff
>
>> After I increased the max-socks (-s 8192) to 8192, I no longer saw the file
>> limit error from the log anymore; however, I am still seeing many SERVFAILs.
>
> no other errors?
>
>> Our resolvers were doing about 15k queries per second when this was
>> happening and those were legit traffic. I am aware that I am setting
>> recursive clients to a very high number. Those resolvers are running on
>> 12-core CPU and 24G RAM hardware. CPU utilization was at about 20% and
>> plenty of RAM left.
>
>> I am wondering if I've reached the limit of BIND for the amount of
>> recursive queries it can serve. Any other tunings I should try?
>
> maybe changing number of recursive-clients, max-clients-per-query.
>
> Does EDNS work for you? EDNS problems often result in an increased number of
> TCP queries, which slows down resolution ...
>
>> By the way, the resolvers are running RHEL 6.x.
>
> precise BIND version would help a bit more... seems RH6.6 contains 9.8.2 but
> that may be different for older RH6 versions.

Unless you're running a build with --with-tuning=large (for which there
are a number of caveats around the capacity of the machine etc..), then
you don't really want to have a backlog of recursive clients that
exceeds 3000-3500. If you're getting that many in your backlog, then as
already highlighted to you, there is Something Wrong going on.

You're probably running into other resource limits that will be what are
causing the SERVFAIL responses you're still seeing despite increasing
the maximum number of sockets that named can use. I would tune down the
limit to 3000 and allow named to drop the oldest outstanding client
queries when new ones need to be processed.

There is another logging category you can use (query-errors) that can
tell you more, but it's probably not worth it in this instance.

And I have another suggestion for what might be causing your backlog
(apart from problems in the network path between your servers and the
Internet authoritative servers), for which we have some
soon-to-be-released new mitigation features (in 9.10.3):

https://kb.isc.org/article/AA-01178

(this will be updated to reflect the features we will actually include
in the upcoming release - but they're essentially going to be
fetches-per-server and fetches-per-zone along with improved
logging/stats for both of those)

There's going to be a webinar about both the problem and the mitigations
on July 8th:

https://www.facebook.com/events/100311766979499/

http://goo.gl/Z8idQf

Hoping that this is useful?

Cathy
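For readers following along, here is a minimal named.conf sketch of the suggestions in the message above. The recursive-clients value of 3000 comes from the message itself; the channel name, log file path and rotation settings are assumptions made up for illustration, and the fetches-per-server / fetches-per-zone numbers are placeholders rather than recommendations. Those two options are left commented out because, per the message, they are only expected in 9.10.3.

```
options {
    // Cap the recursive client backlog, as suggested in the thread.
    recursive-clients 3000;

    // Only for releases that ship the new limits (9.10.3 per the message);
    // the values are placeholders, not tuning advice.
    // fetches-per-server 200;
    // fetches-per-zone 100;
};

logging {
    // Hypothetical channel name and file path.
    channel query_errors_log {
        file "/var/log/named/query-errors.log" versions 3 size 20m;
        // query-errors detail is emitted at debug levels, so named
        // likely needs to be in debugging mode (e.g. via "rndc trace")
        // before anything shows up here.
        severity dynamic;
        print-time yes;
    };
    category query-errors { query_errors_log; };
};
```

Treat this as a starting point to adapt, not a tested configuration; running named-checkconf on the result before reloading is a sensible precaution.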
Re: file descriptor exceeds limit
Inline...responding to each of these including Cathy's soon (thanks to the
community for the responses). Following with interest as we've seen this
for a while, though we are possibly a special case which I'll describe more
in another response.

On 6/18/15, 7:00 AM, "Matus UHLAR - fantomas" wrote:
>On 17.06.15 22:39, Shawn Zhou wrote:
>>BIND on my resolvers reaches the max open file limit and I am getting lots
>>of SERVFAILs
>>http://pastebin.com/SxRsHLff
>
>>After I increased the max-socks (-s 8192) to 8192, I no longer saw the file
>>limit error from the log anymore; however, I am still seeing many SERVFAILs.
>
>no other errors?

When we've dug into it (really, the investigation is ongoing) we don't
notice anything "abnormal". That means there are plenty of things being
logged, but nothing you don't always see in the modern world of broken DNS
servers, firewalls, network path, etc.

>>Our resolvers were doing about 15k queries per second when this was
>>happening and those were legit traffic. I am aware that I am setting
>>recursive clients to a very high number. Those resolvers are running on
>>12-core CPU and 24G RAM hardware. CPU utilization was at about 20% and
>>plenty of RAM left.
>
>>I am wondering if I've reached the limit of BIND for the amount of
>>recursive queries it can serve. Any other tunings I should try?
>
>maybe changing number of recursive-clients, max-clients-per-query.

Have tweaked all these repeatedly, first following community best practice
and then going for the sky (big iron) just to see what impact it had. None
really.

>Does EDNS work for you? EDNS problems often result in an increased number of
>TCP queries, which slows down resolution ...

Yeah, works fine and passes all tests (manual digs, OARC, etc).

>>By the way, the resolvers are running RHEL 6.x.
>
>precise BIND version would help a bit more... seems RH6.6 contains 9.8.2 but
>that may be different for older RH6 versions.

We're running centos 6.x, but use the latest BIND 9.9.x releases.
RE: file descriptor exceeds limit
Just wondering. You mention you're using RHEL6; are you also getting messages in 'dmesg' about connection tracking tables being full? You may need some 'NOTRACK' rules in your iptables.

STUART BROWNE
Senior Unix Administrator, Network Administrator, Database Admin
www.bomboratech.com.au
Re: file descriptor exceeds limit
On 6/18/15, 7:09 PM, "Stuart Browne" wrote:
>Just wondering. You mention you're using RHEL6; are you also getting
>messages in 'dmesg' about connection tracking tables being full? You may
>need some 'NOTRACK' rules in your iptables.

Just following along, for the record...

On our side, iptables is completely disabled. We do that sort of thing
upstream on dedicated firewalls.

Just now getting time to reply to Cathy...more detail on that there.
Re: file descriptor exceeds limit
Inline...

On 6/18/15, 9:22 AM, "Cathy Almond" wrote:
>On 18/06/2015 12:00, Matus UHLAR - fantomas wrote:
>> On 17.06.15 22:39, Shawn Zhou wrote:
>>> BIND on my resolvers reaches the max open file limit and I am getting lots
>>> of SERVFAILs
>>> http://pastebin.com/SxRsHLff
>>
>>> After I increased the max-socks (-s 8192) to 8192, I no longer saw the file
>>> limit error from the log anymore; however, I am still seeing many SERVFAILs.
>>
>> no other errors?
>>
>>> Our resolvers were doing about 15k queries per second when this was
>>> happening and those were legit traffic. I am aware that I am setting
>>> recursive clients to a very high number. Those resolvers are running on
>>> 12-core CPU and 24G RAM hardware. CPU utilization was at about 20% and
>>> plenty of RAM left.
>>
>>> I am wondering if I've reached the limit of BIND for the amount of
>>> recursive queries it can serve. Any other tunings I should try?
>>
>> maybe changing number of recursive-clients, max-clients-per-query.
>>
>> Does EDNS work for you? EDNS problems often result in an increased number of
>> TCP queries, which slows down resolution ...
>>
>>> By the way, the resolvers are running RHEL 6.x.
>>
>> precise BIND version would help a bit more... seems RH6.6 contains 9.8.2 but
>> that may be different for older RH6 versions.
>
>Unless you're running a build with --with-tuning=large (for which there
>are a number of caveats around the capacity of the machine etc..), then
>you don't really want to have a backlog of recursive clients that
>exceeds 3000-3500. If you're getting that many in your backlog, then as
>already highlighted to you, there is Something Wrong going on.

We're running --with-tuning=large, but I think we are OK (128GB RAM, 32
cores). If there are other caveats to be aware of, please share.

For years I kept recursive clients conservatively set (based on some of
your docs, and community comments). I finally raised it much higher just to
see what would happen (after having to repeatedly explain why blindly
increasing that number wasn't a good thing), and it had no effect one way
or another. Still got the servfails.

We are in a somewhat unique situation, because we have batch type jobs
generating rules/etc which often purposefully crawl the "bad" parts of the
'Net and in turn generate DNS requests for things which legitimately return
servfail. However, we were getting increasingly consistent complaints from
users about seeing servfails where they weren't expected.

The biggest thing which helped for us was increasing DISC_SOCKET_MAXEVENTS.
We're still digging to see if the remaining servfail reports are genuinely
something we can tune around, or a symptom of the use case.

>You're probably running into other resource limits that will be what are
>causing the SERVFAIL responses you're still seeing despite increasing
>the maximum number of sockets that named can use. I would tune down the
>limit to 3000 and allow named to drop the oldest outstanding client
>queries when new ones need to be processed.

I'm going to crank this back down in our environments.

>There is another logging category you can use (query-errors) that can
>tell you more, but it's probably not worth it in this instance.
>
>And I have another suggestion for what might be causing your backlog
>(apart from problems in the network path between your servers and the
>Internet authoritative servers), for which we have some
>soon-to-be-released new mitigation features (in 9.10.3):
>
>https://kb.isc.org/article/AA-01178
>
>(this will be updated to reflect the features we will actually include
>in the upcoming release - but they're essentially going to be
>fetches-per-server and fetches-per-zone along with improved
>logging/stats for both of those)
>
>There's going to be a webinar about both the problem and the mitigations
>on July 8th:
>
>https://www.facebook.com/events/100311766979499/
>
>http://goo.gl/Z8idQf

Looking forward to this. We've been sticking to 9.9.x (currently running
9.9.7) as an ESV release, but maybe 9.10 makes sense. Not sure how the
community feels about that?

For the record I've spent a lot of time with our network team looking at
firewall logs, getting packet traces, etc and not found any smoking guns.
We have a perhaps not so unique setup where the caches are in a DMZ, so
clients talk through a firewall, and the DNS servers talk through a
firewall. I've identified and fixed a number of issues along the
way...enumerating here in case it helps anyone else.

The internal firewall was oversubscribed, and at peak times would reset
connections, causing clients to retry, which quickly wound up recursive
clients. Replaced those firewalls, and that specific behavior got a lot
better.

The external firewall was sharing a PAT for all caches, which eventually
exhausted 65k ports. Can't drop these direct on the 'Net for security
reasons, but now have 1-to-1 NAT per cache and haven't seen this exact b
dnssec validation issue
I have multiple centos6 boxes running 9.10.2-P1, and almost everything
looks good. However, one box seems to not be doing dnssec validation. It
is possible that this behavior predates the latest updates and I just
never noticed it.

A and B have essentially identical configuration, except that A is the
master for some zones, and B is the slave pulling from A. Other than
that, the /etc/named.conf is identical. A also has ipv6 connectivity,
and B does not. The authoritative side works nicely on both. The
recursive resolver is where the difference shows up.

On A:

    dig www.dnssec-failed.org @localhost
    ;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 19813
    ;; flags: qr rd ra; QUERY: 1, ANSWER: 2, AUTHORITY: 5, ADDITIONAL: 11
    ;; ANSWER SECTION:
    www.dnssec-failed.org.  7178  IN  A  68.87.109.242
    www.dnssec-failed.org.  7178  IN  A  69.252.193.191

On B:

    dig www.dnssec-failed.org @localhost
    ;; ->>HEADER<<- opcode: QUERY, status: SERVFAIL, id: 4969
    ;; flags: qr rd ra; QUERY: 1, ANSWER: 0, AUTHORITY: 0, ADDITIONAL: 1

/etc/named.conf:

    options {
        directory "/var/named";
        allow-recursion { "friends"; };
        dnssec-enable yes;
        dnssec-validation yes;
        bindkeys-file "/etc/named.iscdlv.key";
        managed-keys-directory "/var/named/dynamic";
        listen-on-v6 { any; };
        ixfr-from-differences yes;
        max-journal-size 2m;
        notify yes;
        response-policy { zone "rpz.five-ten-sg.com"; } qname-wait-recurse no;
        filter-aaaa-on-v4 yes;
        filter-aaaa { "brokenv6"; };
        rate-limit {
            responses-per-second 5;
            errors-per-second 5;
            nxdomains-per-second 40;
            qps-scale 300;
            exempt-clients { "friends"; };
        };
    };

A is neither master nor slave for dnssec-failed.org, and that domain is
not mentioned in the rpz zone.
Re: dnssec validation issue
In message <1434674101.18744.119.ca...@ns.five-ten-sg.com>, Carl Byington writes:
> I have multiple centos6 boxes running 9.10.2-P1, and almost everything
> looks good. However, one box seems to not be doing dnssec validation. It
> is possible that this behavior predates the latest updates and I just
> never noticed it.
>
> A and B have essentially identical configuration, except that A is the
> master for some zones, and B is the slave pulling from A. Other than
> that, the /etc/named.conf is identical. A also has ipv6 connectivity,
> and B does not. The authoritative side works nicely on both. The
> recursive resolver is where the difference shows up.
>
> On A:
>
> dig www.dnssec-failed.org @localhost
> ;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 19813
> ;; flags: qr rd ra; QUERY: 1, ANSWER: 2, AUTHORITY: 5, ADDITIONAL: 11
> ;; ANSWER SECTION:
> www.dnssec-failed.org.  7178  IN  A  68.87.109.242
> www.dnssec-failed.org.  7178  IN  A  69.252.193.191
>
> On B:
>
> dig www.dnssec-failed.org @localhost
> ;; ->>HEADER<<- opcode: QUERY, status: SERVFAIL, id: 4969
> ;; flags: qr rd ra; QUERY: 1, ANSWER: 0, AUTHORITY: 0, ADDITIONAL: 1

You don't have any trust anchors active.
To use the keys in "/etc/named.iscdlv.key" set "dnssec-validation auto;"

> /etc/named.conf:
>
> options {
>     directory "/var/named";
>     allow-recursion { "friends"; };
>     dnssec-enable yes;
>     dnssec-validation yes;
>     bindkeys-file "/etc/named.iscdlv.key";
>     managed-keys-directory "/var/named/dynamic";
>     listen-on-v6 { any; };
>     ixfr-from-differences yes;
>     max-journal-size 2m;
>     notify yes;
>     response-policy { zone "rpz.five-ten-sg.com"; } qname-wait-recurse no;
>     filter-aaaa-on-v4 yes;
>     filter-aaaa { "brokenv6"; };
>     rate-limit {
>         responses-per-second 5;
>         errors-per-second 5;
>         nxdomains-per-second 40;
>         qps-scale 300;
>         exempt-clients { "friends"; };
>     };
> };
>
> A is neither master nor slave for dnssec-failed.org, and that domain is
> not mentioned in the rpz zone.

--
Mark Andrews, ISC
1 Seymour St., Dundas Valley, NSW 2117, Australia
PHONE: +61 2 9871 4742 INTERNET: ma...@isc.org
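The suggested fix can be sketched as a one-line change to the options block quoted above. This is an abbreviated illustration, not the poster's final file: only the dnssec-validation line differs from the posted config, and the remaining options are assumed to stay as they were. The reasoning is that "yes" validates only against trust anchors configured explicitly (for example with trusted-keys or managed-keys), while "auto" also loads the default trust anchor from the bindkeys file.

```
options {
    directory "/var/named";
    dnssec-enable yes;
    // Was "yes", which relies on explicitly configured trust anchors.
    dnssec-validation auto;
    bindkeys-file "/etc/named.iscdlv.key";
    managed-keys-directory "/var/named/dynamic";
    // ... remaining options unchanged ...
};
```

After a reload, a validating resolver should answer `dig www.dnssec-failed.org @localhost` with SERVFAIL, since that domain is deliberately signed incorrectly.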
Re: dnssec validation issue
On Fri, 2015-06-19 at 11:10 +1000, Mark Andrews wrote:
> You don't have any trust anchors active.
> To use the keys in "/etc/named.iscdlv.key" set "dnssec-validation
> auto;"

Thanks!!

New centos rpms at http://www.five-ten-sg.com/mapper/bind with a default
named.conf that should actually work.
Re: dnssec validation issue
On Thu, Jun 18, 2015 at 07:26:28PM -0700, Carl Byington wrote:
> On Fri, 2015-06-19 at 11:10 +1000, Mark Andrews wrote:
> > To use the keys in "/etc/named.iscdlv.key" set "dnssec-validation
> > auto;"
> New centos rpms at http://www.five-ten-sg.com/mapper/bind with a default
> named.conf that should actually work.

With the root zone and most TLDs signed, I do not think it makes sense to
use DLV anymore. While a typical DNSSEC resolver configuration has DLV
enabled, I personally make the effort to disable it.

--
Eray
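As a rough illustration of the point about DLV, the fragment below shows what turning it off might look like. It assumes the packaged named.conf enables DLV with a dnssec-lookaside statement, which the config posted earlier in the thread does not show, so treat that as an assumption rather than a description of anyone's actual setup.

```
options {
    dnssec-enable yes;
    // Validate from the root trust anchor.
    dnssec-validation auto;
    // Do not consult a DLV registry.
    dnssec-lookaside no;
};
```

Zones reachable through a signed chain from the root validate the same way with or without this change; only zones that relied solely on a DLV entry would stop validating.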