How does the max-recursion-queries counter interact with DNSSEC validation and RPZ validation? Are the queries for these checks included in the max-recursion-queries count or are they in a separate queue?
Why I am asking: I've been running my tests of the new code and getting a few hits on domains that I think would resolve without these limits. I have more work to do to determine whether these domains failed to resolve because of some momentary network or external DNS resolution issue, or because of something related to these new thresholds (or, in the case of 4842b.y.dotnxdomain.net, Cricket out registering new domains just to mess with us).

My test is ~5 million unique A record lookups that I'm pushing to a test server at ~300 q/s. It has a lengthy RPZ enabled and DNSSEC validation on. After reading this thread, I'm flushing the cache every 30 minutes.

I'm getting a handful of these messages. Some are just broken domains, but a handful of them seem to resolve on DNS servers running older BIND code. They do not seem to coincide with the cache clearing.

Dec 9 13:59:11 198.206.x.x named[13525]: exceeded max queries resolving 'growthcentre.org/A'
Dec 9 14:15:05 198.206.x.x named[13525]: exceeded max queries resolving 'megadeth.rockmetal.art.pl/A'
Dec 9 14:22:33 198.206.x.x named[13525]: exceeded max queries resolving 'ns3.iplay.net/A'
Dec 10 03:18:54 198.206.x.x named[13525]: exceeded max queries resolving '4842b.y.dotnxdomain.net/DNSKEY'
Dec 10 03:59:02 198.206.x.x named[13525]: exceeded max queries resolving 'dsl-188-34-202-200.asretelecom.net/A'
Dec 10 03:59:03 198.206.x.x named[13525]: exceeded max queries resolving 'ns1.asretelecom.com/A'
Dec 10 08:19:15 198.206.x.x named[13525]: exceeded max queries resolving 'knurow.eu.org/A'
Dec 10 08:27:36 198.206.x.x named[13525]: exceeded max queries resolving 'lb.z.optimix.asia/NS'
Dec 10 08:31:04 198.206.x.x named[13525]: exceeded max queries resolving 'NS4-AUTH.ALLTEL.NET/A'

David A. Evans
Enterprise IP/DNS Management
Network Infrastructure Tools and Services

From: Evan Hunt <e...@isc.org>
To: Stuart Henderson <s...@spacehopper.org>
Cc: Tony Finch <d...@dotat.at>, bind-users@lists.isc.org
Date: 12/09/2014 01:41 PM
Subject: Re: Problem with BIND 9.10.1-P1 recursion limits
Sent by: bind-users-boun...@lists.isc.org

On Tue, Dec 09, 2014 at 05:51:58PM +0000, Evan Hunt wrote:
> That's unexpected. I'll see if I can reproduce it.

Okay, I can. Part of the problem is the somewhat crazypants DNS configuration of www.ibm.com:

$ dig +noall +answer www.ibm.com
www.ibm.com. 3600 IN CNAME www.ibm.com.cs186.net.
www.ibm.com.cs186.net. 60 IN CNAME china-cdn.san.ibm.com.edgekey.net.
china-cdn.san.ibm.com.edgekey.net. 21600 IN CNAME china-cdn.san.ibm.com.edgekey.net.globalredir.akadns.net.
china-cdn.san.ibm.com.edgekey.net.globalredir.akadns.net. 900 IN CNAME e7826.x.akamaiedge.net.
e7826.x.akamaiedge.net. 20 IN A 23.59.201.136

... like, *wow*. A chain of five aliases with TTLs ranging from 20 seconds to 6 hours, passing through five different zones (ibm.com, cs186.net, edgekey.net, akadns.net, akamaiedge.net), hosted by servers in three *more* zones (ihost.com, akam.net, and akadns.org, in addition to akadns.net and akamaiedge.net).

I had to almost double the maximum recursion queries to 99 to get this to work on an empty cache. Yikes.

Almost any non-empty cache will dodge the bullet. Preceding the lookup of www.ibm.com with "dig @::1 ns com" causes the query to succeed. Also, as previously noted, on 9.9 it will succeed without a five-minute delay if you just issue the query a second time.

So, possible workarounds if this issue is causing problems for you:

 - Ensure that the first query sent to a newly-primed recursive resolver isn't quite as spectacular as this one;
 - Add "max-recursion-queries 100;" to your options statement;
 - Run 9.9.6-P1 instead of 9.10.1-P1

The five-minute delay is still a bit of a puzzle.
It happens because of this code in adb.c:

        /* XXXMLG Don't pound on bad servers. */
        if (address_type == DNS_ADBFIND_INET) {
                name->expire_v4 = ISC_MIN(name->expire_v4, now + 300);
                name->fetch_err = FIND_ERR_FAILURE;
                inc_stats(adb, dns_resstatscounter_gluefetchv4fail);
        } else {
                name->expire_v6 = ISC_MIN(name->expire_v6, now + 300);
                name->fetch6_err = FIND_ERR_FAILURE;
                inc_stats(adb, dns_resstatscounter_gluefetchv6fail);
        }

The "now + 300" bit is where the five minutes comes from. That's code that's been around for years, and it is in 9.9, but apparently it's reached more easily in 9.10. I'm looking into the reasons for this.

The problem should be addressed in 9.10.2, which is likely to be released next month.

--
Evan Hunt -- e...@isc.org
Internet Systems Consortium, Inc.
_______________________________________________
Please visit https://lists.isc.org/mailman/listinfo/bind-users to unsubscribe from this list

bind-users mailing list
bind-users@lists.isc.org
https://lists.isc.org/mailman/listinfo/bind-users
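
To make the effect of the "now + 300" clamp in the adb.c excerpt concrete, here is a minimal, self-contained sketch of just the expire_v4 update. The names adbname_sketch, mark_v4_fetch_failed, SKETCH_MIN and FAILURE_HOLD_SECS are hypothetical stand-ins invented for illustration; they are not BIND's real types or functions, and the sketch models nothing else in the ADB (no fetch_err, no statistics, no IPv6 branch).

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for ISC_MIN in the excerpt. */
#define SKETCH_MIN(a, b) ((a) < (b) ? (a) : (b))

/* The 300 seconds from "now + 300" in the excerpt. */
#define FAILURE_HOLD_SECS 300u

/* Hypothetical, heavily reduced stand-in for an ADB name entry:
 * only the IPv4 expiry time (seconds) is modelled. */
struct adbname_sketch {
	unsigned int expire_v4;
};

/*
 * Mirror the IPv4 branch of the excerpt: after a failed fetch, the
 * entry's expiry is pulled in so that it is no later than
 * now + 300 seconds. An entry that already expires sooner is left alone.
 */
static void
mark_v4_fetch_failed(struct adbname_sketch *name, unsigned int now) {
	assert(name != NULL);
	name->expire_v4 = SKETCH_MIN(name->expire_v4,
				     now + FAILURE_HOLD_SECS);
}
```

For example, with now = 1000, an entry due to expire at 4600 is pulled in to 1300, while an entry due to expire at 1060 keeps its expiry of 1060. In other words, the clamp only ever shortens the lifetime, to at most five minutes from the failure.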