A fix for this is in review and should be in the next maintenance release.
Mark

In message <16a2cdfd-694d-444a-a760-17c9d7517...@open.ch>, Thomas Sturm writes:
>
> I am now able to reliably reproduce the behaviour with dig querying BIND
> 9.10.4-P1 (not 9.9, apparently) with "prefetch 0":
>
> $ while true; do dig outlook.office365.com +noauthority +noadditional +tries=1 +retry=0; sleep 0.1; done
>
> Wait for 5 minutes, once the TTL expires, this should show about 5-7
> SERVFAIL responses.
>
> prefetch 1 or 2 makes it harder to reproduce and it only happens
> (sometimes) on loaded systems. prefetch 10 makes it go away.
>
> It never happens after restarting or flushing the cache. And it never
> happens when querying x seconds _after_ the TTL expired. Could there be
> an issue processing cached client requests during cache expiry, and since
> it only happens on 9.10, potentially related to prefetching?
>
>
> On 16.06.2016, at 10:00, Thomas Sturm <t...@open.ch> wrote:
> >
> > Hi,
> >
> > We are experiencing strange intermittent issues when resolving
> > outlook.office365.com, but also with other domains like e.g.
> > amazonaws.com or snort.org. But let's choose office365.com as example for
> > now. outlook.office365.com is a CNAME to lb.geo.office365.com, and
> > office365.com delegates the geo subdomain to different nameservers; 2 of
> > them are showing some issues on intodns.com [1] (which may or may not be
> > related to this problem).
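
The shell loop quoted above runs until interrupted and leaves the counting to whoever is watching. A bounded variant that tallies response codes might look like the sketch below. It is only an illustration and not part of the original report: the names `parse_status`, `tally` and `probe` are made up here, and it assumes `dig` is on the PATH and prints its usual `status:` header line.

```python
import re
import subprocess
import time
from collections import Counter

# Matches the rcode in dig's header line, e.g.
# ";; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 37098"
STATUS_RE = re.compile(r"status:\s*([A-Z0-9]+)")


def parse_status(dig_output):
    """Return the rcode from one dig run, or None if no header was printed."""
    match = STATUS_RE.search(dig_output)
    return match.group(1) if match else None


def tally(outputs):
    """Count rcodes over several dig outputs; runs without a header count as 'NONE'."""
    return Counter(parse_status(out) or "NONE" for out in outputs)


def probe(name, rounds=3000, delay=0.1):
    """Run dig `rounds` times with the flags from the report and tally the rcodes.

    The defaults (3000 rounds, 0.1 s apart) are an assumption chosen to span
    roughly the 5 minutes mentioned in the report.
    """
    cmd = ["dig", name, "+noauthority", "+noadditional", "+tries=1", "+retry=0"]
    outputs = []
    for _ in range(rounds):
        done = subprocess.run(cmd, capture_output=True, text=True)
        outputs.append(done.stdout)
        time.sleep(delay)
    return tally(outputs)
```

Calling `probe("outlook.office365.com")` on the affected resolver host and reading the `SERVFAIL` entry of the returned counter gives a number that can be compared across `prefetch` settings.
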
> >
> > When querying one of the office365.com nameservers, it correctly
> > delegates, as far as I understand:
> >
> > # dig a lb.geo.office365.com @ns1.msft.net +noadditional +nostats
> >
> > ; <<>> DiG 9.10.4 <<>> a lb.geo.office365.com @ns1.msft.net +noadditional +nostats
> > ;; global options: +cmd
> > ;; Got answer:
> > ;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 37098
> > ;; flags: qr rd; QUERY: 1, ANSWER: 0, AUTHORITY: 6, ADDITIONAL: 5
> > ;; WARNING: recursion requested but not available
> >
> > ;; OPT PSEUDOSECTION:
> > ; EDNS: version: 0, flags:; udp: 4000
> > ;; QUESTION SECTION:
> > ;lb.geo.office365.com. IN A
> >
> > ;; AUTHORITY SECTION:
> > geo.office365.com. 300 IN NS glb1.glbdns2.microsoft.com.
> > geo.office365.com. 300 IN NS ns1.p21.dynect.net.
> > geo.office365.com. 300 IN NS ns3.p21.dynect.net.
> > geo.office365.com. 300 IN NS ns4.p21.dynect.net.
> > geo.office365.com. 300 IN NS ns2.p21.dynect.net.
> > geo.office365.com. 300 IN NS glb2.glbdns2.microsoft.com.
> >
> > Still, BIND (sometimes) decides to return SERVFAIL to the client
> > immediately after receiving this response.
> > Some interesting debug log lines:
> >
> > resolver: debug 3: resquery 0x7f26fecc8010 (fctx 0x7f26fecb4458(lb.geo.office365.com/A)): sent
> > resolver: debug 3: resquery 0x7f26fecc8010 (fctx 0x7f26fecb4458(lb.geo.office365.com/A)): response
> > resolver: debug 10: received packet:
> > resolver: debug 3: fctx 0x7f26fecb4458(lb.geo.office365.com/A): noanswer_response
> > resolver: debug 10: log_ns_ttl: fctx 0x7f26fecb4458: noanswer_response: lb.geo.office365.com (in 'office365.com'?): 1 172499
> > resolver: debug 10: log_ns_ttl: fctx 0x7f26fecb4458: DELEGATION: lb.geo.office365.com (in 'geo.office365.com'?): 0 172499
> > resolver: debug 3: fctx 0x7f26fecb4458(lb.geo.office365.com/A): cache_message
> > resolver: debug 3: fctx 0x7f26fecb4458(lb.geo.office365.com/A): [result: success] query canceled in response(); responding
> > resolver: debug 3: fctx 0x7f26fecb4458(lb.geo.office365.com/A): cancelquery
> > resolver: debug 3: fctx 0x7f26fecb4458(lb.geo.office365.com/A): nameservers now above QDOMAIN
> > resolver: debug 3: fctx 0x7f26fecb4458(lb.geo.office365.com/A): done
> > resolver: debug 3: fctx 0x7f26fecb4458(lb.geo.office365.com/A): stopeverything
> > resolver: debug 3: fctx 0x7f26fecb4458(lb.geo.office365.com/A): cancelqueries
> > resolver: debug 3: fctx 0x7f26fecb4458(lb.geo.office365.com/A): sendevents
> > client: error: query client=0x7f2700055ca0 thread=0x7f2709813700 (lb.geo.office365.com/A): query_find: unexpected error after resuming: SERVFAIL
> > query-errors: debug 1: client 127.0.0.1#35062 (outlook.office365.com): query failed (SERVFAIL) for outlook.office365.com/IN/A at query.c:7837
> >
> > "nameservers now above QDOMAIN" sounds like a geo.office365.com
> > nameserver refers back to an office365.com nameserver? The thing is
> > though, I cannot see any such response packet in tcpdump. Is this
> > information taken (wrongly) from cache then?
> > The same log message appears at all times for any of the failing
> > domains we've seen so far.
> >
> > Note that this doesn't seem to happen with an empty cache and we are
> > also not able to trigger it on a test machine. It only happens on loaded
> > machines once the cache TTL of the queried record expires. We can
> > reproduce it with the latest patch levels of both 9.10 and 9.9.
> >
> > Regards,
> > Thomas
> >
> >
> > [1] http://intodns.com/geo.office365.com
> > _______________________________________________
> > Please visit https://lists.isc.org/mailman/listinfo/bind-users to
> > unsubscribe from this list
> >
> > bind-users mailing list
> > bind-users@lists.isc.org
> > https://lists.isc.org/mailman/listinfo/bind-users
>
>
> --
> thomas sturm
> principal engineer
>
> open systems ag
> raeffelstrasse 29
> ch-8045 zurich
> t: +41 58 100 10 10
> f: +41 58 100 10 11
>
> t...@open.ch
>
> http://www.open.ch
>
--
Mark Andrews, ISC
1 Seymour St., Dundas Valley, NSW 2117, Australia
PHONE: +61 2 9871 4742                 INTERNET: ma...@isc.org
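
The resolver debug lines quoted in the report all carry the fetch context pointer, so the sequence of steps for one fetch can be pulled out of a busy log mechanically. A minimal sketch follows, assuming the log lines look like the ones quoted in the report; `trace_fetches` and `above_qdomain` are names invented for this example and are not BIND tooling.

```python
import re
from collections import defaultdict

# Matches "fctx 0x...(name/type): message", with or without the extra ")"
# that closes a surrounding "resquery ... (fctx ...)" wrapper.
# Lines such as "log_ns_ttl: fctx 0x...: ..." carry no "(name/type)" part
# and are deliberately skipped.
FCTX_RE = re.compile(r"fctx (0x[0-9a-f]+)\(([^)]+)\)\)?: (.*)$")


def trace_fetches(lines):
    """Group resolver messages by (fetch context pointer, query name/type)."""
    steps = defaultdict(list)
    for line in lines:
        match = FCTX_RE.search(line.rstrip())
        if match:
            pointer, query, message = match.groups()
            steps[(pointer, query)].append(message)
    return dict(steps)


def above_qdomain(steps):
    """Return the fetches that logged 'nameservers now above QDOMAIN'."""
    marker = "nameservers now above QDOMAIN"
    return [key for key, messages in steps.items() if marker in messages]
```

Feeding the open log file to `trace_fetches` and then passing the result to `above_qdomain` lists the queries that hit the same message, which makes it easier to check whether each SERVFAIL seen by clients lines up with one of them.
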