I am now able to reliably reproduce the behaviour with dig querying BIND 9.10.4-P1 (not 9.9, apparently) with "prefetch 0”:
$ while true; do dig outlook.office365.com +noauthority +noadditional +tries=1 +retry=0; sleep 0.1; done Wait for 5 minutes, once the TTL expires, this should show about 5-7 SERVFAIL responses. prefetch 1 or 2 makes it harder to reproduce and it only happens (sometimes) on loaded systems. prefetch 10 makes it go away. It never happens after restarting or flushing the cache. And it never happens when querying x seconds _after_ the TTL expired. Could there be an issue processing cached client requests during cache expiry, and since it only happens on 9.10, potentially related to prefetching? > On 16.06.2016, at 10:00, Thomas Sturm <t...@open.ch> wrote: > > Hi, > > We are experiencing strange intermittent issues when resolving > outlook.office365.com, but also with other domains like e.g. amazonaws.com or > snort.org. But let’s choose office365.com as example for now. > outlook.office365.com is a CNAME to lb.geo.office365.com, and office365.com > delegates the geo subdomain to different nameservers; 2 of them are showing > some issues on intodns.com [1] (which may or may not be related to this > problem). > > When querying one of the office365.com nameservers, it correctly delegates, > as far as I understand: > > # dig a lb.geo.office365.com @ns1.msft.net +noadditional +nostats > > ; <<>> DiG 9.10.4 <<>> a lb.geo.office365.com @ns1.msft.net +noadditional > +nostats > ;; global options: +cmd > ;; Got answer: > ;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 37098 > ;; flags: qr rd; QUERY: 1, ANSWER: 0, AUTHORITY: 6, ADDITIONAL: 5 > ;; WARNING: recursion requested but not available > > ;; OPT PSEUDOSECTION: > ; EDNS: version: 0, flags:; udp: 4000 > ;; QUESTION SECTION: > ;lb.geo.office365.com. IN A > > ;; AUTHORITY SECTION: > geo.office365.com. 300 IN NS glb1.glbdns2.microsoft.com. > geo.office365.com. 300 IN NS ns1.p21.dynect.net. > geo.office365.com. 300 IN NS ns3.p21.dynect.net. > geo.office365.com. 300 IN NS ns4.p21.dynect.net. > geo.office365.com. 300 IN NS ns2.p21.dynect.net. > geo.office365.com. 300 IN NS glb2.glbdns2.microsoft.com. > > Still, BIND (sometimes) decides to return SERVFAIL to the client immediately > after receiving this response. Some interesting debug log lines: > > resolver: debug 3: resquery 0x7f26fecc8010 (fctx > 0x7f26fecb4458(lb.geo.office365.com/A)): sent > resolver: debug 3: resquery 0x7f26fecc8010 (fctx > 0x7f26fecb4458(lb.geo.office365.com/A)): response > resolver: debug 10: received packet: > resolver: debug 3: fctx 0x7f26fecb4458(lb.geo.office365.com/A): > noanswer_response > resolver: debug 10: log_ns_ttl: fctx 0x7f26fecb4458: noanswer_response: > lb.geo.office365.com (in 'office365.com'?): 1 172499 > resolver: debug 10: log_ns_ttl: fctx 0x7f26fecb4458: DELEGATION: > lb.geo.office365.com (in 'geo.office365.com'?): 0 172499 > resolver: debug 3: fctx 0x7f26fecb4458(lb.geo.office365.com/A): cache_message > resolver: debug 3: fctx 0x7f26fecb4458(lb.geo.office365.com/A): [result: > success] query canceled in response(); responding > resolver: debug 3: fctx 0x7f26fecb4458(lb.geo.office365.com/A): cancelquery > resolver: debug 3: fctx 0x7f26fecb4458(lb.geo.office365.com/A): nameservers > now above QDOMAIN > resolver: debug 3: fctx 0x7f26fecb4458(lb.geo.office365.com/A): done > resolver: debug 3: fctx 0x7f26fecb4458(lb.geo.office365.com/A): stopeverything > resolver: debug 3: fctx 0x7f26fecb4458(lb.geo.office365.com/A): cancelqueries > resolver: debug 3: fctx 0x7f26fecb4458(lb.geo.office365.com/A): sendevents > client: error: query client=0x7f2700055ca0 thread=0x7f2709813700 > (lb.geo.office365.com/A): query_find: unexpected error after resuming: > SERVFAIL > query-errors: debug 1: client 127.0.0.1#35062 (outlook.office365.com): query > failed (SERVFAIL) for outlook.office365.com/IN/A at query.c:7837 > > “nameservers now above QDOMAIN” sounds like a geo.office365.com nameserver > refers back to an office365.com nameserver? The thing is though, I cannot see > any such response packet in tcpdump. Is this information taken (wrongly) from > cache then? The same log message appears at all times for any of the failing > domains we’ve seen so far. > > Note that this doesn’t seem to happen with an empty cache and we are also not > able to trigger it on a test machine. It only happens on loaded machines once > the cache TTL of the queried record expires. We can reproduce it with the > latest patch levels of both 9.10 and 9.9. > > Regards, > Thomas > > > [1] > http://intodns.com/geo.office365.com_______________________________________________ > Please visit https://lists.isc.org/mailman/listinfo/bind-users to unsubscribe > from this list > > bind-users mailing list > bind-users@lists.isc.org > https://lists.isc.org/mailman/listinfo/bind-users -- thomas sturm principal engineer open systems ag raeffelstrasse 29 ch-8045 zurich t: +41 58 100 10 10 f: +41 58 100 10 11 t...@open.ch http://www.open.ch
smime.p7s
Description: S/MIME cryptographic signature
_______________________________________________ Please visit https://lists.isc.org/mailman/listinfo/bind-users to unsubscribe from this list bind-users mailing list bind-users@lists.isc.org https://lists.isc.org/mailman/listinfo/bind-users