I am now able to reliably reproduce the behaviour with dig querying BIND 
9.10.4-P1 (not 9.9, apparently) with "prefetch 0”:

$ while true; do dig outlook.office365.com +noauthority +noadditional +tries=1 
+retry=0; sleep 0.1; done

Wait for 5 minutes, once the TTL expires, this should show about 5-7 SERVFAIL 
responses. 

prefetch 1 or 2 makes it harder to reproduce and it only happens (sometimes) on 
loaded systems. prefetch 10 makes it go away. 

It never happens after restarting or flushing the cache. And it never happens 
when querying x seconds _after_ the TTL expired. Could there be an issue 
processing cached client requests during cache expiry, and since it only 
happens on 9.10, potentially related to prefetching?



> On 16.06.2016, at 10:00, Thomas Sturm <t...@open.ch> wrote:
> 
> Hi,
> 
> We are experiencing strange intermittent issues when resolving 
> outlook.office365.com, but also with other domains like e.g. amazonaws.com or 
> snort.org. But let’s choose office365.com as example for now. 
> outlook.office365.com is a CNAME to lb.geo.office365.com, and office365.com 
> delegates the geo subdomain to different nameservers; 2 of them are showing 
> some issues on intodns.com [1] (which may or may not be related to this 
> problem). 
> 
> When querying one of the office365.com nameservers, it correctly delegates, 
> as far as I understand:
> 
> # dig a lb.geo.office365.com @ns1.msft.net +noadditional +nostats
> 
> ; <<>> DiG 9.10.4 <<>> a lb.geo.office365.com @ns1.msft.net +noadditional 
> +nostats
> ;; global options: +cmd
> ;; Got answer:
> ;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 37098
> ;; flags: qr rd; QUERY: 1, ANSWER: 0, AUTHORITY: 6, ADDITIONAL: 5
> ;; WARNING: recursion requested but not available
> 
> ;; OPT PSEUDOSECTION:
> ; EDNS: version: 0, flags:; udp: 4000
> ;; QUESTION SECTION:
> ;lb.geo.office365.com.                IN      A
> 
> ;; AUTHORITY SECTION:
> geo.office365.com.    300     IN      NS      glb1.glbdns2.microsoft.com.
> geo.office365.com.    300     IN      NS      ns1.p21.dynect.net.
> geo.office365.com.    300     IN      NS      ns3.p21.dynect.net.
> geo.office365.com.    300     IN      NS      ns4.p21.dynect.net.
> geo.office365.com.    300     IN      NS      ns2.p21.dynect.net.
> geo.office365.com.    300     IN      NS      glb2.glbdns2.microsoft.com.
> 
> Still, BIND (sometimes) decides to return SERVFAIL to the client immediately 
> after receiving this response. Some interesting debug log lines:
> 
> resolver: debug 3: resquery 0x7f26fecc8010 (fctx 
> 0x7f26fecb4458(lb.geo.office365.com/A)): sent
> resolver: debug 3: resquery 0x7f26fecc8010 (fctx 
> 0x7f26fecb4458(lb.geo.office365.com/A)): response
> resolver: debug 10: received packet:
> resolver: debug 3: fctx 0x7f26fecb4458(lb.geo.office365.com/A): 
> noanswer_response
> resolver: debug 10: log_ns_ttl: fctx 0x7f26fecb4458: noanswer_response: 
> lb.geo.office365.com (in 'office365.com'?): 1 172499
> resolver: debug 10: log_ns_ttl: fctx 0x7f26fecb4458: DELEGATION: 
> lb.geo.office365.com (in 'geo.office365.com'?): 0 172499
> resolver: debug 3: fctx 0x7f26fecb4458(lb.geo.office365.com/A): cache_message
> resolver: debug 3: fctx 0x7f26fecb4458(lb.geo.office365.com/A): [result: 
> success] query canceled in response();  responding
> resolver: debug 3: fctx 0x7f26fecb4458(lb.geo.office365.com/A): cancelquery
> resolver: debug 3: fctx 0x7f26fecb4458(lb.geo.office365.com/A): nameservers 
> now above QDOMAIN
> resolver: debug 3: fctx 0x7f26fecb4458(lb.geo.office365.com/A): done
> resolver: debug 3: fctx 0x7f26fecb4458(lb.geo.office365.com/A): stopeverything
> resolver: debug 3: fctx 0x7f26fecb4458(lb.geo.office365.com/A): cancelqueries
> resolver: debug 3: fctx 0x7f26fecb4458(lb.geo.office365.com/A): sendevents
> client: error: query client=0x7f2700055ca0 thread=0x7f2709813700 
> (lb.geo.office365.com/A): query_find: unexpected error after resuming: 
> SERVFAIL
> query-errors: debug 1: client 127.0.0.1#35062 (outlook.office365.com): query 
> failed (SERVFAIL) for outlook.office365.com/IN/A at query.c:7837
> 
> “nameservers now above QDOMAIN” sounds like a geo.office365.com nameserver 
> refers back to an office365.com nameserver? The thing is though, I cannot see 
> any such response packet in tcpdump. Is this information taken (wrongly) from 
> cache then? The same log message appears at all times for any of the failing 
> domains we’ve seen so far. 
> 
> Note that this doesn’t seem to happen with an empty cache and we are also not 
> able to trigger it on a test machine. It only happens on loaded machines once 
> the cache TTL of the queried record expires. We can reproduce it with the 
> latest patch levels of both 9.10 and 9.9. 
> 
> Regards,
> Thomas
> 
> 
> [1] 
> http://intodns.com/geo.office365.com_______________________________________________
> Please visit https://lists.isc.org/mailman/listinfo/bind-users to unsubscribe 
> from this list
> 
> bind-users mailing list
> bind-users@lists.isc.org
> https://lists.isc.org/mailman/listinfo/bind-users


-- 
thomas sturm
principal engineer

open systems ag
raeffelstrasse 29
ch-8045 zurich
t: +41 58 100 10 10
f: +41 58 100 10 11

t...@open.ch

http://www.open.ch

Attachment: smime.p7s
Description: S/MIME cryptographic signature

_______________________________________________
Please visit https://lists.isc.org/mailman/listinfo/bind-users to unsubscribe 
from this list

bind-users mailing list
bind-users@lists.isc.org
https://lists.isc.org/mailman/listinfo/bind-users

Reply via email to