Hi,

We are seeing strange intermittent failures when resolving outlook.office365.com, and also with other domains such as amazonaws.com or snort.org. Let’s use office365.com as the example for now.

outlook.office365.com is a CNAME to lb.geo.office365.com, and office365.com delegates the geo subdomain to a different set of nameservers. Two of them show some issues on intodns.com [1] (which may or may not be related to this problem).
When querying one of the office365.com nameservers, it correctly delegates, as far as I understand:

# dig a lb.geo.office365.com @ns1.msft.net +noadditional +nostats

; <<>> DiG 9.10.4 <<>> a lb.geo.office365.com @ns1.msft.net +noadditional +nostats
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 37098
;; flags: qr rd; QUERY: 1, ANSWER: 0, AUTHORITY: 6, ADDITIONAL: 5
;; WARNING: recursion requested but not available

;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags:; udp: 4000
;; QUESTION SECTION:
;lb.geo.office365.com. IN A

;; AUTHORITY SECTION:
geo.office365.com. 300 IN NS glb1.glbdns2.microsoft.com.
geo.office365.com. 300 IN NS ns1.p21.dynect.net.
geo.office365.com. 300 IN NS ns3.p21.dynect.net.
geo.office365.com. 300 IN NS ns4.p21.dynect.net.
geo.office365.com. 300 IN NS ns2.p21.dynect.net.
geo.office365.com. 300 IN NS glb2.glbdns2.microsoft.com.

Still, BIND (sometimes) decides to return SERVFAIL to the client immediately after receiving this response.
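Because the failure is intermittent, it may help to timestamp each SERVFAIL so it can be lined up with the resolver debug log. The following is only a minimal sketch: the function names dig_status and check_once are made up for illustration, and the default resolver address 127.0.0.1 is an assumption based on the client address seen in the log.

```shell
#!/bin/sh
# Hypothetical helpers, not part of BIND or dig.

# Read dig output on stdin and print the response code from the
# ";; ->>HEADER<<-" line (NOERROR, SERVFAIL, ...).
dig_status() {
    sed -n 's/.*status: \([A-Z]*\),.*/\1/p' | head -n 1
}

# Query a resolver once; print a timestamped line only on SERVFAIL.
# $1 = name to resolve
# $2 = resolver address (assumed default: 127.0.0.1)
check_once() {
    name=$1
    server=${2:-127.0.0.1}
    # +noall +comments keeps only the comment lines, including the header
    status=$(dig a "$name" "@$server" +noall +comments | dig_status)
    if [ "$status" = "SERVFAIL" ]; then
        echo "$(date '+%Y-%m-%dT%H:%M:%S') SERVFAIL $name"
    fi
}
```

It could be run in a loop against the affected resolver, for example: while sleep 5; do check_once outlook.office365.com; done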
Some interesting debug log lines:

resolver: debug 3: resquery 0x7f26fecc8010 (fctx 0x7f26fecb4458(lb.geo.office365.com/A)): sent
resolver: debug 3: resquery 0x7f26fecc8010 (fctx 0x7f26fecb4458(lb.geo.office365.com/A)): response
resolver: debug 10: received packet:
resolver: debug 3: fctx 0x7f26fecb4458(lb.geo.office365.com/A): noanswer_response
resolver: debug 10: log_ns_ttl: fctx 0x7f26fecb4458: noanswer_response: lb.geo.office365.com (in 'office365.com'?): 1 172499
resolver: debug 10: log_ns_ttl: fctx 0x7f26fecb4458: DELEGATION: lb.geo.office365.com (in 'geo.office365.com'?): 0 172499
resolver: debug 3: fctx 0x7f26fecb4458(lb.geo.office365.com/A): cache_message
resolver: debug 3: fctx 0x7f26fecb4458(lb.geo.office365.com/A): [result: success] query canceled in response(); responding
resolver: debug 3: fctx 0x7f26fecb4458(lb.geo.office365.com/A): cancelquery
resolver: debug 3: fctx 0x7f26fecb4458(lb.geo.office365.com/A): nameservers now above QDOMAIN
resolver: debug 3: fctx 0x7f26fecb4458(lb.geo.office365.com/A): done
resolver: debug 3: fctx 0x7f26fecb4458(lb.geo.office365.com/A): stopeverything
resolver: debug 3: fctx 0x7f26fecb4458(lb.geo.office365.com/A): cancelqueries
resolver: debug 3: fctx 0x7f26fecb4458(lb.geo.office365.com/A): sendevents
client: error: query client=0x7f2700055ca0 thread=0x7f2709813700 (lb.geo.office365.com/A): query_find: unexpected error after resuming: SERVFAIL
query-errors: debug 1: client 127.0.0.1#35062 (outlook.office365.com): query failed (SERVFAIL) for outlook.office365.com/IN/A at query.c:7837

“nameservers now above QDOMAIN” sounds as if a geo.office365.com nameserver refers back to an office365.com nameserver? The thing is, though, that I cannot see any such response packet in tcpdump. Is this information then taken (wrongly) from the cache? The same log message appears every time, for every failing domain we have seen so far.
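To check whether that trace message really accompanies every failure, the debug log can be summarised per query name. This is a rough sketch that assumes the log lines have exactly the shape quoted above; the function name above_qdomain_fetches is made up for illustration.

```shell
#!/bin/sh
# Hypothetical helper, not part of BIND.

# Read a resolver debug log on stdin and print, for each query name/type,
# how many fetch contexts logged "nameservers now above QDOMAIN".
# Output format: "<count> <name>/<type>", one line per name.
above_qdomain_fetches() {
    # Keep only the matching lines and extract the text inside
    # "fctx 0x...( ... )", e.g. lb.geo.office365.com/A
    sed -n '/nameservers now above QDOMAIN/s/.*fctx 0x[0-9a-f]*(\([^)]*\)).*/\1/p' |
        sort | uniq -c | awk '{print $1, $2}'
}
```

Usage, with a placeholder log path: above_qdomain_fetches < /path/to/resolver-debug.log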
Note that this doesn’t seem to happen with an empty cache, and we are also not able to trigger it on a test machine. It only happens on loaded machines once the cache TTL of the queried record expires. We can reproduce it with the latest patch levels of both 9.10 and 9.9.

Regards,
Thomas

[1] http://intodns.com/geo.office365.com
_______________________________________________ Please visit https://lists.isc.org/mailman/listinfo/bind-users to unsubscribe from this list bind-users mailing list bind-users@lists.isc.org https://lists.isc.org/mailman/listinfo/bind-users