Dear lists [apologies if you receive two copies of this message],

I am in the process of implementing anycast recursive DNS service for our campus using a combination of servers running Bind 9.8.0 and Cisco's IP SLA feature. There are three identical Red Hat servers connected to three different routers with point-to-point /30 links. The servers are configured with an anycast address attached to an alias of the loopback interface:
[note: these are not the actual IP addresses]

  lo:1      Link encap:Local Loopback
            inet addr:192.168.32.32  Mask:255.255.255.255
            UP LOOPBACK RUNNING  MTU:16436  Metric:1

These caching servers are also configured as stealth slaves for our zones (using Bind's 'also-notify' option in our master). This allows us to serve the latest contents of our zones without having to wait for TTLs to expire.

In our tests, we've come across a very interesting but annoying problem. After several hours of operation, the servers start to respond inconsistently to queries for names that are CNAMEs. For example:

  # dig @192.168.32.32 www.uoregon.edu

  ; <<>> DiG 9.8.0-RedHat-9.8.0-4.uopel5 <<>> @192.168.32.32 www.uoregon.edu
  ; (1 server found)
  ;; global options: +cmd
  ;; Got answer:
  ;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 14280
  ;; flags: qr aa rd ra; QUERY: 1, ANSWER: 2, AUTHORITY: 6, ADDITIONAL: 4

  ;; QUESTION SECTION:
  ;www.uoregon.edu.               IN      A

  ;; ANSWER SECTION:
  www.uoregon.edu.        600     IN      CNAME   uowc-www.uoregon.edu.
  uowc-www.uoregon.edu.   86400   IN      A       192.168.142.125

  ;; AUTHORITY SECTION:
  uoregon.edu.            86400   IN      NS      phloem.uoregon.edu.
  uoregon.edu.            86400   IN      NS      bigdog.lsu.edu.
  uoregon.edu.            86400   IN      NS      sns-pb.isc.org.
  uoregon.edu.            86400   IN      NS      arizona.edu.
  uoregon.edu.            86400   IN      NS      ruminant.uoregon.edu.
  uoregon.edu.            86400   IN      NS      dns.cs.uoregon.edu.

  ;; ADDITIONAL SECTION:
  phloem.uoregon.edu.     86400   IN      A       192.168.32.35
  phloem.uoregon.edu.     86400   IN      AAAA    2001:468:d01:20::80df:2023
  ruminant.uoregon.edu.   86400   IN      A       192.168.60.22
  ruminant.uoregon.edu.   86400   IN      AAAA    2001:468:d01:3c::80df:3c16

  ;; Query time: 0 msec
  ;; SERVER: 192.168.32.32#53(192.168.32.32)
  ;; WHEN: Wed May 18 12:51:06 2011
  ;; MSG SIZE  rcvd: 300

  # dig @192.168.32.32 www.uoregon.edu

  ; <<>> DiG 9.8.0-RedHat-9.8.0-4.uopel5 <<>> @192.168.32.32 www.uoregon.edu
  ; (1 server found)
  ;; global options: +cmd
  ;; Got answer:
  ;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 34776
  ;; flags: qr aa rd ra; QUERY: 1, ANSWER: 1, AUTHORITY: 0, ADDITIONAL: 0

  ;; QUESTION SECTION:
  ;www.uoregon.edu.               IN      A

  ;; ANSWER SECTION:
  www.uoregon.edu.        600     IN      CNAME   uowc-www.uoregon.edu.

As you can see, the second response contains only the CNAME: it omits the A record for the CNAME target as well as the AUTHORITY and ADDITIONAL sections. This causes our users' machines to fail to resolve the A records because the resolver library does not query a second time. In this second type of response the server appears to be acting as an authoritative-only server, not as a caching recursive server.

Here are the most interesting details:

- We have only observed this happening when querying the anycast address, not the address associated with the ethernet interface.

- The behavior is independent of the network. We can replicate it by querying the anycast address from the server itself.

- Our production (non-anycast) servers run the exact same version of Bind with the exact same configuration, and we have never observed this problem.

- Bind's debugging output is exactly the same in both cases, so it offers no clues about the difference in responses.

- After restarting Bind, the problem goes away for several hours. It only comes back if the server receives query traffic during those hours; otherwise the problem does not happen.
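Since the problem only shows up after several hours of traffic, it may help to poll the anycast address and record when the first CNAME-only reply appears. The following is a minimal sketch, not something we run in production: it assumes Python 3.7 or later and a `dig` binary on the PATH, and the function names (`answer_types`, `is_cname_only_reply`, `poll`) are hypothetical names chosen for this illustration. It only inspects the ANSWER section of dig's text output.

```python
import subprocess
import time


def answer_types(dig_output):
    """Return the record types listed in the ANSWER SECTION of dig's text output."""
    types = []
    in_answer = False
    for raw in dig_output.splitlines():
        line = raw.strip()
        if line.startswith(";; ANSWER SECTION:"):
            in_answer = True
            continue
        if in_answer:
            # A blank line or a new ';;' comment line ends the section.
            if not line or line.startswith(";"):
                break
            fields = line.split()
            # Expected layout: name, TTL, class, type, rdata...
            if len(fields) >= 5:
                types.append(fields[3])
    return types


def is_cname_only_reply(dig_output, qtype="A"):
    """True when the answer has a CNAME but no record of the queried type."""
    types = answer_types(dig_output)
    return "CNAME" in types and qtype not in types


def poll(server, name, qtype="A", interval=5.0, attempts=100):
    """Run dig repeatedly; return the index of the first CNAME-only reply, or None.

    Assumption: 'dig' is installed and reachable via PATH.
    """
    for i in range(attempts):
        result = subprocess.run(
            ["dig", "@" + server, name, qtype],
            capture_output=True, text=True, timeout=15,
        )
        if is_cname_only_reply(result.stdout, qtype):
            return i
        time.sleep(interval)
    return None
```

For the case described above, one would call something like `poll("192.168.32.32", "www.uoregon.edu")` on the server itself and note the time at which it returns. Running the same loop against the ethernet interface address in parallel would show whether the two addresses really diverge at the same moment.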
Here's the options section of the config:

  options {
      version "9999.9.9";
      recursive-clients 5000;
      directory "/etc/named";
      allow-transfer { none; };
      blackhole { attackers; };
      listen-on-v6 { any; };
      allow-recursion { customers; };
      allow-query { any; };
      dnssec-enable yes;
      dnssec-validation yes;
  };

Bind is listening on the anycast address (in addition to its NIC IP address):

  # netstat -lnp |grep 192.168.32.32
  tcp        0      0 192.168.32.32:53      0.0.0.0:*      LISTEN      30771/named
  udp        0      0 192.168.32.32:53      0.0.0.0:*                  30771/named

These are the details of our Bind daemon (custom-built RPM, based on Fedora's source RPM):

  # named -V
  BIND 9.8.0-RedHat-9.8.0-4.uopel5 built with '--build=x86_64-redhat-linux-gnu' '--host=x86_64-redhat-linux-gnu' '--target=x86_64-redhat-linux-gnu' '--program-prefix=' '--prefix=/usr' '--exec-prefix=/usr' '--bindir=/usr/bin' '--sbindir=/usr/sbin' '--sysconfdir=/etc' '--datadir=/usr/share' '--includedir=/usr/include' '--libdir=/usr/lib64' '--libexecdir=/usr/libexec' '--sharedstatedir=/usr/com' '--mandir=/usr/share/man' '--infodir=/usr/share/info' '--with-libtool' '--localstatedir=/var' '--enable-threads' '--enable-ipv6' '--with-pic' '--disable-static' '--disable-openssl-version-check' '--enable-exportlib' '--with-export-libdir=/usr/lib64' '--with-export-includedir=/usr/include' '--includedir=/usr/include/bind9' 'build_alias=x86_64-redhat-linux-gnu' 'host_alias=x86_64-redhat-linux-gnu' 'target_alias=x86_64-redhat-linux-gnu' 'CFLAGS= -O2 -g -pipe -Wall -Wp,-D_FORTIFY_SOURCE=2 -fexceptions -fstack-protector --param=ssp-buffer-size=4 -m64 -mtune=generic' 'CPPFLAGS= -DDIG_SIGCHASE' 'CXXFLAGS=-O2 -g -pipe -Wall -Wp,-D_FORTIFY_SOURCE=2 -fexceptions -fstack-protector --param=ssp-buffer-size=4 -m64 -mtune=generic' 'FFLAGS=-O2 -g -pipe -Wall -Wp,-D_FORTIFY_SOURCE=2 -fexceptions -fstack-protector --param=ssp-buffer-size=4 -m64 -mtune=generic'
  using OpenSSL version: OpenSSL 0.9.8e-rhel5 01 Jul 2008
  using libxml2 version: 2.6.26

  # uname -a
  Linux adns1 2.6.18-238.9.1.el5 #1 SMP Fri Mar 18 12:42:39 EDT 2011 x86_64 x86_64 x86_64 GNU/Linux

  # cat /etc/redhat-release
  Red Hat Enterprise Linux Server release 5.6 (Tikanga)

I would really appreciate any help with this.

Thanks in advance,

_______________________________________________
bind-users mailing list
bind-users@lists.isc.org
https://lists.isc.org/mailman/listinfo/bind-users