We have a mystery. We're running a recursive resolver on RHEL6, using the latest RHEL-provided BIND package, bind-9.8.2-0.37.rc1.el6_7.6. The recursive resolver only has an IPv4 interface; it does not have an IPv6 interface. DNSSEC is enabled (by default).
Our recursive resolver periodically returns SERVFAIL for lookups for hhs.gov records, which are served by these nameservers:

rh202ns1.355.dhhs.gov. 168 IN A 158.74.30.98
rh202ns1.355.dhhs.gov. 14260 IN AAAA 2607:f220:0:1::2a
rh202ns2.355.dhhs.gov. 168 IN A 158.74.30.99
rh202ns2.355.dhhs.gov. 14260 IN AAAA 2607:f220:0:1::2b
rh120ns2.368.dhhs.gov. 81 IN A 158.74.30.103
rh120ns2.368.dhhs.gov. 81 IN AAAA 2607:f220:0:1::2d
rh120ns1.368.dhhs.gov. 168 IN A 158.74.30.102
rh120ns1.368.dhhs.gov. 14260 IN AAAA 2607:f220:0:1::2c

When this happens, BIND logs the following:

01-Mar-2016 09:10:02.064 lame-servers: info: error (network unreachable) resolving 'hhs.gov/MX/IN': 2607:f220:0:1::2c#53
01-Mar-2016 09:10:02.064 lame-servers: info: error (network unreachable) resolving 'hhs.gov/MX/IN': 2607:f220:0:1::2a#53
01-Mar-2016 09:10:02.064 lame-servers: info: error (network unreachable) resolving 'hhs.gov/MX/IN': 2607:f220:0:1::2d#53
01-Mar-2016 09:10:02.065 lame-servers: info: error (network unreachable) resolving 'hhs.gov/MX/IN': 2607:f220:0:1::2b#53
01-Mar-2016 09:10:02.065 lame-servers: info: error (network unreachable) resolving 'rh120ns2.368.dhhs.gov/A/IN': 2607:f220:0:1::2c#53
01-Mar-2016 09:10:02.065 lame-servers: info: error (network unreachable) resolving 'rh120ns1.368.dhhs.gov/A/IN': 2607:f220:0:1::2c#53
01-Mar-2016 09:10:02.066 lame-servers: info: error (network unreachable) resolving 'rh202ns1.355.dhhs.gov/A/IN': 2607:f220:0:1::2c#53
01-Mar-2016 09:10:02.066 lame-servers: info: error (network unreachable) resolving 'rh120ns2.368.dhhs.gov/A/IN': 2607:f220:0:1::2a#53
01-Mar-2016 09:10:02.066 lame-servers: info: error (network unreachable) resolving 'rh202ns2.355.dhhs.gov/A/IN': 2607:f220:0:1::2c#53
01-Mar-2016 09:10:02.066 lame-servers: info: error (network unreachable) resolving 'rh202ns1.355.dhhs.gov/A/IN': 2607:f220:0:1::2a#53
01-Mar-2016 09:10:02.066 lame-servers: info: error (network unreachable) resolving 'rh120ns1.368.dhhs.gov/A/IN': 2607:f220:0:1::2a#53
01-Mar-2016 09:10:02.066 lame-servers: info: error (network unreachable) resolving 'rh202ns2.355.dhhs.gov/A/IN': 2607:f220:0:1::2a#53
01-Mar-2016 09:10:02.066 lame-servers: info: error (network unreachable) resolving 'rh120ns2.368.dhhs.gov/A/IN': 2607:f220:0:1::2d#53
01-Mar-2016 09:10:02.066 lame-servers: info: error (network unreachable) resolving 'rh202ns2.355.dhhs.gov/A/IN': 2607:f220:0:1::2d#53
01-Mar-2016 09:10:02.067 lame-servers: info: error (network unreachable) resolving 'rh202ns1.355.dhhs.gov/A/IN': 2607:f220:0:1::2d#53
01-Mar-2016 09:10:02.067 lame-servers: info: error (network unreachable) resolving 'rh120ns2.368.dhhs.gov/A/IN': 2607:f220:0:1::2b#53
01-Mar-2016 09:10:02.067 lame-servers: info: error (network unreachable) resolving 'rh120ns1.368.dhhs.gov/A/IN': 2607:f220:0:1::2d#53
01-Mar-2016 09:10:02.067 lame-servers: info: error (network unreachable) resolving 'rh202ns2.355.dhhs.gov/A/IN': 2607:f220:0:1::2b#53
01-Mar-2016 09:10:02.067 lame-servers: info: error (network unreachable) resolving 'rh202ns1.355.dhhs.gov/A/IN': 2607:f220:0:1::2b#53
01-Mar-2016 09:10:02.067 lame-servers: info: error (network unreachable) resolving 'rh120ns1.368.dhhs.gov/A/IN': 2607:f220:0:1::2b#53

If I dump the cache, the only information in the cache for the nameservers in question are the AAAA records:

rh202ns1.355.dhhs.gov. 56878 AAAA 2607:f220:0:1::2a
rh202ns2.355.dhhs.gov. 56878 AAAA 2607:f220:0:1::2b
rh120ns1.368.dhhs.gov. 56878 AAAA 2607:f220:0:1::2c
rh120ns2.368.dhhs.gov. 56878 AAAA 2607:f220:0:1::2d

If I look at the queries the recursive resolver issued at the same time as this failure (which I captured via ngrep), I see it attempt to refresh the A records for the dhhs.gov nameservers by performing recursive resolution from the root servers. Based on the capture, everything appears to be legitimate. And indeed, I can successfully recursively resolve the A records for all 4 nameservers with "dig +trace +dnssec".
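
For anyone who wants to check their own cache for the same condition, here is a rough sketch of how a cache dump could be scanned for names that have an AAAA record but no A record. It is an illustration only, not something we run in production. It assumes the dump has one record per line in the form "owner TTL [IN] TYPE rdata", that a line starting with whitespace continues the previous owner, and that comments start with ";". The function name aaaa_only_names is made up for this example.

```python
def aaaa_only_names(dump_text):
    """Return owner names that have an AAAA record but no A record.

    Assumed line format: "owner TTL [IN] TYPE rdata". A line that begins
    with whitespace is treated as belonging to the previous owner name.
    """
    types_by_owner = {}
    owner = None
    for raw in dump_text.splitlines():
        line = raw.split(";", 1)[0].rstrip()  # drop comments
        if not line.strip() or line.lstrip().startswith("$"):
            continue  # skip blank lines and directives such as $DATE
        fields = line.split()
        if not raw[0].isspace():
            owner = fields.pop(0).lower()  # new owner name on this line
        if owner is None:
            continue
        # Skip the TTL and the optional class field.
        while fields and (fields[0].isdigit() or fields[0].upper() == "IN"):
            fields.pop(0)
        if fields:
            types_by_owner.setdefault(owner, set()).add(fields[0].upper())
    return sorted(
        name
        for name, types in types_by_owner.items()
        if "AAAA" in types and "A" not in types
    )
```

To use it, read the file written by "rndc dumpdb -cache" (its location depends on the dump-file option in named.conf) and pass the contents to aaaa_only_names(). On the first dump shown above it would report all four dhhs.gov nameservers.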
If I flush these records from the cache, then retry the hhs.gov query, it succeeds, and then the cache contains:

rh202ns1.355.dhhs.gov. 86114 A 158.74.30.98
                       86114 AAAA 2607:f220:0:1::2a
rh202ns2.355.dhhs.gov. 86114 A 158.74.30.99
                       86114 AAAA 2607:f220:0:1::2b
rh120ns1.368.dhhs.gov. 86114 A 158.74.30.102
                       86356 AAAA 2607:f220:0:1::2c
rh120ns2.368.dhhs.gov. 86114 A 158.74.30.103
                       86114 AAAA 2607:f220:0:1::2d

So: it seems like something goes wrong when BIND attempts to refresh the A records for the above nameservers, and as a result, BIND thinks that these nameservers only have AAAA addresses. Because our recursive resolver does not have an IPv6 interface, all queries for all zones served by the above nameservers (and there are a bunch more than just hhs.gov, alas) return SERVFAIL.

We can work around this by adding a cron job to call "rndc flushname" on the above records when queries for hhs.gov return SERVFAIL. But we'd really love to know why this happens in the first place.

Can anyone else reproduce this? (E.g., set up a cron job on an IPv4-only host to run "dig hhs.gov mx" every 5 minutes or so, and see when/if the dig starts returning SERVFAIL.)

Is something subtly broken with the DNS resolution path for these nameservers? Have we misconfigured our recursive resolver in some way? Is there a bug in the version of BIND we're running? Something else?

Any thoughts/guesses appreciated.
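
In case it helps anyone trying to reproduce this, the cron workaround described above could look roughly like the following. This is a minimal sketch, not a tested production script. It assumes Python 3.5 or later (which is not the stock Python on RHEL6), that the resolver listens on 127.0.0.1, and that dig and rndc are on the PATH. The function names and the runner parameter are made up for this example. The script reads the rcode from the ";; ->>HEADER<<-" line of dig's output and only flushes when that rcode is SERVFAIL.

```python
import re
import subprocess

# Nameserver names taken from the post; adjust for other affected zones.
NAMESERVERS = [
    "rh202ns1.355.dhhs.gov",
    "rh202ns2.355.dhhs.gov",
    "rh120ns1.368.dhhs.gov",
    "rh120ns2.368.dhhs.gov",
]

STATUS_RE = re.compile(r"status:\s*([A-Z]+)")


def dig_status(output):
    """Return the rcode from dig's ';; ->>HEADER<<-' line, or None."""
    for line in output.splitlines():
        if "->>HEADER<<-" in line:
            match = STATUS_RE.search(line)
            if match:
                return match.group(1)
    return None


def run(cmd):
    """Run a command and return its standard output as text."""
    result = subprocess.run(
        cmd, stdout=subprocess.PIPE, universal_newlines=True, timeout=30
    )
    return result.stdout


def check_and_flush(qname="hhs.gov", qtype="MX", server="127.0.0.1", runner=run):
    """Query the resolver; on SERVFAIL, flush each nameserver name.

    Returns the list of names that were flushed (empty if no SERVFAIL).
    """
    status = dig_status(runner(["dig", "@" + server, qname, qtype]))
    if status != "SERVFAIL":
        return []
    flushed = []
    for name in NAMESERVERS:
        runner(["rndc", "flushname", name])
        flushed.append(name)
    return flushed
```

Calling check_and_flush() from a small wrapper script run by cron every 5 minutes, and logging the returned list, would also record when the failure starts, which is the data point we're after.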