I apologize for being so slow to respond; it has been "one of those weeks", but I do very much appreciate everyone's comments.
The machines are on the same LAN, and there is no evidence of unusual load or dropped packets on the network. We do not have any firewall rules restricting DNS traffic. We _do_ have state-aware firewalld rules on the machines that should not apply to DNS traffic: the default baseline firewalld rules that come to us from Red Hat include rules that check state.

The reason I mention rules that should not apply to DNS traffic is that I found one of my colleagues who worked on this problem a few years back. They found at the time that merely loading the conntrack module, triggered by _any_ rule, caused the fault, regardless of which rule actually used the data. That was a major release or two of Red Hat (and everything else) ago, so we are not assuming that this is necessarily the exact same problem. To test it on our current systems, though, I'll have to temporarily pull a server out of prod (the problem is not reproducible under the load we get in test), so that will be next week at this point.

(They solved the problem last time by disabling _all_ state-aware rules from iptables, but my sysadmins are resisting a similar approach this time, so I am attempting to find an alternate solution...)

Thanks,

- rob.

From: Matthew Pounsett <m...@conundrum.com>
Sent: Friday, May 18, 2018 11:08 AM
To: Rob Moser
Cc: bind-users@lists.isc.org
Subject: Re: Intermittent "failure trying master... operation canceled" on zone refresh

On 17 May 2018 at 17:05, Rob Moser <rob.mo...@nau.edu> wrote:

We're running a series of RHEL 7.4 machines (kernel version 3.10.0-693.1.1.el7.x86_64) running bind version 9.9.4-RedHat-9.9.4-51.el7. Our configuration consists of a hidden master and three hidden slave/recursive resolvers.
I'm getting a LOT of errors on the slaves that look like:

17-May-2018 13:27:28.421 general: info: zone 34.22.10.in-addr.arpa/IN/internal-view: refresh: failure trying master 10.20.30.3#53 (source 0.0.0.0#0): operation canceled

In addition to checking for firewalls and other stateful network devices as Tony mentions, you should also have a look at the condition of the network in between the hosts. That feels a lot like moderate packet loss, or extreme latency, to me. Are these machines all on the same LAN? Are there multiple networks in between them?
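
For anyone who wants to check whether these failures cluster at busy times (which would fit the "only reproducible under prod load" observation), here is a minimal Python sketch that tallies the refresh failures per hour and per master. It assumes the log lines look exactly like the one quoted above; the regular expression and the function names `parse_failure` and `tally_failures` are illustrative only, not anything that ships with BIND, and other logging configurations may need a different pattern.

```python
import re
from collections import Counter

# Pattern for named's "refresh: failure trying master" messages, modelled on
# the single log line quoted in this thread (an assumption; adjust as needed).
FAILURE_RE = re.compile(
    r"^(?P<date>\d{2}-\w{3}-\d{4}) (?P<time>\d{2}:\d{2}:\d{2})\.\d+ "
    r"general: info: zone (?P<zone>\S+?)/IN(?:/(?P<view>\S+))?: "
    r"refresh: failure trying master (?P<master>\S+)#(?P<port>\d+) "
    r"\(source [^)]*\): (?P<reason>.+)$"
)


def parse_failure(line):
    """Return the fields of one failure line as a dict, or None if it is not one."""
    match = FAILURE_RE.match(line.strip())
    return match.groupdict() if match else None


def tally_failures(lines):
    """Count refresh failures per (date, hour) and per master address."""
    by_hour = Counter()
    by_master = Counter()
    for line in lines:
        record = parse_failure(line)
        if record is None:
            continue  # skip anything that is not a refresh failure
        by_hour[(record["date"], record["time"][:2])] += 1
        by_master[record["master"]] += 1
    return by_hour, by_master
```

To use it, pass any iterable of log lines, for example an open file handle on whatever file your logging channel writes to, and compare the hourly counts against query load for the same hours.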