I apologize for being so slow to respond; it has been "one of those weeks", but 
I do very much appreciate everyone's comments.
The machines are on the same LAN, and there is no evidence of unusual load or 
dropped packets on the network.

We do not have any firewall rules restricting DNS traffic.  We _do_ have 
state-aware firewalld rules on the machines, but they should not apply to DNS 
traffic: the default baseline firewalld rules that come to us from Red Hat 
include rules that check state.

The reason that I mention rules that should not apply to DNS traffic is that I 
found one of my colleagues who worked on this problem a few years back.  They 
found at the time that the fault was caused by _any_ rule loading the conntrack 
module, regardless of which rule actually used the data.  That was a major 
release or two of Red Hat (and everything else) ago, so we are not assuming 
that this is necessarily the exact same problem, but to test it on our current 
systems I'll have to temporarily pull a server out of prod (the problem is not 
reproducible under the load we get in test).  So that test will have to wait 
until next week.

(They solved the problem last time by disabling _all_ state-aware rules from 
iptables, but my sysadmins are resisting a similar approach this time, so I am 
attempting to find an alternate solution...)
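
As a cheap first look at the conntrack angle, here is a minimal Python sketch 
that reports how full the connection-tracking table is.  It assumes the usual 
Linux sysctl files nf_conntrack_count and nf_conntrack_max under 
/proc/sys/net/netfilter, which are normally present only while the 
nf_conntrack module is loaded.  The function name conntrack_usage is made up 
for illustration, and a full table is only one way conntrack could interfere, 
so treat this as a quick check rather than a diagnosis.

```python
from pathlib import Path


def conntrack_usage(base="/proc/sys/net/netfilter"):
    """Return (count, limit, fraction_used), or None if the files are absent.

    Absent files usually mean the nf_conntrack module is not loaded
    (an assumption; verify on your own system).
    """
    base = Path(base)
    try:
        count = int((base / "nf_conntrack_count").read_text())
        limit = int((base / "nf_conntrack_max").read_text())
    except FileNotFoundError:
        return None
    # Guard against a zero limit so the division cannot fail.
    fraction = count / limit if limit else 0.0
    return count, limit, fraction
```

Calling conntrack_usage() with no arguments reads the live values; a result of 
None suggests conntrack is not loaded at all, while a fraction close to 1.0 
would be worth comparing against the timestamps of the refresh failures.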

Thanks,

     - rob.




From: Matthew Pounsett <m...@conundrum.com>
Sent: Friday, May 18, 2018 11:08 AM
To: Rob Moser
Cc: bind-users@lists.isc.org
Subject: Re: Intermittent "failure trying master... operation canceled" on zone 
refresh

On 17 May 2018 at 17:05, Rob Moser <rob.mo...@nau.edu> wrote:

We're running a series of RHEL 7.4 machines (kernel version 
3.10.0-693.1.1.el7.x86_64) running bind version 9.9.4-RedHat-9.9.4-51.el7.  Our 
configuration consists of a hidden master and three hidden slave/recursive 
resolvers.  I'm getting a LOT of errors  on the slaves that look like:

17-May-2018 13:27:28.421 general: info: zone 
34.22.10.in-addr.arpa/IN/internal-view: refresh: failure trying master 
10.20.30.3#53 (source 0.0.0.0#0): operation canceled
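
To put a number on "a LOT", the failures can be tallied per zone and master.  
The following is a minimal Python sketch, assuming each log entry sits on a 
single line in the format shown above (it is wrapped here only by the mail 
client).  The function name count_refresh_failures and the regular expression 
are illustrative, not anything that ships with bind.

```python
import re
from collections import Counter

# Matches e.g. "zone <zone>: refresh: failure trying master <addr>#<port>
# (source <addr>#<port>): <reason>" anywhere in a log line.
FAILURE_RE = re.compile(
    r"zone (?P<zone>\S+): refresh: failure trying master "
    r"(?P<master>\S+) \(source (?P<source>\S+)\): (?P<reason>.+)$"
)


def count_refresh_failures(lines):
    """Count refresh failures keyed by (zone, master, reason)."""
    counts = Counter()
    for line in lines:
        match = FAILURE_RE.search(line.strip())
        if match:
            key = (match["zone"], match["master"], match["reason"])
            counts[key] += 1
    return counts
```

Feeding it an open log file (for example, count_refresh_failures(open(path)) 
with path pointing at the named log) and printing counts.most_common() shows 
whether the failures cluster on particular zones or are spread evenly.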
   
In addition to checking for firewalls and other stateful network devices as 
Tony mentions, you should also have a look at the condition of the network in 
between the hosts.  That feels a lot like moderate packet loss, or extreme 
latency, to me.  


Are these machines all on the same LAN?  Are there multiple networks in between 
them?



       