Hey all,

I'm reaching out because I'm at a loss. I have a distributed DNS architecture with two bind-9.7.2-P3 servers behind an F5 load balancer at one location, and another two behind a second F5 at another location.
My app servers are configured with a resolv.conf that looks like this (please ignore the domain and networks, they have been altered):

    search gc.domain.net
    domain gc.domain.net
    nameserver 1.1.1.15
    nameserver 1.1.2.56
    options timeout:1

What I'm finding is that a ton of requests are being made to the 1.1.2.56 address. The servers at 1.1.1.15 (again, behind the F5) are healthy: no retransmissions, no excessive load, nothing that tells me they are having issues. Yet my app servers seem to fail to get an answer from them and fail over to the secondary DNS servers, and I can't figure out why.

If I run a script that does a dig, I can't get it to fail over to the secondary DNS. But application code that uses gethostbyname, or the host command, seems to hit a lookup failure and fails over to the secondary servers, which are in fact across the internet.

Is there a documented method to troubleshoot or debug why a system believes it was unable to get an acceptable result from the primary DNS server? There don't appear to be any health-related issues. I feel the DNS infrastructure is healthy, but at this point I need some help proving that it isn't, so that I can fix it.

I added the 1-second timeout because I was seeing 5-second delays in our application. Those were caused by the resolver waiting for the primary server to respond before it could fail over. Now it goes to the secondary DNS after one second and seems to be happy most of the time, although I'm also getting some hard failures that I'm still trying to troubleshoot.

Thanks,
Tory
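
One way to narrow this down, offered only as a rough sketch: time lookups that go through the system stub resolver (the gethostbyname path, not dig) and flag any that take at least as long as the configured timeout, since those are the ones that probably waited on the first nameserver before moving on. The helper names (parse_resolv_conf, time_lookup, looks_like_failover) and the hostname in the usage note are hypothetical, and the "elapsed >= timeout" rule is a heuristic, not a statement of how the resolver decides to fail over. The fallback of 5 seconds is the default documented in resolv.conf(5).

```python
import socket
import time

# The resolv.conf from the post (addresses and domain already altered there).
SAMPLE = """\
search gc.domain.net
domain gc.domain.net
nameserver 1.1.1.15
nameserver 1.1.2.56
options timeout:1
"""


def parse_resolv_conf(text):
    """Return (nameservers, options) from resolv.conf-style text."""
    nameservers, options = [], {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line[0] in "#;":  # blank line or comment
            continue
        fields = line.split()
        if fields[0] == "nameserver" and len(fields) > 1:
            nameservers.append(fields[1])
        elif fields[0] == "options":
            for opt in fields[1:]:
                key, _, value = opt.partition(":")
                options[key] = value or True  # bare flags become True
    return nameservers, options


def time_lookup(name, resolver=socket.gethostbyname, clock=time.monotonic):
    """Resolve name via the stub resolver; return (seconds, result_or_error)."""
    start = clock()
    try:
        result = resolver(name)
    except OSError as exc:  # socket.gaierror and socket.herror subclass OSError
        result = exc
    return clock() - start, result


def looks_like_failover(elapsed, options):
    """Heuristic (assumption): a lookup that took at least one full timeout
    probably waited on the first nameserver before trying the next."""
    timeout = float(options.get("timeout", 5))  # 5 s is the documented default
    return elapsed >= timeout
```

Run in a loop on an affected app server, for example calling time_lookup("somehost.gc.domain.net") once a second and logging the timestamp whenever looks_like_failover returns True, this would give a list of times to line up against a packet capture on the client and the query logs on the servers behind 1.1.1.15.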