I have just implemented DNS anycasting on our inside network using Cisco content switches to monitor the health of the servers and to advertise an OSPF route when the back-end services are alive. I have three CSSs simultaneously advertising the same service address to the network, and clients get routed to the nearest one. It works great.

Anyone else try this?

When I was testing, I sent 2000 queries per second from each of two sources simultaneously, on diverse parts of the network, and proceeded to start disconnecting and reconnecting cables on the content switches to see how well it all worked. No matter what I did, I could not seem to lose more than 10 packets per link-state change (which is very good in my mind). But when I stopped the services on the actual servers, it took up to 5 seconds before the content switch registered the fault (because the keepalives are currently configured for every 5 seconds), and I lost thousands of queries in those few seconds.
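As a rough check on the size of that loss, the arithmetic can be sketched as below. This is only an upper bound under stated assumptions: both test sources are routed to the same failed server, every query sent during the outage window is lost, and the switch registers the fault at the first keepalive after the failure. The function name `queries_lost` is hypothetical, chosen for illustration.

```python
def queries_lost(qps_per_source, sources, outage_seconds):
    """Upper bound on queries lost while a dead server still attracts traffic.

    Assumes every query sent during the outage window reaches the dead
    server and goes unanswered.
    """
    return qps_per_source * sources * outage_seconds


# Worst case: the service stops just after a keepalive succeeded,
# so the full 5-second keepalive period passes before detection.
worst_case = queries_lost(2000, 2, 5)

# Assumed average case: the failure lands halfway through the period.
average_case = queries_lost(2000, 2, 2.5)

print(worst_case, average_case)
```

The loss scales linearly with the detection window, so shortening the keepalive period from 5 to 3 seconds would cut this bound proportionally.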

I am considering reducing the keepalive period to improve this fault response, but I'd like to get a better understanding of DNS client behavior when its queries go unanswered.

From what I recall, the typical DNS client will send a single query packet to its first-configured DNS resolver and wait 1 second for a response. If no response comes, the DNS client sends a second query to the same DNS resolver and waits either 1 second or 2 seconds for a response, depending on whether the client is progressive or not. If still no response comes, most DNS clients will ask the same DNS resolver one last time, and wait either 1 more second or 4 seconds, depending on the client. And perhaps some non-progressive DNS clients try a fourth time. If still no response comes, then the DNS client starts from the beginning with the second-configured DNS resolver.
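The retry behavior described above can be laid out as a timeline. This is a minimal sketch of the behavior as recalled in this message, not a statement of what any particular resolver library does; the timeout lists and the name `retry_timeline` are assumptions for illustration.

```python
def retry_timeline(timeouts):
    """Return (send_offsets, failover_offset) in seconds for one resolver.

    timeouts holds the wait after each attempt. Each new attempt is sent
    when the previous wait expires; after the last wait expires the client
    moves on to the next configured resolver.
    """
    send_offsets = []
    elapsed = 0
    for timeout in timeouts:
        send_offsets.append(elapsed)
        elapsed += timeout
    return send_offsets, elapsed


# Assumed schedules, taken from the description in this message.
PROGRESSIVE = [1, 2, 4]   # wait doubles after each unanswered query
FIXED = [1, 1, 1]         # non-progressive client, three attempts

for name, timeouts in (("progressive", PROGRESSIVE), ("fixed", FIXED)):
    sends, failover = retry_timeline(timeouts)
    print(f"{name}: sends at {sends} s, next resolver at {failover} s")
```

Under these assumptions a progressive client stays with the first resolver for 7 seconds, and a non-progressive client with three attempts for 3 seconds.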

If this is true, then I would think a keepalive period of 3 seconds ought to divert queries away from dead servers fast enough to satisfy the vast majority of DNS client requests before the clients fail over to the second-configured DNS resolver.
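One way to test that reasoning is to ask, for a query first sent some number of seconds before the switch detects the fault, whether any retry goes out at or after detection. The sketch below assumes the fault is detected at the first keepalive after the failure and that the route is withdrawn instantly at that point (OSPF convergence time is ignored); the retry schedules are the ones recalled in this message, and the function names are hypothetical.

```python
def send_offsets(timeouts):
    """Offsets in seconds, from the first query, at which attempts are sent."""
    offsets = []
    elapsed = 0
    for timeout in timeouts:
        offsets.append(elapsed)
        elapsed += timeout
    return offsets


def rescued_before_failover(seconds_until_detection, timeouts):
    """True if at least one attempt is sent at or after fault detection.

    seconds_until_detection is measured from the client's first query.
    Assumes traffic is diverted the instant the fault is detected.
    """
    return any(offset >= seconds_until_detection
               for offset in send_offsets(timeouts))


# Worst case for a 3-second keepalive: the first query is sent just as
# the server dies, a full 3 seconds before detection.
print(rescued_before_failover(3, [1, 2, 4]))  # progressive schedule
print(rescued_before_failover(3, [1, 1, 1]))  # non-progressive schedule

# A query first sent 2 seconds before detection.
print(rescued_before_failover(2, [1, 1, 1]))
```

Trying a range of values for `seconds_until_detection` between 0 and the keepalive period gives a feel for what fraction of queries sent during the outage would still be answered by the first-configured address.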

Any comments?

And despite what I have read about DNS clients over the years, what I have experienced in real life has left me uncertain about what really happens. Typically, prior to this anycast deployment, when our first-configured DNS resolver went down, users complained about waiting 60 to 90 seconds before their web pages would come up. That does not make sense to me because I thought the second-configured resolver would be used within a few seconds.

Can anyone suggest why real life doesn't reflect what is written?

Thanks.

--
Gordon A. Lang

_______________________________________________
bind-users mailing list
bind-users@lists.isc.org
https://lists.isc.org/mailman/listinfo/bind-users
