So the cache servers are HA behind something (an F5 LTM, a Cisco LocalDirector, or similar). Are the authoritative servers? It would seem sensible to do the same with them: that way a timeout only occurs if the whole HA cluster is unavailable. You can alleviate even that situation by seeding the cache servers every (TTL minus some margin) minutes (a sketch of that follows the quoted message below), or by slaving the zones on the cache servers.
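As a rough illustration of the slaving approach - a minimal sketch only, assuming BIND 9.6-style syntax and a hypothetical master at 10.1.0.10 - each cache would carry something like this in named.conf:

  zone "xxx.com" {
      type slave;
      masters { 10.1.0.10; };
      file "slaves/xxx.com.db";
  };

With the zone held locally, lookups inside xxx.com are answered by the cache box itself, so the client-side timeout question largely disappears for those names.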
On 14/09/10 11:34 AM, "Howard Wilkinson" <how...@cohtech.com> wrote:

> I have been working on building out a couple of large data centres and have
> been struggling with how to set up the systems so that we get a highly
> resilient, highly responsive DNS service in the presence of failing
> equipment.
>
> The configuration we have adopted includes a layer of BIND 9.6.x servers
> that act as pure name server caches. We have six of these servers in each
> data centre, paired to provide service on VIPs so that if one of a pair
> fails the other cache takes over.
>
> Our resolv.conf is of the following form:
>
> search xxx.com yyy.com
> nameserver 10.1.1.1
> nameserver 10.1.2.1
> nameserver 10.1.3.1
> options timeout:1 attempts:15 no-check-names rotate
>
> The name servers are thus on different networks within the DCs.
>
> Our first problem arises because the timeouts seem to be taken serially on
> each server, rather than the rotate option applying between each name
> server request. Is this what I should have expected, i.e. a 15-second
> timeout before the next server is tried in sequence?
>
> The second problem we face is that even if we could get a one-second
> timeout, this is orders of magnitude too slow for names that should be
> resolved within our local name space. In other words, for lookups within
> the xxx.com and yyy.com domains I would like to see timeouts in the
> microsecond range.
>
> Thinking further about this problem, I have been considering whether the
> resolver should be multi-threaded or parallelised in some way, so that it
> tries all of the servers at once and accepts the first to respond. I have
> come to the conclusion that this would be too difficult to make resilient
> in the general use of the resolver code, but it would make sense if the
> lwresd layer is added to the equation.
>
> Which brings me on to the use of lwresd. This would reduce the incidence
> of problems with non-responsive servers, in that it would detect and
> switch to an alternative server on the first failed attempt. However, this
> still means that if lwresd has not detected the down server then we get a
> stall in response within the data centre.
>
> So my questions are:
>
> 1. Does anybody have experience building such systems, and suggestions on
> how we should tune the clients and servers to make the system less fragile
> in the presence of hardware, software and network failures?
>
> 2. Is it possible with lwresd as it is written today to get the effect of
> precognition - i.e. can I get lwresd to notice that a server has gone down
> or has come back up without it needing to be triggered by a resolver
> request?
>
> 3. Does anybody know if I can configure lwresd to expect particular zones
> to be resolved within very small windows and use this to fail over to the
> next server?
>
> And for discussion, I wonder if there would be room to add to the resolver
> code and/or lwresd additional options of the form
>
> options zone-timeout: xxx.com:1usec
>
> or something similar, whereby the resolver could be told that if the cache
> does not respond within this time for that particular zone then it can be
> assumed that the server is misbehaving.
>
> Thank you for your attention.
>
> Regards, Howard.
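Coming back to the seeding idea above: a minimal sketch, assuming the dnspython library (Python 3) and a hypothetical list of critical names and cache VIPs. Run something like this from cron at a little under the zone's TTL so the answers never fall out of cache:

  #!/usr/bin/env python3
  # Hypothetical cache-seeding job: re-query critical names against each
  # local cache shortly before their TTLs expire, so the answers stay
  # warm even if an upstream authoritative server later stalls.
  import dns.resolver  # dnspython

  CACHES = ["10.1.1.1", "10.1.2.1", "10.1.3.1"]  # cache VIPs from resolv.conf
  NAMES = ["app1.xxx.com", "db1.yyy.com"]        # hypothetical critical names

  for cache in CACHES:
      resolver = dns.resolver.Resolver(configure=False)
      resolver.nameservers = [cache]
      resolver.timeout = 1.0    # per-try timeout, seconds
      resolver.lifetime = 2.0   # total time budget per lookup
      for name in NAMES:
          try:
              answer = resolver.resolve(name, "A")
              print(cache, name, [r.address for r in answer])
          except Exception as exc:  # timeout, NXDOMAIN, etc.
              print(cache, name, "FAILED:", exc)

A failed query here also doubles as cheap monitoring: if a cache stops answering for the local zones, you find out on the seeding schedule rather than from a stalled application.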
--
Kal Feher