I have been working on building out a couple of large data centres and have been struggling with how to set up the systems so that we get a highly resilient, highly responsive DNS service in the presence of failing equipment.

The configuration we have adopted includes a layer of BIND 9.6.x servers that act as pure caching name servers. We have six of these servers in each data centre, paired to provide service on VIPs so that if one of a pair fails the other cache takes over. Our resolv.conf is of the following form:

    search xxx.com yyy.com
    nameserver 10.1.1.1
    nameserver 10.1.2.1
    nameserver 10.1.3.1
    options timeout:1 attempts:15 no-check-names rotate

The name servers are thus on different networks within the DCs.

Our first problem arises because the timeouts seem to be taken serially on each server, rather than the rotate applying between each name server request. Is this what I should have expected, i.e. a 15 second timeout before the next server is tried in sequence?

The second problem we face is that even if we could get a one second timeout, this is orders of magnitude too slow for names that should be resolved within our local name space. In other words, for lookups within the xxx.com and yyy.com domains I would like to see timeouts in the microsecond range.

Thinking further about this problem, I have been considering whether the resolver should be multi-threaded or parallelised in some way, so that it tries all of the servers at once and accepts the first to respond. I have come to the conclusion that this would be too difficult to make resilient in the general use of the resolver code, but would make sense if the lwresd layer is added to the equation.

Which brings me on to the use of lwresd. This would reduce the incidence of problems with non-responsive servers, in that it would detect and switch to an alternative server on the first failed attempt. However, this still means that if lwresd has not yet detected the down server then we get a stall in response within the data centre.

So my questions are:

1. Does anybody have any experience in building such systems, and suggestions on how we should tune the clients and servers to make the system less fragile in the presence of hardware, software and network failures?

2. Is it possible with lwresd as it is written today to get the effect of precognition, i.e. can I get lwresd to notice that a server has gone down or has come back up without it needing to be triggered by a resolver request?

3. Does anybody know if I can configure lwresd to expect particular zones to be resolved within very small windows, and use this to fail over to the next server?

And for discussion, I wonder if there would be room to add to the resolver code and/or lwresd additional options of the form

    options zone-timeout: xxx.com:1usec

or something similar, whereby the resolver could be told that if the cache does not respond about that particular zone within this time, then it can be assumed that the server is misbehaving.

Thank you for your attention.

Regards,
Howard.
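
P.S. For discussion, here is a rough sketch of the "try all of the servers at once and accept the first to respond" idea mentioned above. It is only an illustration and not resolver or lwresd code: the function names build_query and first_response are made up for this example, it uses plain UDP from the Python standard library, it only asks for an A record, and it does no retries, TCP fallback or validation of the answer beyond matching the query ID.

```python
import select
import socket
import struct
import time


def build_query(name, qid):
    """Build a minimal DNS query for an A record with the RD bit set."""
    # Header: ID, flags (0x0100 = recursion desired), QDCOUNT=1, other counts 0.
    header = struct.pack("!HHHHHH", qid, 0x0100, 1, 0, 0, 0)
    labels = b"".join(
        bytes([len(part)]) + part.encode("ascii")
        for part in name.rstrip(".").split(".")
    )
    # Root label, then QTYPE=1 (A) and QCLASS=1 (IN).
    return header + labels + b"\x00" + struct.pack("!HH", 1, 1)


def first_response(servers, packet, timeout):
    """Send packet to every (host, port) in servers at once.

    Returns (server, reply) for the first reply whose ID matches the query,
    or None if nothing matching arrives within timeout seconds.
    """
    socks = {}
    try:
        for server in servers:
            s = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
            try:
                s.connect(server)  # connected UDP: only this peer's replies arrive
                s.send(packet)
            except OSError:
                s.close()          # unreachable right away, skip this server
                continue
            socks[s] = server

        deadline = time.monotonic() + timeout
        while socks:
            remaining = deadline - time.monotonic()
            if remaining <= 0:
                return None
            ready, _, _ = select.select(list(socks), [], [], remaining)
            for s in ready:
                try:
                    reply = s.recv(4096)
                except OSError:
                    # For example an ICMP "port unreachable" reported on recv.
                    socks.pop(s)
                    s.close()
                    continue
                if reply[:2] == packet[:2]:  # query ID matches
                    return socks[s], reply
        return None
    finally:
        for s in socks:
            s.close()
```

With the addresses from the resolv.conf above, a call might look like first_response([("10.1.1.1", 53), ("10.1.2.1", 53), ("10.1.3.1", 53)], build_query("host.xxx.com", 0x1234), 0.05), where host.xxx.com and the 50 ms budget are placeholders and a real client would pick a random query ID. The obvious cost is that every lookup is sent to all of the caches, which is part of why this seems better suited to a local daemon than to the general resolver code.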