Fabien Seisen wrote:
>> This doesn't sound like a hugely loaded server,
>
> exactly - in my own tests (with "real life" queries), the server can handle
> ~70000 queries/s with a response time of ~1ms at 70% cpu and no packet loss.
>
>> else it's somewhat throttled (not particularly large cache and probably
>> default limit on recursive clients). What kind of query rates do you
>> have? Do you get any logging that suggests resource problems? If so,
>> you might need to increase some of the limits.
>
> We have a pool of several more or less identical servers with a
> load-balancer in front.
>
> On average, each server gets 1800 queries/s and 4000 at peak.
>
> The problem occurs every few weeks and never on all servers at the same time.
>
> The recursive-clients config is not modified (rndc status: recursive clients:
> 188/2900/3000) and we have
> - on avg: 200 recursive clients
> - at peak: 600
OK - so reasonably well-loaded, but not struggling, and it doesn't sound like
it should be hitting any resource problems (although I think your
max-cache-size might be on the small side - would you consider increasing it?
Are you doing a 64-bit or 32-bit build?)

>> It's intriguing that you're seeing the same issues on two bind versions
>> and two OS (and that other people's experience is different from yours)
>
> only Solaris 10:
> - Solaris 10 U6 with bind 9.5.1-P3 with threads, compiled with SUNSpro 12
> - Solaris 10 U6 with bind 9.6.2 with threads, compiled with gcc
>
>> - it suggests to me that it's specific to your configuration or client
>> base/queries or your environment.
>
> we get real-life queries from customers (evil?).

Well, the nameserver is there to answer queries - good, bad, ill-considered,
typos, etc. And it should accommodate them all.

> A simple "rndc flush" revives named.
>
> Perhaps a badly formatted packet freezes named or creates a cache deadlock?
>
> Can something go wrong in the cache?

Yes, sometimes there can be cache contamination (usually confined to a
particular domain though, and due to admin mistakes on the part of that
domain's owners). It would surprise me to find cache contamination with this
far-reaching an effect, although it's not unknown where you're using
forwarding and your forwarders use different root or high-level domain NS
records.

It's interesting that rndc flush clears the problem - so it might be
cache-related. You could take a 'normal' cache dump and then a cache dump when
the problem is ongoing. Look particularly for NS/A record pairs fairly high up
in your resolution path (I say 'resolution path' because I don't know if
you're resolving directly or via any forwarding) that are incorrect. Use
"rndc dumpdb -all". Good luck - it can be a bit like looking for a needle in a
haystack.

> I am not fluent with core files but I have got one in my pocket.

A core file is useful for seeing what named was doing at the instant it was
created. It may or may not be useful in this case because it's only a
snapshot: it would show you a deadlock, for example, but where named is not
hung - just not doing what you expect - the snapshot is often not what you
need for troubleshooting. You would need gdb or dbx to analyse it, along with
the exact same binary that created it, and preferably on the same box (so that
the dynamic libs match).
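For example, something along these lines (a sketch only - the paths below are
made up, and you need the exact binary that produced the core; for the
gcc-built 9.6.2 gdb should cope, for the SUNSpro build dbx may suit better):

  gdb /usr/local/sbin/named /var/cores/core.named
  (gdb) info threads
  (gdb) thread apply all bt

"thread apply all bt" prints a stack trace for every thread: if several of
them are parked in mutex or condition-wait calls on the same lock, that points
at a deadlock; if they all look idle, the snapshot probably won't tell you
much.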
>> For troubleshooting I'd start by looking at the logging output - if
>> you've got any categories going to null, un-suppress them temporarily;
>> and add query-errors (see 9.6.2 ARM). Then perhaps do some sampling of
>> network traffic (perhaps there's a UDP message size/fragmentation issue)
>> to see what's happening (or not).
>
> all categories go to non-null channels and we do not use any 9.6.2-specific
> configuration.
>
> I did not notice any weird log messages (besides the regular: shutting down
> due to TCP receive error: 202.96.209.6#53: connection reset)
>
> here is our log config:
> category client { client.log; };
> category config { config.log; default_syslog; };
> category database { database.log; default_syslog; };
> category default { default.log; default_syslog; };
> category delegation-only { delegation-only.log; };
> category dispatch { dispatch.log; };
> category general { default.log; };
> category lame-servers { lamers.log; };
> category network { network.log; };
> category notify { notify.log; default_syslog; };
> category queries { queries.log; };
> category resolver { resolver.log; };
> category security { security; };
> category unmatched { unmatched.log; };
> category update { update.log; };
> category xfer-in { xfer-in.log; default_syslog; };
> category xfer-out { xfer-out.log; default_syslog; };

The other side of this is the various logging channels used by these
categories - what level are they logging at?

I would definitely recommend the new category query-errors for your 9.6.2
build. Set it up to log to its own channel with severity dynamic; then, when
things start to go wrong, increase the trace level via rndc so that it logs at
debug level 2 and see if there are any clues in what you're seeing. (I'd also
recommend sampling what's output here while named is running normally -
failing to resolve sometimes is expected behaviour! Also note that debug
level 2 can be rather busy in the other categories too.)

Cathy
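PS. A query-errors setup could look something like this (a sketch only - the
channel name and the file/versions/size values are just examples following
your existing naming convention, and it goes inside your logging{} statement
alongside the categories above):

  channel query-errors.log {
      file "query-errors.log" versions 5 size 20m;
      severity dynamic;
      print-time yes;
      print-category yes;
      print-severity yes;
  };
  category query-errors { query-errors.log; };

With severity dynamic the channel follows the server's debug level, so
"rndc trace 2" turns the detail up when the problem starts and "rndc notrace"
turns it back off again.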