The problem reported below turns out to have been resolved by this change:

2797.   [bug]           Don't decrement the dispatch manager's maxbuffers.
                        [RT #20613]
When randomized query ports were implemented, the increase in the number
of concurrently used sockets brought an equivalent increase in demand for
another resource: the dispatch manager's buffer pool. The pool was of
course enlarged too, but an oversight meant that it could be reduced again
in some circumstances. The reason that rndc reconfig buys temporary relief
is that it runs through the configuration file again and reapplies the
initial large pool size.

The fix is currently available in 9.7.0rc1 and 9.6.2b1, and will be
included in the upcoming BIND Extended Support Versions (ESVs).

Imri Zvik wrote:
> Hi,
>
> We've recently upgraded our caching servers to 9.4.3-P4/P3 (2 of them running
> 9.4.3-P4 and 2 running 9.4.3-P3). Few days ago I've noticed something
> strange - When the server is loaded, some queries randomly fails (SERVFAIL).
> It seems that only queries for which the answer is NOT cached are affected.
> I've verified with host/dig and tcpdump that there is no network issue (no
> unanswered packets). Digging deeper into the issue, I've found that the issue
> appears when the number of sockets used by named approach 1024~ (checked with
> netstat/lsof). The weirdest part, is that if I run "rndc reconfig", suddenly
> named is able to use more than 1024 sockets (I've seen it using 4000-5000~
> sockets), and the problem goes away for about an hour.
>
> If I downgrade to 3.4.2-P2 the problems goes away.
>
> I used the following command to reproduce the problem:
> for i in {1..100000}; do dig mx www.cnn.com @localhost |grep status |grep -v
> NOERROR; done
>
> My servers are running RHEL 5.4 (2.6.18-164.9.1.el5) and FreeBSD 7.0 (the
> problem is seen on both), and they are splitted into two, unrelated,
> networks, and on two separate physical locations.
>
> I've compiled bind from the vanilla ISC sources using the following configure
> command:
>
> ./configure --enable-threads --enable-largefile --prefix=/usr/local
>
> I've also tried the following (I've also raised the OS limits, of course):
> STD_CDEFINES="-DISC_SOCKET_FDSETSIZE=1048576" ./configure --enable-threads
> --enable-largefile --prefix=/usr/local
>
> As I was seeing the "general: error: socket: file descriptor exceeds limit
> (4096/4096)" error a couple of days ago.
>
> My best guess is that the problem is related to the recent move to epoll...
>
> Any ideas on how I should proceed from here?
> _______________________________________________
> bind-users mailing list
> bind-users@lists.isc.org
> https://lists.isc.org/mailman/listinfo/bind-users
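
For anyone who wants to check whether an upgraded server still shows the
symptom, the reproduction loop quoted above could be adapted to count
response codes rather than print each failing line. What follows is only a
rough sketch: the function names tally_status and probe are made up for
illustration, and the parsing assumes dig's usual header line of the form
";; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 12345".

```shell
#!/bin/sh
# Sketch only: tally the response codes that dig reports, so that
# intermittent SERVFAILs show up as counts instead of scrolling past.

# Read dig output on stdin and print one "STATUS COUNT" line per
# response code, sorted by status name. Lines that carry no
# "status:" field are ignored.
tally_status() {
    sed -n 's/.*status: \([A-Z0-9]*\),.*/\1/p' |
        sort | uniq -c | awk '{ print $2, $1 }'
}

# Hypothetical helper: send N MX queries for NAME to SERVER,
# one after another, writing the raw dig output to stdout.
probe() {
    n=$1
    name=$2
    server=$3
    i=0
    while [ "$i" -lt "$n" ]; do
        dig mx "$name" "@$server"
        i=$((i + 1))
    done
}

# Usage (nothing is run automatically):
#   probe 1000 www.cnn.com localhost | tally_status
```

If anything other than NOERROR appears in the tally while named is under
load, the server may still be running a version without change 2797.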