Hi,

I've made some performance adjustments, although I really don't know whether they're correct, and they don't seem to have solved the problem. I've also noticed that the SERVFAIL errors seem to happen in bursts: they occur for a while and then stop. They definitely seem to occur more often during peak mail volume (this is a mail server).
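To check whether the failures really do come in bursts rather than at a steady rate, the query-errors lines could be bucketed by minute. This is only a rough sketch: it assumes each log entry sits on one line and starts with a timestamp in the same form as the SERVFAIL lines quoted further down in this thread, and the function name servfail_per_minute is just something made up for illustration.

```python
import re
from collections import Counter

# Assumed log line shape (taken from the SERVFAIL lines quoted in this thread):
#   26-Jul-2018 14:48:22.975 query-errors: ... query failed (SERVFAIL) for ...
# The capture group keeps the timestamp down to the minute.
LINE_RE = re.compile(
    r"^(\d{2}-[A-Za-z]{3}-\d{4} \d{2}:\d{2}):\d{2}\.\d{3} "
    r".*query failed \(SERVFAIL\)"
)


def servfail_per_minute(lines):
    """Count SERVFAIL query-errors entries per minute.

    `lines` is any iterable of log lines (for example an open file).
    Lines that do not look like SERVFAIL entries are ignored.
    """
    counts = Counter()
    for line in lines:
        match = LINE_RE.match(line)
        if match:
            counts[match.group(1)] += 1
    return counts
```

Passing an open log file to servfail_per_minute() and then calling most_common(10) on the result would list the ten worst minutes, which could be lined up against mail volume or link utilization for the same period.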
max-clients-per-query 4000;
clients-per-query 4000;
recursive-clients 4000;
tcp-clients 4000;

Here's the named_stats.txt file from "rndc stats":

+++ Statistics Dump +++ (1532630822)
++ Incoming Requests ++
3267 QUERY
++ Incoming Queries ++
2345 A
74 NS
69 PTR
152 MX
569 TXT
58 AAAA
++ Outgoing Rcodes ++
1356 NOERROR
648 SERVFAIL
1070 NXDOMAIN
++ Outgoing Queries ++
[View: default]
8749 A
139 NS
133 PTR
30 MX
640 TXT
6 AAAA
488 DS
87 DNSKEY
[View: _bind]
++ Name Server Statistics ++
3267 IPv4 requests received
2026 requests with EDNS(0) received
6 TCP requests received
3074 responses sent
6 truncated responses sent
1883 responses with EDNS(0) sent
1134 queries resulted in successful answer
2426 queries resulted in non authoritative answer
222 queries resulted in nxrrset
648 queries resulted in SERVFAIL
1070 queries resulted in NXDOMAIN
2190 queries caused recursion
33 duplicate queries received
4 queries dropped
156 recursing clients
3249 UDP queries received
6 TCP queries received
++ Zone Maintenance Statistics ++
++ Resolver Statistics ++
[Common]
143 UDP queries in progress
[View: default]
10272 IPv4 queries sent
2503 IPv4 responses received
611 NXDOMAIN received
1 SERVFAIL received
16 FORMERR received
14 EDNS(0) query failures
448 truncated responses received
7865 query retries
7674 query timeouts
380 IPv4 NS address fetches
33 IPv4 NS address fetch failed
1129 DNSSEC validation attempted
348 DNSSEC validation succeeded
741 DNSSEC NX validation succeeded
1 DNSSEC validation failed
78 queries with RTT < 10ms
1394 queries with RTT 10-100ms
981 queries with RTT 100-500ms
6 queries with RTT 500-800ms
1 queries with RTT 800-1600ms
150 active fetches
523 bucket size
3 REFUSED received
6146 COOKIE send with client cookie only
393 COOKIE sent with client and server cookie
291 COOKIE replies received
291 COOKIE client ok
[View: _bind]
523 bucket size
++ Cache Statistics ++
[View: default]
22101 cache hits
13 cache misses
5896 cache hits (from query)
3416 cache misses (from query)
0 cache records deleted due to memory exhaustion
0 cache records deleted due to TTL expiration
2096 cache database nodes
1039 cache database hash buckets
1352276 cache tree memory total
1022492 cache tree memory in use
1022548 cache tree highest memory in use
393216 cache heap memory total
132096 cache heap memory in use
132096 cache heap highest memory in use
[View: _bind (Cache: _bind)]
0 cache hits
0 cache misses
0 cache hits (from query)
0 cache misses (from query)
0 cache records deleted due to memory exhaustion
0 cache records deleted due to TTL expiration
0 cache database nodes
64 cache database hash buckets
287792 cache tree memory total
29952 cache tree memory in use
29952 cache tree highest memory in use
262144 cache heap memory total
1024 cache heap memory in use
1024 cache heap highest memory in use
++ Cache DB RRsets ++
[View: default]
963 A
299 NS
14 CNAME
23 PTR
19 MX
47 TXT
400 AAAA
57 DS
193 RRSIG
33 NSEC
34 DNSKEY
3 !A
2 !NS
1 !MX
19 !TXT
1 !AAAA
122 !DS
557 NXDOMAIN
1 #RRSIG
1 #NSEC
[View: _bind (Cache: _bind)]
++ ADB stats ++
[View: default]
1021 Address hash table size
916 Addresses in hash table
1021 Name hash table size
1035 Names in hash table
[View: _bind]
1021 Address hash table size
1021 Name hash table size
++ Socket I/O Statistics ++
9861 UDP/IPv4 sockets opened
450 TCP/IPv4 sockets opened
1 Raw sockets opened
9711 UDP/IPv4 sockets closed
454 TCP/IPv4 sockets closed
30 UDP/IPv4 socket bind failures
9824 UDP/IPv4 connections established
446 TCP/IPv4 connections established
7 TCP/IPv4 connections accepted
43 UDP/IPv4 recv errors
150 UDP/IPv4 sockets active
3 TCP/IPv4 sockets active
1 Raw sockets active
++ Per Zone Query Statistics ++
--- Statistics Dump --- (1532630822)
+++ Statistics Dump +++ (1532634389)
++ Incoming Requests ++
26879 QUERY
++ Incoming Queries ++
18386 A
642 NS
351 PTR
1186 MX
5626 TXT
688 AAAA
++ Outgoing Rcodes ++
12312 NOERROR
3066 SERVFAIL
11270 NXDOMAIN
++ Outgoing Queries ++
[View: default]
57901 A
1761 NS
566 PTR
555 MX
4177 TXT
87 AAAA
2 DNSKEY
[View: _bind]
++ Name Server Statistics ++
26879 IPv4 requests received
16404 requests with EDNS(0) received
168 TCP requests received
26648 responses sent
168 truncated responses sent
16357 responses with EDNS(0) sent
10556 queries resulted in successful answer
23582 queries resulted in non authoritative answer
1756 queries resulted in nxrrset
3066 queries resulted in SERVFAIL
11270 queries resulted in NXDOMAIN
14505 queries caused recursion
231 duplicate queries received
26693 UDP queries received
168 TCP queries received
2 COOKIE option received
2 COOKIE - client only
++ Zone Maintenance Statistics ++
++ Resolver Statistics ++
[Common]
[View: default]
65049 IPv4 queries sent
12813 IPv4 responses received
7832 NXDOMAIN received
5 SERVFAIL received
32 FORMERR received
26 EDNS(0) query failures
530 truncated responses received
4 lame delegations received
50747 query retries
52327 query timeouts
1038 IPv4 NS address fetches
205 IPv4 NS address fetch failed
706 queries with RTT < 10ms
7423 queries with RTT 10-100ms
4076 queries with RTT 100-500ms
342 queries with RTT 500-800ms
39 queries with RTT 800-1600ms
9 queries with RTT > 1600ms
523 bucket size
6 REFUSED received
20513 COOKIE send with client cookie only
1485 COOKIE sent with client and server cookie
921 COOKIE replies received
921 COOKIE client ok
[View: _bind]
523 bucket size
++ Cache Statistics ++
[View: default]
158038 cache hits
13 cache misses
62750 cache hits (from query)
19356 cache misses (from query)
0 cache records deleted due to memory exhaustion
126 cache records deleted due to TTL expiration
12112 cache database nodes
4159 cache database hash buckets
4822015 cache tree memory total
4393804 cache tree memory in use
4394140 cache tree highest memory in use
393216 cache heap memory total
132096 cache heap memory in use
132096 cache heap highest memory in use
[View: _bind (Cache: _bind)]
0 cache hits
0 cache misses
0 cache hits (from query)
0 cache misses (from query)
0 cache records deleted due to memory exhaustion
0 cache records deleted due to TTL expiration
0 cache database nodes
64 cache database hash buckets
293568 cache tree memory total
29952 cache tree memory in use
35728 cache tree highest memory in use
262144 cache heap memory total
1024 cache heap memory in use
1024 cache heap highest memory in use
++ Cache DB RRsets ++
[View: default]
3060 A
863 NS
302 CNAME
81 PTR
77 MX
186 TXT
1152 AAAA
85 DS
259 RRSIG
80 NSEC
1 DNSKEY
28 !A
27 !NS
2 !MX
94 !TXT
5 !AAAA
6192 NXDOMAIN
[View: _bind (Cache: _bind)]
++ ADB stats ++
[View: default]
1021 Address hash table size
2125 Addresses in hash table
1021 Name hash table size
1427 Names in hash table
[View: _bind]
1021 Address hash table size
1021 Name hash table size
++ Socket I/O Statistics ++
64830 UDP/IPv4 sockets opened
532 TCP/IPv4 sockets opened
1 Raw sockets opened
64823 UDP/IPv4 sockets closed
726 TCP/IPv4 sockets closed
304 UDP/IPv4 socket bind failures
64519 UDP/IPv4 connections established
519 TCP/IPv4 connections established
197 TCP/IPv4 connections accepted
218 UDP/IPv4 recv errors
7 UDP/IPv4 sockets active
3 TCP/IPv4 sockets active
1 Raw sockets active
++ Per Zone Query Statistics ++
--- Statistics Dump --- (1532634389)

On Thu, Jul 26, 2018 at 2:51 PM, Alex <mysqlstud...@gmail.com> wrote:
> Hi,
>
> On Thu, Jul 26, 2018 at 1:57 PM, John Miller <johnm...@brandeis.edu> wrote:
>> Hi Alex,
>>
>> What does your query volume look like on this server? Depending on
>> volume, the BIND defaults for:
>>
>> - clients-per-query
>> - max-clients-per-query
>> - recursive-clients
>> - tcp-clients
>>
>> and others may not be set high enough. Check pp. 106-108 in the
>> latest 9.11 manual for more details on each of these.
>>
>> Of course, if you're only seeing SERVFAIL for a handful of domains,
>> then they may have some sort of delegation issue, or there might be a
>> network issue between your caching servers and them.
>
> I think it's happening more frequently than for just a remote
> misconfigured system. Here is my rndc status, but it doesn't appear to
> provide all values you've requested.
>
> It's also occurring for queries to trustworthy remote sources:
> 26-Jul-2018 14:48:22.975 query-errors: debug 1: client @0x7fddb400c570
> 127.0.0.1#56094 (mail-dm3nam03on0041.outbound.protection.outlook.com):
> query failed (SERVFAIL) for
> mail-dm3nam03on0041.outbound.protection.outlook.com/IN/A at
> ../../../bin/named/query.c:8580
>
> # rndc status
> version: BIND 9.11.4-RedHat-9.11.4-1.fc28 (Extended Support Version)
> <id:2fe4344>
> running on bwimail03.guardiandigital.com: Linux x86_64
> 4.17.7-200.fc28.x86_64 #1 SMP Tue Jul 17 16:28:31 UTC 2018
> boot time: Thu, 26 Jul 2018 18:47:52 GMT
> last configured: Thu, 26 Jul 2018 18:47:52 GMT
> configuration file: /etc/named.conf (/var/named/chroot/etc/named.conf)
> CPUs found: 8
> worker threads: 8
> UDP listeners per interface: 7
> number of zones: 103 (97 automatic)
> debug level: 0
> xfers running: 0
> xfers deferred: 0
> soa queries in progress: 0
> query logging is OFF
> recursive clients: 63/900/1000
> tcp clients: 0/150
> server is up and running
>
> I've also now confirmed it's happening at times of regular network
> activity. I'm really stuck. I hope someone can help.
>
> Thanks,
> Alex
>
>
>>
>> John
>>
>>
>> On Thu, Jul 26, 2018 at 1:07 PM, Alex <mysqlstud...@gmail.com> wrote:
>>> Hi,
>>>
>>> I have a bind-9.11.4 server on a fedora28 system and are frequently
>>> seeing SERVFAIL errors like this:
>>>
>>> 26-Jul-2018 12:54:04.255 query-errors: info: client @0x7f764314a5c0
>>> 127.0.0.1#50719 (223.178.102.199.cidr.bl.mcafee.com): query failed
>>> (SERVFAIL) for 223.178.102.199.cidr.bl.mcafee.com/IN/A at
>>> ../../../bin/named/query.c:4140
>>>
>>> I believe this happens more frequently at times of peak link
>>> utilization, but it also appears to happen during normal times.
>>>
>>> This is a local caching server I've set up but it also appears to
>>> exist on other systems that have been set up to be authoritative for
>>> our domain.
>>>
>>> How can I troubleshoot this further?
>>>
>>> Here is the named.conf for this caching server:
>>>
>>> acl "trusted" {
>>>     { 127/8; };
>>>     { 68.195.191.40/29; };
>>>     { 192.168.1.0/24; };
>>>     { 107.155.67.2/32; };
>>> };
>>>
>>> options {
>>>     listen-on port 53 { 127.0.0.1; 68.195.191.45; };
>>>     listen-on-v6 port 53 { none; };
>>>     directory "/var/named";
>>>     dump-file "/var/named/data/cache_dump.db";
>>>     statistics-file "/var/named/data/named.stats"; // _PATH_STATS
>>>     memstatistics-file "/var/named/data/named.memstats"; // _PATH_MEMSTATS
>>>     allow-query { trusted; };
>>>     recursion yes;
>>>     zone-statistics yes;
>>>
>>>     // dnssec-enable yes;
>>>     // dnssec-validation yes;
>>>     // dnssec-lookaside auto;
>>>
>>>     dnssec-enable no;
>>>     dnssec-validation no;
>>>     dnssec-lookaside no;
>>>
>>>     /* Path to ISC DLV key */
>>>     bindkeys-file "/etc/named.iscdlv.key";
>>>
>>>     managed-keys-directory "/var/named/dynamic";
>>>
>>> };
>>>
>>> logging {
>>>     channel default_debug {
>>>         file "data/named.run";
>>>         severity dynamic;
>>>     };
>>>
>>>     // Record all queries to the box for now
>>>     channel query_info {
>>>         severity info;
>>>         file "/var/log/named.query.log" versions 3 size 10m;
>>>         print-time yes;
>>>         print-category yes;
>>>     };
>>>
>>>     // added for fail2ban support
>>>     channel security_file {
>>>         severity dynamic;
>>>         file "/var/log/named.security.log" versions 3 size 30m;
>>>         print-time yes;
>>>         print-category yes;
>>>     };
>>>
>>>     channel b_debug {
>>>         file "/var/log/named.debug.log" versions 2 size 10m;
>>>         print-time yes;
>>>         print-category yes;
>>>         print-severity yes;
>>>         severity dynamic;
>>>     };
>>>
>>>     // Send the security related messages to a separate file.
>>>     channel audit_log {
>>>         file "/var/log/named.audit.log" versions 4 size 10m;
>>>         severity info;
>>>         print-time yes;
>>>         print-category yes;
>>>     };
>>>
>>>
>>>     category queries { query_info; };
>>>     category default { b_debug; };
>>>     category config { b_debug; };
>>>     category security { security_file; };
>>>     // category lame-servers { audit_log; };
>>>     category lame-servers { null; };
>>>
>>> };
>>>
>>> zone "." IN {
>>>     type hint;
>>>     file "/var/named/named.ca";
>>> };
>>>
>>> zone "localhost.localdomain" IN {
>>>     type master;
>>>     file "named.localhost";
>>>     allow-update { none; };
>>> };
>>>
>>> zone "localhost" IN {
>>>     type master;
>>>     file "named.localhost";
>>>     allow-update { none; };
>>> };
>>>
>>> zone
>>> "1.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.ip6.arpa"
>>> IN {
>>>     type master;
>>>     file "named.loopback";
>>>     allow-update { none; };
>>> };
>>>
>>> zone "1.0.0.127.in-addr.arpa" IN {
>>>     type master;
>>>     file "named.loopback";
>>>     allow-update { none; };
>>> };
>>>
>>> zone "0.in-addr.arpa" IN {
>>>     type master;
>>>     file "named.empty";
>>>     allow-update { none; };
>>> };
>>>
>>> include "/etc/named.root.key";
>>> include "/etc/rndc.key";
>>> _______________________________________________
>> _______________________________________________
>> Please visit https://lists.isc.org/mailman/listinfo/bind-users to
>> unsubscribe from this list
>>
>> bind-users mailing list
>> bind-users@lists.isc.org
>> https://lists.isc.org/mailman/listinfo/bind-users