Otto,

It took me a while to come back to this, but I made the changes you suggested shortly after your last reply.

- I reverted max-negative-ttl to the default. Performance seems markedly improved.
- I removed the Lua script, so no drops will occur, and many server clients seem much happier.
- I've begun collecting the metrics available through the API and graphing them to watch for trends.
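
A minimal sketch of what that metrics collection could look like, assuming the webservice settings quoted further down (port 8080 with an API key) and the recursor's REST statistics endpoint at /api/v1/servers/localhost/statistics. The host, the key placeholder and the helper names (fetch_statistics, simple_counters, hit_ratio_perc) are illustrative assumptions, not taken from the actual setup:

```python
import json
import urllib.request

# Assumed values for illustration only; substitute your own host and key.
API_URL = "http://127.0.0.1:8080/api/v1/servers/localhost/statistics"
API_KEY = "changeme"  # hypothetical placeholder for webservice.api_key


def fetch_statistics(url=API_URL, api_key=API_KEY, timeout=5):
    """Fetch the raw statistics list from the recursor's REST API."""
    req = urllib.request.Request(url, headers={"X-API-Key": api_key})
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return json.load(resp)


def simple_counters(stats):
    """Turn the list of {"name", "type", "value"} items into a dict of ints.

    Only plain StatisticItem entries with integer values are kept;
    map/ring items and non-integer values are skipped.
    """
    counters = {}
    for item in stats:
        if item.get("type") != "StatisticItem":
            continue
        try:
            counters[item["name"]] = int(item["value"])
        except (KeyError, TypeError, ValueError):
            continue  # malformed or non-integer entry; ignore it
    return counters


def hit_ratio_perc(counters, hits_key, misses_key):
    """Hit ratio in percent, or None when there has been no traffic yet."""
    hits = counters.get(hits_key, 0)
    misses = counters.get(misses_key, 0)
    total = hits + misses
    return None if total == 0 else 100.0 * hits / total
```

Calling hit_ratio_perc(simple_counters(fetch_statistics()), "cache-hits", "cache-misses") at a fixed interval (and likewise with "packetcache-hits" and "packetcache-misses") would give a series to graph. The counters are cumulative, so differences between samples are likely more useful than the absolute values.
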

I mostly just wanted to say thank you for the support. I will start a new thread should I need further assistance in the future.

Sincerely,
Scotsie

On Sat, Apr 19, 2025 at 3:29 AM Otto Moerbeek <o...@drijf.net> wrote:

> Remarks inline.
>
> On Fri, Apr 18, 2025 at 07:04:18PM -0400, Scott Crace wrote:
>
> > Otto,
> > Thanks for your assistance. Since these were set up with private IPs I wasn't
> > sure how useful the config would be; however, I have included it below.
> >
> > # rec_control dump-throttlemap -
> > ; throttle map dump follows
> > ; remote IP    qname                qtype  count  ttd                  reason
> > 10.0.196.197   0.10.in-addr.arpa    A      2      2025-04-18T18:44:22  RCodeRefused
> > 10.0.196.197   10.10.in-addr.arpa   A      3      2025-04-18T18:44:25  RCodeRefused
> > 10.0.196.197   255.10.in-addr.arpa  A      1      2025-04-18T18:44:23  RCodeRefused
> > 10.0.62.244    0.10.in-addr.arpa    A      2      2025-04-18T18:44:22  RCodeRefused
> > 10.0.62.244    10.10.in-addr.arpa   A      3      2025-04-18T18:44:25  RCodeRefused
> > 10.0.62.244    255.10.in-addr.arpa  A      2      2025-04-18T18:44:23  RCodeRefused
> > dump-throttlemap: dumped 6 records
>
> Looking at your config below, you are forwarding to servers that do not
> want to answer those queries. Make sure you either do not forward or
> change the auths to respond properly. "Refused" means the auth does
> not have the particular zone. An auth responding Refused on a lot of
> queries will be throttled for those specific queries.
>
> > # rec_control dump-failedservers -
> > I removed any entries with a count of 1 or 2 for brevity since this email
> > is already a long read.
> > ; failed servers dump follows
> > ; remote IP     count  timestamp
> > 203.119.25.5    8      2025-04-18T18:43:44
> > 203.119.26.5    8      2025-04-18T18:43:42
> > 203.119.27.5    8      2025-04-18T18:43:41
> > 203.119.28.5    8      2025-04-18T18:43:39
> > 203.119.29.5    8      2025-04-18T18:43:45
> > 200.189.41.10   7      2025-04-18T18:42:46
> > 200.219.148.10  6      2025-04-18T18:39:47
> > 200.219.154.10  6      2025-04-18T18:42:43
> > 200.219.159.10  7      2025-04-18T18:42:45
> > 200.192.233.10  7      2025-04-18T18:42:40
> > 200.229.248.10  4      2025-04-18T18:42:42
> > 203.119.95.53   3      2025-04-18T18:39:30
> > 203.119.86.101  1229   2025-04-18T18:40:03
> > 35.173.255.124  4895   2025-04-18T18:36:21
> > dump-failedservers: dumped 43 records
>
> Depending on how long your recursor has been running, some of these counts
> are pretty high. This *might* indicate connectivity issues, but there is no
> definite conclusion; some network troubleshooting might be in order,
> especially as 203.119.86.101 is ns3.apnic.net, which *should* be a
> server that's reachable and responding properly. 35.173.255.124 looks
> like a random AWS IP.
>
> > Config(s)
> >
> > Please note that one of the forwarded zones is 'split-brained' from a
> > legacy setup. The zone consists of a private Active Directory environment
> > and a separately maintained public zone. The configuration forwards to the
> > private AD servers and I believe the Lua script drops queries that have no
> > match in that zone. The public zone is being slowly phased out.
> >
> > While reviewing the previous server configs I found a comment
> > about this value but no context for the specific reasoning. This may
> > explain the values you noted, but I would like to understand the
> > implications of removing it. It doesn't seem like something that should
> > have been enabled.
> > # https://github.com/PowerDNS/pdns/issues/6186
> > max-negative-ttl=0
>
> That is indeed potentially killing performance. Better leave it at the
> default, unless you have very specific reasons to change it.
> In practice any DNS server spends quite a lot of its time answering
> negatively. Not caching negative answers will cause quite a lot of work,
> since the recursor will need to contact auths for each client query
> that will lead to a negative answer again and again.
>
> A common cause to dislike negative caching is (for a name in a locally
> managed zone):
>
> 1. Query rec for a name and see that it does not exist (NODATA answer)
> 2. Modify the auth zone so the name exists
> 3. Query again and see that it still does not exist because of negative
>    caching in rec.
>
> The answer to this is not to "disable negative caching". The proper
> answer is: avoid the initial query, have some patience, or flush the
> rec cache for that name by using rec_control or sending rec a notify
> (notifying rec is a relatively new feature, and needs to be set up to
> allow it, see
> https://docs.powerdns.com/recursor/yamlsettings.html#incoming-allow-notify-from
> ).
>
> > /etc/pdns-recursor/recursor.conf
> >
> > ---
> > dnssec:
> >   validation: validate
> > incoming:
> >   allow_from:
> >     - 127.0.0.1/8
> >     - 10.0.0.0/8
> >     - 172.16.0.0/12
> >     - 192.168.0.0/16
> >     - 'fd00::/8'
> >     - '2607:B600::/32'
> >   listen:
> >     - 0.0.0.0
> >   max_tcp_clients: 128
> >   max_tcp_per_client: 0
> >   max_tcp_queries_per_connection: 0
> >   port: 53
> >   tcp_timeout: 2
> > outgoing:
> >   dont_query: []
> >   max_qperq: 50
> >   network_timeout: 1500
> > packetcache:
> >   max_entries: 1000000
> > recordcache:
> >   max_entries: 1000000
> >   max_negative_ttl: 0
> >   max_ttl: 86400
> > recursor:
> >   daemon: false
> >   forward_zones:
> >     - zone: momentumbusiness.com
> >       recurse: false
> >       forwarders:
> >         - 10.255.255.76
> >         - 10.1.3.228
> >     - zone: 10.in-addr.arpa
> >       recurse: false
> >       forwarders:
> >         - 10.0.196.197
> >         - 10.0.62.244
> >     - zone: 168.192.in-addr.arpa
> >       recurse: false
> >       forwarders:
> >         - 10.0.196.197
> >         - 10.0.62.244
> >     - zone: 16.172.in-addr.arpa
> >       recurse: false
> >       forwarders:
> >         - 10.0.196.197
> >         - 10.0.62.244
> >   lua_dns_script: /etc/pdns-recursor/momentumbusiness_com.lua
> >   max_recursion_depth: 40
> >   max_total_msec: 7000
> >   minimum_ttl_override: 1
> >   server_id: nsres01.momentumtelecom.com
> >   setgid: pdns-recursor
> >   setuid: pdns-recursor
> > webservice:
> >   address: 0.0.0.0
> >   allow_from:
> >     - 192.168.9.164
> >     - 192.168.21.134
> >     - 192.168.20.0/24
> >   api_key: <sanitized>
> >   port: 8080
> >   webserver: true
> > logging:
> >   loglevel: 3
> > ...
> >
> > /etc/pdns-recursor/momentumbusiness_com.lua
> > pdnslog("Lua NXDomain filter for momentumbusiness.com loading...", pdns.loglevels.Notice)
> > nxdomainsuffix=newDN("momentumbusiness.com")
> > function nxdomain(dq)
> >   if dq.qname:isPartOf(nxdomainsuffix)
> >   then
> >     dq.appliedPolicy.policyKind = pdns.policykinds.Drop
> >     return true
> >   end
> >   return false
> > end
>
> I do wonder what the purpose of this special nxdomain handling is. A
> drop is not nice to clients, as the query will time out from their
> perspective. Maybe pdns.policykinds.NODATA or just leaving the special
> handling out?
>
> > On Fri, Apr 18, 2025 at 9:39 AM Otto Moerbeek <o...@drijf.net> wrote:
> >
> > > On Fri, Apr 18, 2025 at 08:28:48AM -0400, Scott Crace via Pdns-users wrote:
> > >
> > > Hi,
> > >
> > > Please include your config. That said:
> > >
> > > You seem to have a pretty low cache hit ratio and a high number of
> > > outgoing queries. How is your cache configured?
> > >
> > > Also some throttling is going on. I suspect rec has trouble contacting
> > > one or more auths or forwarders.
> > > The throttling tables can be viewed using
> > >
> > > rec_control dump-throttlemap -
> > > rec_control dump-failedservers -
> > >
> > > Also, what happens *during* the trace can be very relevant. If one
> > > auth (or forwarder) does not respond, rec will turn to another one,
> > > but only after the timeout of 1500ms by default.
> > >
> > > -Otto
> > >
> > > > Hello all,
> > > > Long time lurker on the mailing list and would like some performance
> > > > and/or tuning advice.
> > > > We've been using pdns-recursor as internal recursive nameservers for
> > > > quite some time now.
> > > > The original implementer of pdns departed and I was recently tasked with
> > > > replacing or upgrading all of the servers with newer RHEL9 versions. I
> > > > opted to build fresh and migrate the configuration to the latest 5.2
> > > > release.
> > > >
> > > > I'm hearing occasional complaints about odd issues and/or clients cycling
> > > > through their DNS servers rapidly (timeouts?). Manual DNS testing works,
> > > > but I am still reading through the metrics and performance documentation.
> > > > I am hoping someone with a more experienced eye could take a look at a
> > > > sampling of the periodic statistics report (below) and provide some
> > > > insight or prioritization on any urgent issues I should focus on
> > > > studying first.
> > > >
> > > > My observations:
> > > > * I do note that the performance documentation talks about the impact of
> > > > firewalld/stateful firewalls, but the legacy servers were using the
> > > > same basic setup. If the firewall is the problem, is there a way to
> > > > validate this (other than stopping firewalld and waiting)?
> > > > * The "worker" threads seem evenly distributed to my novice eye and our
> > > > qps (queries per second) rate is low, as I would expect since the name
> > > > servers are internal-only resources.
> > > > * I ran a few pcaps and rec_control trace-regex for specific domain items
> > > > being reported as problematic. Everything seemed to be working with the
> > > > trace-regex always showing "Step3 Final resolve: No Error/6 or 8".
> > > >
> > > > Thank you in advance for your time and consideration.
> > > >
> > > > Sincerely,
> > > > Scotsie
> > > >
> > > > ```
> > > > Apr 17 16:07:28 nsrecdns01-1 pdns-recursor[1092]: msg="Periodic statistics report" subsystem="stats" level="0" prio="Info" tid="0" ts="1744920448.170" cache-entries="23666" negcache-entries="497" questions="6831695" record-cache-acquired="286931329" record-cache-contended="64414" record-cache-contended-perc="0.02" record-cache-hitratio-perc="0.87"
> > > > Apr 17 16:07:28 nsrecdns01-1 pdns-recursor[1092]: msg="Periodic statistics report" subsystem="stats" level="0" prio="Info" tid="0" ts="1744920448.170" packetcache-acquired="16887684" packetcache-contended="1019" packetcache-contended-perc="0.01" packetcache-entries="7112" packetcache-hitratio-perc="37.75"
> > > > Apr 17 16:07:28 nsrecdns01-1 pdns-recursor[1092]: msg="Periodic statistics report" subsystem="stats" level="0" prio="Info" tid="0" ts="1744920448.170" edns-entries="38" failed-host-entries="50" non-resolving-nameserver-entries="0" nsspeed-entries="968" saved-parent-ns-sets-entries="65" throttle-entries="8"
> > > > Apr 17 16:07:28 nsrecdns01-1 pdns-recursor[1092]: msg="Periodic statistics report" subsystem="stats" level="0" prio="Info" tid="0" ts="1744920448.170" concurrent-queries="1" dot-outqueries="0" idle-tcpout-connections="0" outgoing-timeouts="36594" outqueries="14668546" outqueries-per-query-perc="214.71" tcp-outqueries="3131" throttled-queries-perc="1.90"
> > > > Apr 17 16:07:28 nsrecdns01-1 pdns-recursor[1092]: msg="Periodic statistics report" subsystem="stats" level="0" prio="Info" tid="0" ts="1744920448.170" taskqueue-expired="0" taskqueue-pushed="540" taskqueue-size="0"
> > > > Apr 17 16:07:28 nsrecdns01-1 pdns-recursor[1092]: msg="Queries handled by thread" subsystem="stats" level="0" prio="Info" tid="0" ts="1744920448.170" count="3470098" thread="0" tname="worker"
> > > > Apr 17 16:07:28 nsrecdns01-1 pdns-recursor[1092]: msg="Queries handled by thread" subsystem="stats" level="0" prio="Info" tid="0" ts="1744920448.170" count="3360836" thread="1" tname="worker"
> > > > Apr 17 16:07:28 nsrecdns01-1 pdns-recursor[1092]: msg="Queries handled by thread" subsystem="stats" level="0" prio="Info" tid="0" ts="1744920448.171" count="764" thread="2" tname="tcpworker"
> > > > Apr 17 16:07:28 nsrecdns01-1 pdns-recursor[1092]: msg="Periodic QPS report" subsystem="stats" level="0" prio="Info" tid="0" ts="1744920448.171" averagedOver="1800" qps="117"
> > > > ```
> > > >
> > > > _______________________________________________
> > > > Pdns-users mailing list
> > > > Pdns-users@mailman.powerdns.com
> > > > https://mailman.powerdns.com/mailman/listinfo/pdns-users