Otto,
   It took me a while to come back to this, but I made the changes per your
suggestions shortly after your last reply.
- I reverted max-negative-ttl to the default. Performance seems markedly
improved.
- I removed the Lua script so no drops occur, and many clients seem much
happier.
- I've begun collecting the metrics available via the API and graphing them
to watch for trends.

I mostly just wanted to say thank you for the support and I will start a
new thread should I need further assistance in the future.

Sincerely,
Scotsie


On Sat, Apr 19, 2025 at 3:29 AM Otto Moerbeek <o...@drijf.net> wrote:

> Remarks inline.
>
> On Fri, Apr 18, 2025 at 07:04:18PM -0400, Scott Crace wrote:
>
> > Otto,
> > Thanks for your assistance. Since these were set up with private IPs I
> > wasn't sure how useful the config would be; however, I have included it
> > below.
> >
> > # rec_control dump-throttlemap -
> > ; throttle map dump follows
> > ; remote IP     qname   qtype   count   ttd     reason
> > 10.0.196.197    0.10.in-addr.arpa       A       2       2025-04-18T18:44:22     RCodeRefused
> > 10.0.196.197    10.10.in-addr.arpa      A       3       2025-04-18T18:44:25     RCodeRefused
> > 10.0.196.197    255.10.in-addr.arpa     A       1       2025-04-18T18:44:23     RCodeRefused
> > 10.0.62.244     0.10.in-addr.arpa       A       2       2025-04-18T18:44:22     RCodeRefused
> > 10.0.62.244     10.10.in-addr.arpa      A       3       2025-04-18T18:44:25     RCodeRefused
> > 10.0.62.244     255.10.in-addr.arpa     A       2       2025-04-18T18:44:23     RCodeRefused
> > dump-throttlemap: dumped 6 records
>
> Looking at your config below, you are forwarding to servers that do not
> want to answer those queries.  Make sure you either do not forward, or
> change the auths to respond properly.  "Refused" means the auth does
> not have the particular zone. An auth responding Refused to a lot of
> queries will be throttled for those specific queries.
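>
> One quick way to confirm this from the recursor host is to query a
> forwarder directly (IPs taken from the dump above; the address being
> looked up is only an example, and exact dig output will vary):
>
> ```
> dig @10.0.196.197 -x 10.0.0.10
> # look for "status: REFUSED" in the header section of the reply
> ```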
>
> >
> > # rec_control dump-failedservers -
> > I removed any count 1 or 2 for brevity since this email is already a long
> > read.
> > ; failed servers dump follows
> > ; remote IP     count   timestamp
> > 203.119.25.5    8       2025-04-18T18:43:44
> > 203.119.26.5    8       2025-04-18T18:43:42
> > 203.119.27.5    8       2025-04-18T18:43:41
> > 203.119.28.5    8       2025-04-18T18:43:39
> > 203.119.29.5    8       2025-04-18T18:43:45
> > 200.189.41.10   7       2025-04-18T18:42:46
> > 200.219.148.10  6       2025-04-18T18:39:47
> > 200.219.154.10  6       2025-04-18T18:42:43
> > 200.219.159.10  7       2025-04-18T18:42:45
> > 200.192.233.10  7       2025-04-18T18:42:40
> > 200.229.248.10  4       2025-04-18T18:42:42
> > 203.119.95.53   3       2025-04-18T18:39:30
> > 203.119.86.101  1229    2025-04-18T18:40:03
> > 35.173.255.124  4895    2025-04-18T18:36:21
> > dump-failedservers: dumped 43 records
>
> Depending on how long your recursor has been running, some of these counts
> are pretty high. This *might* indicate connectivity issues, but there is
> no definite conclusion; some network troubleshooting might be in order,
> especially as 203.119.86.101 is ns3.apnic.net, which *should* be a
> server that's reachable and responding properly. 35.173.255.124 looks
> like a random AWS IP.
>
> >
> >
> > Config(s)
> >
> > Please note that one of the zones forwarding is 'split brained' from a
> > legacy setup. The zone consists of a private Active Directory environment
> > and a separately maintained public zone. The configuration forwards to
> > the private AD servers and I believe the Lua script drops queries that
> > have no match in that zone. The public zone is being slowly phased out.
> >
> > While reviewing the previous server configs I found a comment about
> > this value but no context for the specific reasoning. This may
> > explain the values you noted, but I would like to understand the
> > implications of removing it. It doesn't seem like something that should
> > have been enabled.
> > # https://github.com/PowerDNS/pdns/issues/6186
> > max-negative-ttl=0
>
> That is indeed potentially killing performance. Better leave it at the
> default, unless you have very specific reasons to change it.  In
> practice any DNS server spends quite a lot of its time answering
> negatively. Not caching negative answers will cause quite a lot of work,
> since the recursor will need to contact auths for each client query
> that leads to a negative answer, again and again.
>
> A common cause to dislike negative caching is (for a name in a locally
> managed zone):
>
> 1. Query rec for a name and see that it does not exist (NODATA answer)
> 2. Modify the auth zone so the name exists
> 3. Query again and see that it still does not exist because of negative
>    caching in rec.
>
> The answer to this is not to "disable negative caching". The proper
> answer is: avoid the initial query, have some patience, or flush the
> rec cache for that name using rec_control or by sending rec a notify
> (notify to rec is a relatively new feature, and needs to be set up to
> allow it, see
>
> https://docs.powerdns.com/recursor/yamlsettings.html#incoming-allow-notify-from
> ).
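>
> For the flush option, a minimal sketch of the rec_control invocation
> (the names here are only examples):
>
> ```
> # flush a single name from the record and negative caches
> rec_control wipe-cache host.example.com
> # flush an entire subtree (note the trailing '$')
> rec_control wipe-cache example.com$
> ```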
>
> >
> >  /etc/pdns-recursor/recursor.conf
> >
> > ---
> >
> > dnssec:
> >
> >   validation: validate
> >
> > incoming:
> >
> >   allow_from:
> >
> >     - 127.0.0.1/8
> >
> >     - 10.0.0.0/8
> >
> >     - 172.16.0.0/12
> >
> >     - 192.168.0.0/16
> >
> >     - 'fd00::/8'
> >
> >     - '2607:B600::/32'
> >
> >   listen:
> >
> >     - 0.0.0.0
> >
> >   max_tcp_clients: 128
> >
> >   max_tcp_per_client: 0
> >
> >   max_tcp_queries_per_connection: 0
> >
> >   port: 53
> >
> >   tcp_timeout: 2
> >
> > outgoing:
> >
> >   dont_query: []
> >
> >   max_qperq: 50
> >
> >   network_timeout: 1500
> >
> > packetcache:
> >
> >   max_entries: 1000000
> >
> > recordcache:
> >
> >   max_entries: 1000000
> >
> >   max_negative_ttl: 0
> >
> >   max_ttl: 86400
> >
> > recursor:
> >
> >   daemon: false
> >
> >   forward_zones:
> >
> >     - zone: momentumbusiness.com
> >
> >       recurse: false
> >
> >       forwarders:
> >
> >         - 10.255.255.76
> >
> >         - 10.1.3.228
> >
> >     - zone: 10.in-addr.arpa
> >
> >       recurse: false
> >
> >       forwarders:
> >
> >         - 10.0.196.197
> >
> >         - 10.0.62.244
> >
> >     - zone: 168.192.in-addr.arpa
> >
> >       recurse: false
> >
> >       forwarders:
> >
> >         - 10.0.196.197
> >
> >         - 10.0.62.244
> >
> >     - zone: 16.172.in-addr.arpa
> >
> >       recurse: false
> >
> >       forwarders:
> >
> >         - 10.0.196.197
> >
> >         - 10.0.62.244
> >
> >   lua_dns_script: /etc/pdns-recursor/momentumbusiness_com.lua
> >
> >   max_recursion_depth: 40
> >
> >   max_total_msec: 7000
> >
> >   minimum_ttl_override: 1
> >
> >   server_id: nsres01.momentumtelecom.com
> >
> >   setgid: pdns-recursor
> >
> >   setuid: pdns-recursor
> >
> > webservice:
> >
> >   address: 0.0.0.0
> >
> >   allow_from:
> >
> >     - 192.168.9.164
> >
> >     - 192.168.21.134
> >
> >     - 192.168.20.0/24
> >
> >   api_key: <sanitized>
> >
> >   port: 8080
> >
> >   webserver: true
> >
> > logging:
> >
> >   loglevel: 3
> >
> > ...
> >
> > /etc/pdns-recursor/momentumbusiness_com.lua
> > pdnslog("Lua NXDomain filter for momentumbusiness.com loading...",
> > pdns.loglevels.Notice)
> > nxdomainsuffix=newDN("momentumbusiness.com")
> > function nxdomain(dq)
> >     if dq.qname:isPartOf(nxdomainsuffix)
> >     then
> >       dq.appliedPolicy.policyKind = pdns.policykinds.Drop
> >       return true
> >     end
> >       return false
> > end
>
> I do wonder what the purpose of this special nxdomain handling is. A
> drop is not nice to clients, as the query will time out from their
> perspective. Maybe pdns.policykinds.NODATA, or just leave the special
> handling out?
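>
> A sketch of that NODATA alternative, keeping the structure of the
> existing script and changing only the policy kind:
>
> ```lua
> -- Answer NODATA instead of dropping, so clients get a prompt negative
> -- answer rather than a timeout.
> pdnslog("Lua NODATA filter for momentumbusiness.com loading...",
>   pdns.loglevels.Notice)
> nxdomainsuffix = newDN("momentumbusiness.com")
> function nxdomain(dq)
>   if dq.qname:isPartOf(nxdomainsuffix) then
>     dq.appliedPolicy.policyKind = pdns.policykinds.NODATA
>     return true
>   end
>   return false
> end
> ```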
>
> >
> > On Fri, Apr 18, 2025 at 9:39 AM Otto Moerbeek <o...@drijf.net> wrote:
> >
> > > On Fri, Apr 18, 2025 at 08:28:48AM -0400, Scott Crace via Pdns-users wrote:
> > >
> > > Hi,
> > >
> > > Please include your config. That said:
> > >
> > > You seem to have a pretty low cache hit ratio and a high number of
> > > outgoing queries. How is your cache configured?
> > >
> > > Also some throttling is going on. I suspect rec has trouble contacting
> > > one or more auths or forwarders. The throttling tables can be viewed
> > > using
> > >
> > >         rec_control dump-throttlemap -
> > >         rec_control dump-failedservers -
> > >
> > > Also, what happens *during* the trace can be very relevant. If one
> > > auth (or forwarder) does not respond, rec will turn to another one,
> > > but only after the timeout of 1500ms by default.
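> > >
> > > For reference, that timeout is the outgoing network_timeout setting
> > > (in milliseconds); lowering it is a trade-off, and 800 below is only
> > > an illustrative value, not a recommendation:
> > >
> > > ```
> > > outgoing:
> > >   network_timeout: 800
> > > ```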
> > >
> > >         -Otto
> > >
> > > >  Hello all,
> > > >  Long time lurker on the mailing list; I would like some performance
> > > > and/or tuning advice.
> > > > We've been using pdns-recursor as internal recursive nameservers for
> > > > quite some time now.
> > > > The original implementer of pdns departed and I was recently tasked
> > > > with replacing or upgrading all of the servers with newer RHEL9
> > > > versions. I opted to build fresh and migrate the configuration to the
> > > > latest 5.2 release.
> > > >
> > > > I'm hearing occasional complaints about odd issues and/or clients
> > > > cycling through their DNS servers rapidly (timeouts?). Manual DNS
> > > > testing works, and I am reading through the metrics and performance
> > > > documentation. I am hoping someone with a more experienced eye could
> > > > take a look at a sampling of the periodic statistics report (below)
> > > > and provide some insight or prioritization on any urgent issues I
> > > > should focus on studying first.
> > > >
> > > > My observations:
> > > > * I do note that the performance documentation talks about the impact
> > > > of firewalld/stateful firewalls, but the legacy servers were using the
> > > > same basic setup. If the firewall is the problem, is there a way to
> > > > validate this (other than stopping firewalld and waiting)?
> > > > * The "worker" threads seem evenly distributed to my novice eye, and
> > > > our qps (queries per second) rate is low, as I would expect since the
> > > > name servers are internal-only resources.
> > > > * I ran a few pcaps and rec_control trace-regex for specific domain
> > > > items being reported as problematic. Everything seemed to be working,
> > > > with the trace-regex always showing "Step3 Final resolve: No Error/6
> > > > or 8".
> > > > Thank you in advance for your time and consideration.
> > > >
> > > > Sincerely,
> > > > Scotsie
> > > >
> > > > ```
> > > > Apr 17 16:07:28 nsrecdns01-1 pdns-recursor[1092]: msg="Periodic statistics report" subsystem="stats" level="0" prio="Info" tid="0" ts="1744920448.170" cache-entries="23666" negcache-entries="497" questions="6831695" record-cache-acquired="286931329" record-cache-contended="64414" record-cache-contended-perc="0.02" record-cache-hitratio-perc="0.87"
> > > > Apr 17 16:07:28 nsrecdns01-1 pdns-recursor[1092]: msg="Periodic statistics report" subsystem="stats" level="0" prio="Info" tid="0" ts="1744920448.170" packetcache-acquired="16887684" packetcache-contended="1019" packetcache-contended-perc="0.01" packetcache-entries="7112" packetcache-hitratio-perc="37.75"
> > > > Apr 17 16:07:28 nsrecdns01-1 pdns-recursor[1092]: msg="Periodic statistics report" subsystem="stats" level="0" prio="Info" tid="0" ts="1744920448.170" edns-entries="38" failed-host-entries="50" non-resolving-nameserver-entries="0" nsspeed-entries="968" saved-parent-ns-sets-entries="65" throttle-entries="8"
> > > > Apr 17 16:07:28 nsrecdns01-1 pdns-recursor[1092]: msg="Periodic statistics report" subsystem="stats" level="0" prio="Info" tid="0" ts="1744920448.170" concurrent-queries="1" dot-outqueries="0" idle-tcpout-connections="0" outgoing-timeouts="36594" outqueries="14668546" outqueries-per-query-perc="214.71" tcp-outqueries="3131" throttled-queries-perc="1.90"
> > > > Apr 17 16:07:28 nsrecdns01-1 pdns-recursor[1092]: msg="Periodic statistics report" subsystem="stats" level="0" prio="Info" tid="0" ts="1744920448.170" taskqueue-expired="0" taskqueue-pushed="540" taskqueue-size="0"
> > > > Apr 17 16:07:28 nsrecdns01-1 pdns-recursor[1092]: msg="Queries handled by thread" subsystem="stats" level="0" prio="Info" tid="0" ts="1744920448.170" count="3470098" thread="0" tname="worker"
> > > > Apr 17 16:07:28 nsrecdns01-1 pdns-recursor[1092]: msg="Queries handled by thread" subsystem="stats" level="0" prio="Info" tid="0" ts="1744920448.170" count="3360836" thread="1" tname="worker"
> > > > Apr 17 16:07:28 nsrecdns01-1 pdns-recursor[1092]: msg="Queries handled by thread" subsystem="stats" level="0" prio="Info" tid="0" ts="1744920448.171" count="764" thread="2" tname="tcpworker"
> > > > Apr 17 16:07:28 nsrecdns01-1 pdns-recursor[1092]: msg="Periodic QPS report" subsystem="stats" level="0" prio="Info" tid="0" ts="1744920448.171" averagedOver="1800" qps="117"
> > > > ```
> > >
> > > > _______________________________________________
> > > > Pdns-users mailing list
> > > > Pdns-users@mailman.powerdns.com
> > > > https://mailman.powerdns.com/mailman/listinfo/pdns-users
> > >
> > >
>