Hello all, I’m experiencing performance degradation with the Recursor that I
haven’t been able to root cause, and I was hoping to get some insight into
what might be causing it, or some troubleshooting and diagnostics tips.
The observed issue is that the Recursor starts up and receives and forwards
traffic normally for several minutes, then CPU usage suddenly jumps to 100%,
rec_control starts timing out when talking to the process, and Prometheus
metrics scrapes stop succeeding.
The Recursor also stops reporting statistics at the
logging.statistics_interval, although when I add a debug Lua script it still
logs messages from the threads.
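For context, the debug script is nothing fancy, just a preresolve hook that logs every query a worker thread picks up so I can confirm the threads are still processing; a simplified version (not the exact script) looks like this:

-- loaded via recursor.lua_dns_script; logs each query a worker thread handles
function preresolve(dq)
  pdnslog("preresolve: " .. dq.qname:toString() .. " qtype " .. dq.qtype, pdns.loglevels.Info)
  return false  -- continue normal resolution
end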
I’m running the Recursor as a container in Kubernetes with the 5.0.3 image from
Docker Hub. The replicas receive anywhere from 2-6k QPS nominally, but the
issue seems to occur regardless of query rate. Before the freeze, an instance
uses only about 10% of its allocated CPU and roughly 20% of its allocated
memory; it then gets pegged at 100% CPU and stops responding. This happens
anywhere from several minutes to an hour after process start.
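For completeness, the relevant part of the pod spec boils down to roughly the following (simplified; the image path is from memory and the resource figures are placeholders rather than our exact limits):

containers:
  - name: recursor
    image: powerdns/pdns-recursor-50:5.0.3   # Docker Hub image, path from memory
    ports:
      - containerPort: 2353
        protocol: UDP
      - containerPort: 8912                  # webservice, scraped by Prometheus
    resources:
      limits:
        cpu: "4"      # placeholder
        memory: 2Gi   # placeholder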
Below is the configuration. For now we’re using this instance as a middleman
that forwards requests upstream to larger recursors and caches the results,
and it is only reachable from within our network.
recursor:
  forward_zones_recurse:
    - zone: .
      forwarders:
        - 8.8.8.8
        - 1.1.1.1
        - 8.8.4.4
        - 1.0.0.1
        - 208.67.222.222
        - 208.67.220.220
  threads: 8
  security_poll_suffix: ''
  server_id: ''
  version_string: ''
  any_to_tcp: true
dnssec:
  validation: process-no-validate
incoming:
  allow_from:
    - 0.0.0.0/0
  listen:
    - 0.0.0.0
  port: 2353
logging:
  disable_syslog: true
recordcache:
  max_entries: 500000
webservice:
  address: 0.0.0.0
  allow_from:
    - 10.0.0.0/8
    - 172.16.0.0/12
    - 127.0.0.1
  loglevel: none
  port: 8912
  webserver: true
While trying to mitigate this live, I took a number of actions to narrow down
the cause, but none of them made a difference; we observed the same failure
mode each time.
Actions taken:
- Disabled DNSSEC entirely with dnssec.validation: off
- Reduced threads from 8 to 4 (we had scaled this up to 8 because the default
  of 2 led to a high volume of UdpInErrors after some time)
- Turned off TCP, so queries were only received over UDP
- Enabled logging.common_errors to see if anything interesting showed up, but
  nothing notable was logged around the time of the freeze
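For reference, the DNSSEC, threads, and common_errors changes were plain settings edits, roughly as below (each tried on its own rather than all at once):

dnssec:
  validation: 'off'     # quoted so it isn't parsed as a YAML boolean
recursor:
  threads: 4
logging:
  common_errors: true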
Notably, during the hang rec_control reports “Timeout waiting for control
channel data”, and when I tried to turn on trace logging for SERVFAILs
(logging.trace: fail), the logs showed “Could not dup trace file” with reason
“Bad file descriptor”.
Let me know if there’s any other information I can provide. I haven’t been able
to correlate this behavior to any unusual metrics before the freeze as
everything appeared reasonably nominal, but I’m happy to provide the last
Prometheus scrape we got if it would help. Any ideas?
Thank you!
Tim