On Fri, Mar 15, 2024 at 05:25:20PM +0100, Otto Moerbeek via Pdns-users wrote:
> > On 15 Mar 2024, at 17:01, Tim Burns via Pdns-users
> > <pdns-users@mailman.powerdns.com> wrote:
> > 
> > Hello all, I’m experiencing a performance degradation while using the
> > Recursor that I haven’t been able to root cause, and I was hoping to get
> > some insight into what might be causing it, or some troubleshooting and
> > diagnostics tips.
> > 
> > The observed issue is that the Recursor would start up and receive and
> > forward traffic for several minutes; then the CPU usage would suddenly
> > jump to 100%, rec_control would start timing out trying to talk to the
> > process, and Prometheus metrics scrapes would stop succeeding. The
> > Recursor also stopped reporting statistics at the
> > logging.statistics_interval, although if I added a debug Lua script it
> > would still log messages from the threads.
> > 
> > I’m running the Recursor as a container in Kubernetes with the 5.0.3
> > image from Docker Hub. The replicas receive anywhere from 2-6k QPS
> > nominally, but this issue seems to occur regardless of query rate.
> > Before the metrics freeze, the instance only uses 10% of its allocated
> > CPU and about 20% of allocated memory, and then gets pegged to 100% CPU
> > usage and stops responding. This happens anywhere from several minutes
> > to an hour after process start.
> > 
> > Below is the configuration used. We’re using this as a middleman for now
> > to forward requests upstream to larger recursors and cache the results,
> > and it is only accessible from within our network.
> > 
> > recursor:
> >   forward_zones_recurse:
> >     - zone: .
> >       forwarders:
> >         - 8.8.8.8
> >         - 1.1.1.1
> >         - 8.8.4.4
> >         - 1.0.0.1
> >         - 208.67.222.222
> >         - 208.67.220.220
> >   threads: 8
> >   security_poll_suffix: ''
> >   server_id: ''
> >   version_string: ''
> >   any_to_tcp: true
> > dnssec:
> >   validation: process-no-validate
> > incoming:
> >   allow_from:
> >     - 0.0.0.0/0
> >   listen:
> >     - 0.0.0.0
> >   port: 2353
> > logging:
> >   disable_syslog: true
> > recordcache:
> >   max_entries: 500000
> > webservice:
> >   address: 0.0.0.0
> >   allow_from:
> >     - 10.0.0.0/8
> >     - 172.16.0.0/12
> >     - 127.0.0.1
> >   loglevel: none
> >   port: 8912
> >   webserver: true
> > 
> > While attempting to mitigate this live, I took a variety of actions to
> > try to root cause the issue, but none of them made a difference and we
> > observed the same failure mode each time. Actions taken:
> > - Disabling DNSSEC entirely with dnssec.validation: off
> > - Reducing threads from 8 -> 4; we had scaled this up to 8 because the
> >   default of 2 would lead to a high volume of UdpInErrors after some time
> > - Turning off TCP, only receiving queries over UDP
> > - Enabling logging.common_errors to see if there was anything
> >   interesting, but nothing notable happened around the time of the freeze
> > 
> > Notably, rec_control would report the error “Timeout waiting for control
> > channel data”, and when I attempted to turn on trace logging for
> > SERVFAILs (logging.trace: fail), the logs indicated “Could not dup trace
> > file” with reason “Bad file descriptor”.
> > 
> > Let me know if there’s any other information I can provide. I haven’t
> > been able to correlate this behavior with any unusual metrics before the
> > freeze, as everything appeared reasonably nominal, but I’m happy to
> > provide the last Prometheus scrape we got if it would help. Any ideas?
> > 
> > Thank you!
> > Tim

> Try running, when the recursor is pegged:
> 
> perf top
> 
> Or attach gdb and run
> 
> thread apply all bt
> 
> on gdb's command line.
> 
> This might not be possible in a container (I don’t know that), but you
> might be able to install the perf tools or gdb in the container. Or
> otherwise try to reproduce in an ordinary VM or bare metal.
> 
> 	-Otto
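In case it helps, a rough sketch of how one might do that from inside
Kubernetes without rebuilding the image, assuming the cluster supports
ephemeral debug containers (the pod and container names are placeholders,
and the debug container will likely need the SYS_PTRACE capability before
gdb or perf can attach):

  # Start a debug shell sharing the recursor container's process namespace
  kubectl debug -it <recursor-pod> --image=debian:bookworm \
      --target=<recursor-container> -- bash

  # Inside the debug container: install the tools, then dump a backtrace
  # of every thread of the pdns_recursor process
  apt-get update && apt-get install -y gdb procps
  gdb -batch -ex 'thread apply all bt' -p "$(pgrep pdns_recursor)"

  # Or sample where the CPU time goes (perf may additionally need
  # perf_event_paranoid relaxed or CAP_PERFMON)
  apt-get install -y linux-perf
  perf top -p "$(pgrep pdns_recursor)"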
Oh, and some questions: in addition to collecting Prometheus metrics, are
you periodically querying using rec_control or the web server/web-api for
other things? If so, what exactly?

	-Otto
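For readers following along, the kind of periodic polling being asked about
would look roughly like the following; port 8912 comes from the webservice
section of the posted config, and the REST call additionally assumes
webservice.api_key is set, which the posted config does not include:

  # Statistics over the control socket (the channel that was timing out)
  rec_control get-all
  rec_control get concurrent-queries

  # Prometheus-format metrics from the built-in webserver
  curl -s http://127.0.0.1:8912/metrics

  # JSON statistics via the REST API, only available with an API key
  curl -s -H 'X-API-Key: <key>' \
      http://127.0.0.1:8912/api/v1/servers/localhost/statistics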