On Fri, Mar 15, 2024 at 05:25:20PM +0100, Otto Moerbeek via Pdns-users wrote:

> > On 15 Mar 2024, at 17:01, Tim Burns via Pdns-users 
> > <pdns-users@mailman.powerdns.com> wrote the following:
> > 
> > Hello all, I’m experiencing a performance degradation while using the 
> > Recursor that I haven’t been able to root cause, and I was hoping to get 
> > some insight on what might be causing it, or some troubleshooting and 
> > diagnostics tips.
> >  
> > The observed issue is that the Recursor would start up and receive and 
> > forward traffic normally for several minutes; then CPU usage would suddenly 
> > jump to 100%, rec_control would start timing out when talking to the 
> > process, and Prometheus metrics scrapes would stop succeeding.
> > The Recursor would also stop reporting statistics at the 
> > logging.statistics_interval, although if I added a debug Lua script it 
> > would still log messages from the threads.
> >  
> > I’m running the Recursor as a container in Kubernetes with the 5.0.3 image 
> > from Docker Hub. The replicas nominally receive anywhere from 2k to 6k QPS, 
> > but this issue seems to occur regardless of query rate. Before the metrics 
> > freeze, an instance uses only about 10% of its allocated CPU and about 20% 
> > of its allocated memory; it then gets pegged at 100% CPU usage and stops 
> > responding. This happens anywhere from several minutes to an hour after 
> > process start.
> >  
> > Below is the configuration used. For now we're using this as a middleman to 
> > forward requests upstream to larger recursors and cache the results; it is 
> > only accessible from within our network.
> >  
> > recursor:
> >   forward_zones_recurse:
> >     - zone: .
> >       forwarders:
> >         - 8.8.8.8
> >         - 1.1.1.1
> >         - 8.8.4.4
> >         - 1.0.0.1
> >         - 208.67.222.222
> >         - 208.67.220.220
> >   threads: 8
> >   security_poll_suffix: ''
> >   server_id: ''
> >   version_string: ''
> >   any_to_tcp: true
> > dnssec:
> >   validation: process-no-validate
> > incoming:
> >   allow_from:
> >     - 0.0.0.0/0
> >   listen:
> >     - 0.0.0.0
> >   port: 2353
> > logging:
> >   disable_syslog: true
> > recordcache:
> >   max_entries: 500000
> > webservice:
> >   address: 0.0.0.0
> >   allow_from:
> >     - 10.0.0.0/8
> >     - 172.16.0.0/12
> >     - 127.0.0.1
> >   loglevel: none
> >   port: 8912
> >   webserver: true
> >  
> >  
> >  
> > While attempting to mitigate this live, I took a variety of actions to try 
> > to root-cause the issue, but none of them made a difference and we observed 
> > the same failure mode each time (the YAML form of most of these changes is 
> > sketched after this list).
> > Actions taken:
> > - Disabling DNSSEC entirely with dnssec.validation: off
> > - Reducing threads from 8 to 4; we had scaled this up to 8 because the 
> > default of 2 would lead to a high volume of UdpInErrors after some time
> > - Turning off TCP, so queries were only received over UDP
> > - Enabling logging.common_errors to see if there was anything interesting, 
> > but nothing notable was logged around the time of the freeze
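> > In the same YAML layout as above, the first, second, and fourth of these 
> > would look roughly like this (a sketch, not the full config):
> > 
> > recursor:
> >   threads: 4
> > dnssec:
> >   validation: off
> > logging:
> >   common_errors: true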
> >  
> > Notably, rec_control would report the error: “Timeout waiting for control 
> > channel data”, and when I attempted to turn on trace logging for SERVFAILs 
> > (logging.trace: fail), logs indicated “Could not dup trace file” with 
> > reason “Bad file descriptor”.
> >  
> > Let me know if there’s any other information I can provide. I haven’t been 
> > able to correlate this behavior to any unusual metrics before the freeze as 
> > everything appeared reasonably nominal, but I’m happy to provide the last 
> > Prometheus scrape we got if it would help. Any ideas?
> >  
> > Thank you!
> > Tim
> 
> When the recursor is pegged, try running
> 
>       perf top
> 
> Or attach gdb and run 
> 
>       thread apply all bt
> 
> on gdb's command line
> 
> This might not be possible in a container (I don't know), but you might be 
> able to install the perf tools or gdb in the container, or otherwise try to 
> reproduce the issue in an ordinary VM or on bare metal.
> 
>       -Otto
> 
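To expand on the container point quoted above, here is a rough sketch of
one way to get perf and gdb next to the recursor process in Kubernetes.
The pod, container, and image names are placeholders, and it assumes
ephemeral containers are available and the debug image can install tools:

      # Start a debug container that shares the process namespace of the
      # recursor container (names below are examples only).
      kubectl debug -it recursor-0 --image=debian:bookworm \
          --target=recursor -- bash

      # Inside the debug container:
      apt-get update && apt-get install -y gdb linux-perf
      PID=$(pgrep -o pdns_recursor)
      perf top -p "$PID"                               # live CPU profile
      gdb -p "$PID" -batch -ex 'thread apply all bt'   # per-thread backtraces

perf and the gdb attach may additionally need a relaxed
kernel.perf_event_paranoid / ptrace setting on the node, or extra
privileges on the debug container.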

Oh, and some questions: in addition to collecting Prometheus metrics, are
you periodically querying rec_control or the web server / web API for
other things? If so, what exactly?
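For example, anything along the lines of

      rec_control get-all

or HTTP calls to the webserver beyond the Prometheus /metrics scrape
(such as the /api/v1/servers/localhost/statistics endpoint)?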

        -Otto
_______________________________________________
Pdns-users mailing list
Pdns-users@mailman.powerdns.com
https://mailman.powerdns.com/mailman/listinfo/pdns-users
