Hi BIND,

I’ve been trying to track down the source of intermittent latency on our 
production servers, without much luck. At random intervals - several times an 
hour - named appears to suddenly stop processing queries for anywhere up to 
about 2500 ms, only to resume moments later. This of course shows up as added 
latency in response times.

When the glitch occurs, we can watch the rx_queue in /proc/net/udp fill up. 
That more or less rules out the kernel network stack, up to named’s socket 
buffer, because packets are arriving with no issues (packet traces 
corroborate this as well).
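
For anyone who wants to watch the same counter, here is a minimal sketch of 
how the rx_queue column can be polled. It assumes the usual /proc/net/udp 
layout (hex local address and port in the second column, hex tx_queue:rx_queue 
in the fifth) and that named listens on port 53. The function names 
rx_queues_for_port and watch, the polling interval, and the threshold are 
illustrative only, not part of any tool.

```python
import time


def rx_queues_for_port(proc_net_udp_text, port=53):
    """Return (local_address, rx_queue_bytes) for sockets bound to `port`.

    `proc_net_udp_text` is the raw contents of /proc/net/udp.
    """
    results = []
    # The first line is the column header; skip it.
    for line in proc_net_udp_text.splitlines()[1:]:
        fields = line.split()
        if len(fields) < 5:
            continue
        # Column 2 is "HEXADDR:HEXPORT" for the local end of the socket.
        _addr_hex, port_hex = fields[1].rsplit(":", 1)
        if int(port_hex, 16) != port:
            continue
        # Column 5 is "tx_queue:rx_queue", both hexadecimal byte counts.
        _tx_hex, rx_hex = fields[4].split(":")
        results.append((fields[1], int(rx_hex, 16)))
    return results


def watch(port=53, interval=0.05, threshold=1):
    """Poll /proc/net/udp and print a line whenever rx_queue >= threshold.

    The interval and threshold are arbitrary starting values (assumptions).
    """
    while True:
        with open("/proc/net/udp") as f:
            text = f.read()
        for local, rx in rx_queues_for_port(text, port):
            if rx >= threshold:
                print("%.3f %s rx_queue=%d" % (time.time(), local, rx))
        time.sleep(interval)
```

Calling watch() in a terminal while the glitch happens should print 
timestamped lines that can be lined up against packet traces and named’s logs.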

Simultaneously, named’s CPU usage drops to 0%, and a stack trace captured at 
that moment looks identical to that of an idle server. This seems to suggest 
that the issue is likely not inside named. It’s as if named isn’t getting 
notified about the new packets, but I haven’t been able to find any known 
issues with epoll, and this could be a “red herring” anyway.

Other bits of info that might be relevant:
* We’ve updated to BIND 9.9.7 with no effect.
* The OS is RHEL 6.6; we just updated the kernel to 2.6.32-504.16.2.el6.x86_64, 
also with no effect.
* The issue is vaguely load dependent, although it’s not clear what kind of 
load, as we haven’t yet been able to reproduce it in a dev environment.
* That being said, our load does not seem at all high. Generally < 5000 QPS, 
load average < 0.1, > 90% idle CPU.
* Nothing stands out in logs from trace 3 / querylog, except, perhaps, the fact 
that there are never any logs at all during the glitch.
* Here is a typical stack trace during the glitch: 
http://pastebin.com/raw.php?i=JZhrPSFv

Anyone have any thoughts about what to look at next?

Thanks in advance,

Mathew Eis
