Hello,
It may or may not be relevant, but this sounds similar to a problem we
had to solve a few months ago. Try the following query analysis: monitor
the number of recursive queries at any given moment, and when it exceeds
a certain threshold, send "rndc recursing" to BIND and have a look at
the queries. Basically, we found out that there is an ongoing attack
originating from China with the following structure: a number of bogus
domains are registered, like "345qp.com.cn", etc., then the target
nameservers are listed as authoritative for them, and vast botnets of
infected home routers/modems are told to send bogus queries for these
domains. Your resolvers will start having the problems you describe when
the admin of the attacked authoritative servers realizes what's going on
and stops responding to queries for these domains. That means your
resolvers have to wait for the timeout of each and every one of these
bogus queries, and each pending query ties up some memory and processing
time in the meantime. It adds up rather quickly, potentially
overwhelming your hardware (basically, it's a huge abnormal peak
compared with normal operation).
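
To illustrate the analysis step, here is a minimal Python sketch that counts pending recursive queries per domain suffix from the dump written by "rndc recursing". The line format shown in the comment is an assumption about what the dump looks like and may differ between BIND versions; the function names, the sample lines, the `labels` depth and the threshold are all hypothetical and only there for illustration.

```python
import re
from collections import Counter

# Assumed shape of a line in the "rndc recursing" dump (check your own file):
# ; client 192.0.2.10#53124: id 4711 'aaa.345qp.com.cn/A/IN' requesttime 12
QUERY_RE = re.compile(r"'([^'/]+)/[^'/]+/[^'/]+'")


def count_by_suffix(lines, labels=3):
    """Count pending recursive queries per domain suffix of `labels` labels."""
    counts = Counter()
    for line in lines:
        match = QUERY_RE.search(line)
        if not match:
            continue  # header or other non-query line
        name = match.group(1).rstrip(".").lower()
        suffix = ".".join(name.split(".")[-labels:])
        counts[suffix] += 1
    return counts


def suspects(counts, threshold):
    """Return the suffixes whose pending-query count exceeds the threshold."""
    return sorted(d for d, n in counts.items() if n > threshold)


# Hypothetical sample data, not taken from a real dump.
sample = [
    "; client 192.0.2.10#53124: id 4711 'aaa.345qp.com.cn/A/IN' requesttime 12",
    "; client 192.0.2.11#40022: id 4712 'bbb.345qp.com.cn/A/IN' requesttime 11",
    "; client 192.0.2.12#40100: id 4713 'ccc.345qp.com.cn/A/IN' requesttime 9",
    "; client 192.0.2.13#51000: id 4714 'www.example.org/AAAA/IN' requesttime 1",
]
counts = count_by_suffix(sample)
print(suspects(counts, threshold=2))
```

Grouping by a fixed number of trailing labels is a simplification; a real deployment would probably want something smarter about where the registered domain starts.
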
The solution we chose is to identify these bogus queries (they vastly
outnumber legitimate queries) and to sort of "blacklist" the given bogus
domain for a period of time, in the sense that we no longer perform a
recursive query for the client but immediately and authoritatively
answer NXDOMAIN. While this is a deviation from the correct behavior, it
conserves the resources of the resolver, because an immediate
authoritative answer takes a fraction of the time, memory and CPU that a
recursive lookup does. False positives are of course possible, but with
some degree of monitoring and whitelisting of problematic domains (like
google.com, yahoo.com, etc.), they can be kept rather rare.
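
The blacklisting logic can be sketched roughly as follows. This is not our production code, only a minimal Python illustration of the idea: domains whose pending-query count exceeds a threshold are blocked for a limited time unless they are whitelisted. The class name, the threshold and the expiry time are hypothetical, and how the resolver actually returns the NXDOMAIN answer is deliberately left out.

```python
import time


class TemporaryBlocklist:
    """Time-limited blocklist of domains, with a whitelist override."""

    def __init__(self, threshold, ttl_seconds, whitelist=()):
        self.threshold = threshold      # pending queries needed to block
        self.ttl = ttl_seconds          # how long a domain stays blocked
        self.whitelist = {d.lower() for d in whitelist}
        self.expires = {}               # domain -> expiry timestamp

    def _whitelisted(self, name):
        return any(name == w or name.endswith("." + w) for w in self.whitelist)

    def update(self, counts, now=None):
        """Block every non-whitelisted domain whose count exceeds the threshold."""
        now = time.time() if now is None else now
        for domain, n in counts.items():
            if n > self.threshold and not self._whitelisted(domain):
                self.expires[domain] = now + self.ttl

    def is_blocked(self, name, now=None):
        """True if `name` is, or lies under, a currently blocked domain."""
        now = time.time() if now is None else now
        name = name.rstrip(".").lower()
        for domain, expiry in list(self.expires.items()):
            if expiry <= now:
                del self.expires[domain]  # entry has expired, forget it
            elif name == domain or name.endswith("." + domain):
                return True
        return False


# Hypothetical numbers for illustration only.
bl = TemporaryBlocklist(threshold=100, ttl_seconds=600, whitelist=["google.com"])
bl.update({"345qp.com.cn": 5000, "google.com": 9000, "example.org": 3}, now=1000)
print(bl.is_blocked("x.345qp.com.cn", now=1001))
print(bl.is_blocked("www.google.com", now=1001))
```

The expiry matters because the block is only meant to bridge the attack; once it passes, the domain resolves normally again without manual cleanup.
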
Hope this helps, don't hesitate to ask me for details. I think it may be
relevant to your situation.
--
Best Regards,
Daniel Ryšlink
System Administrator
Dial Telecom a. s.
Křižíkova 36a/237
186 00 Praha 3, Česká Republika
Tel.:+420.226204627
daniel.rysl...@dialtelecom.cz
-----------------------------------------------
www.dialtelecom.cz
Dial Telecom, a.s.
Jednoduše se připojte
-----------------------------------------------
On 11/24/2014 12:37 PM, Niall O'Reilly wrote:
> At Sun, 23 Nov 2014 21:00:15 -0800 (PST),
> blrmaani wrote:
>> Our nameservers take up to 10K QPS (mostly NOERROR type most of the time).
>> Twice or thrice a week, I have seen up to 10% of the queries are
>> SERVFAIL and we have started exceeding the default value of 2000 for
>> recursive-clients settings in BIND 9.9.x.
>> Is there a recommended value for recursive-clients option assuming
>> huge number of SERVFAIL queries once every 2-3 days?
>> I'm not convinced to increase it to some arbitrary huge number
>> 20,000 or 200,000.
>> I am looking for answer like - if your peak SERVFAIL queries are
>> 2000/second, then your recursive-clients value should be N.
> I wouldn't expect that such an answer could make sense.
> Exhaustion of the active recursive-clients list and the generation
> of responses marked SERVFAIL are most likely different symptoms of
> the same problem. I think you'll need to identify this problem and
> then determine what action to take.
> Your resolver seems to be dealing with queries which are
> unanswerable and which are arriving in a quantity sufficient to fill
> the recursive-clients list. This may be due to rogue clients,
> misconfigured authoritative servers, network problems, or some
> combination of these. Your logs will help identify which.
> I hope this helps.
> Niall O'Reilly
_______________________________________________
Please visit https://lists.isc.org/mailman/listinfo/bind-users to unsubscribe
from this list
bind-users mailing list
bind-users@lists.isc.org
https://lists.isc.org/mailman/listinfo/bind-users