Re: file descriptor exceeds limit

2015-06-18 Thread Matus UHLAR - fantomas

On 17.06.15 22:39, Shawn Zhou wrote:

> BIND on my resolvers reaches the max open file limit and I am getting lots
> of SERVFAILs
> http://pastebin.com/SxRsHLff
>
> After I increased max-socks to 8192 (-S 8192), I no longer saw the file
> limit error in the log; however, I am still seeing many SERVFAILs.


no other errors?


> Our resolvers were doing about 15k queries per second when this was
> happening, and that was legitimate traffic.  I am aware that I am setting
> recursive clients to a very high number.  The resolvers run on hardware
> with a 12-core CPU and 24 GB of RAM.  CPU utilization was at about 20%,
> with plenty of RAM left.
>
> I am wondering if I've reached the limit of BIND for the amount of
> recursive queries it can serve.  Any other tunings I should try?


maybe changing number of recursive-clients, max-clients-per-query.
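
As a rough illustration of where those knobs live, here is a minimal
named.conf sketch.  The values are assumptions for illustration only (1000,
10 and 100 are believed to be the stock defaults), not numbers recommended
in this thread:

```conf
options {
    // upper bound on simultaneous recursive lookups performed for clients
    recursive-clients 1000;
    // initial and maximum number of clients allowed to wait on a single
    // outstanding query for the same name and type
    clients-per-query 10;
    max-clients-per-query 100;
};
```

After a change, "rndc reconfig" reloads the configuration and "rndc status"
reports the current recursive clients count, which helps judge whether the
limit is actually being hit.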

Does EDNS work for you? EDNS problems often result in an increased number of
TCP queries, which slows down resolution ...


> By the way, the resolvers are running RHEL 6.x.


precise BIND version would help a bit more... seems RH6.6 contains 9.8.2 but
that may be different for older RH6 versions.


--
Matus UHLAR - fantomas, uh...@fantomas.sk ; http://www.fantomas.sk/
Warning: I wish NOT to receive e-mail advertising to this address.
Varovanie: na tuto adresu chcem NEDOSTAVAT akukolvek reklamnu postu.
LSD will make your ECS screen display 16.7 million colors
___
Please visit https://lists.isc.org/mailman/listinfo/bind-users to unsubscribe 
from this list

bind-users mailing list
bind-users@lists.isc.org
https://lists.isc.org/mailman/listinfo/bind-users


Re: file descriptor exceeds limit

2015-06-18 Thread Cathy Almond
On 18/06/2015 12:00, Matus UHLAR - fantomas wrote:
> On 17.06.15 22:39, Shawn Zhou wrote:
>> After I increased max-socks to 8192 (-S 8192), I no longer saw the
>> file limit error in the log; however, I am still seeing many SERVFAILs.
>
>> I am aware that I am setting recursive clients to a very high number.
>
> maybe changing number of recursive-clients, max-clients-per-query.

Unless you're running a build with --with-tuning=large (for which there
are a number of caveats around the capacity of the machine, etc.), then
you don't really want to have a backlog of recursive clients that
exceeds 3000-3500.  If you're getting that many in your backlog, then as
already highlighted to you, there is Something Wrong going on.

You're probably running into other resource limits, and those are
causing the SERVFAIL responses you're still seeing despite increasing
the maximum number of sockets that named can use.  I would tune the
recursive-clients limit down to 3000 and allow named to drop the oldest
outstanding client queries when new ones need to be processed.
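
In named.conf terms, that suggestion might look like the following sketch.
It assumes the limit under discussion is the recursive-clients option; 3000
is the figure from the paragraph above:

```conf
options {
    // cap the recursive-client backlog; as described above, named then
    // drops the oldest outstanding client queries to make room for new ones
    recursive-clients 3000;
};
```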

There is another logging category you can use (query-errors) that can
tell you more, but it's probably not worth it in this instance.
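
For anyone who does want to try it, a minimal logging sketch follows.  The
channel name, file path, rotation settings and debug level are all
assumptions for illustration; only the query-errors category name comes
from the text above:

```conf
logging {
    // hypothetical channel name and file location
    channel query_errors_log {
        file "/var/named/data/query-errors.log" versions 3 size 20m;
        // the detail logged for this category grows with the debug level
        severity debug 2;
        print-time yes;
    };
    category query-errors { query_errors_log; };
};
```

This can get verbose on a busy resolver, so it is probably best enabled
only while investigating.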

And I have another suggestion for what might be causing your backlog
(apart from problems in the network path between your servers and the
Internet authoritative servers), for which we have some
soon-to-be-released new mitigation features (in 9.10.3):

https://kb.isc.org/article/AA-01178

(this will be updated to reflect the features we will actually include
in the upcoming release - but they're essentially going to be
fetches-per-server and fetches-per-zone along with improved
logging/stats for both of those)
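
As a sketch of how such limits might be set once a release containing them
is installed: the two option names are the ones given above, while the
values and the exact syntax shown here are assumptions that should be
checked against the release notes of the version actually in use:

```conf
options {
    // hypothetical values: bound the number of simultaneous outstanding
    // fetches sent to any one authoritative server ...
    fetches-per-server 200;
    // ... and the number of simultaneous fetches for names in any one zone
    fetches-per-zone 200;
};
```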

There's going to be a webinar about both the problem and the mitigations
on July 8th:

https://www.facebook.com/events/100311766979499/

http://goo.gl/Z8idQf

Hoping that this is useful?

Cathy


Re: file descriptor exceeds limit

2015-06-18 Thread Mike Hoskins (michoski)
Inline... I'll respond to each of these, including Cathy's, soon (thanks to
the community for the responses).  Following with interest, as we've seen
this for a while, though we are possibly a special case, which I'll describe
more in another response.


On 6/18/15, 7:00 AM, "Matus UHLAR - fantomas"  wrote:

>On 17.06.15 22:39, Shawn Zhou wrote:
>>BIND on my resolvers reaches the max open file limit and I am getting
>>lots of SERVFAILs
>>http://pastebin.com/SxRsHLff
>
>>After I increased max-socks to 8192 (-S 8192), I no longer saw the file
>>limit error in the log; however, I am still seeing many SERVFAILs.
>
>no other errors?


When we've dug into it (really, the investigation is ongoing) we don't
notice anything "abnormal".  That means there are plenty of things being
logged, but nothing you don't always see in the modern world of broken DNS
servers, firewalls, network path, etc.


>>Our resolvers were doing about 15k queries per second when this was
>>happening, and that was legitimate traffic.  I am aware that I am setting
>>recursive clients to a very high number.  The resolvers run on hardware
>>with a 12-core CPU and 24 GB of RAM.  CPU utilization was at about 20%,
>>with plenty of RAM left.
>
>>I am wondering if I've reached the limit of BIND for the amount of
>> recursive queries it can serve.  Any other tunings I should try?
>
>maybe changing number of recursive-clients, max-clients-per-query.


Have tweaked all these repeatedly, first following community best practice
and then going for the sky (big iron) just to see what impact it had.
None really.


>Does EDNS work for you? EDNS problems often result in an increased number of
>TCP queries, which slows down resolution ...


Yeah, works fine and passes all tests (manual digs, OARC, etc).


>
>> By the way, the resolvers are running RHEL 6.x.
>
>precise BIND version would help a bit more... seems RH6.6 contains 9.8.2
>but that may be different for older RH6 versions.


We're running CentOS 6.x, but use the latest BIND 9.9.x releases.



RE: file descriptor exceeds limit

2015-06-18 Thread Stuart Browne

Just wondering.  You mention you're using RHEL6; are you also getting messages 
in 'dmesg' about connection tracking tables being full?  You may need some 
'NOTRACK' rules in your iptables.


STUART BROWNE
Senior Unix Administrator, Network Administrator, Database Admin
P   +61 9866 3710

www.bomboratech.com.au
Follow us on https://twitter.com/BomboraTech

The Bombora Technologies group of companies includes AusRegistry, ARI Registry 
Services, AusRegistry International and ZOAK Solutions.



Re: file descriptor exceeds limit

2015-06-18 Thread Mike Hoskins (michoski)
On 6/18/15, 7:09 PM, "Stuart Browne" wrote:


>Just wondering.  You mention you're using RHEL6; are you also getting
>messages in 'dmesg' about connection tracking tables being full?  You may
>need some 'NOTRACK' rules in your iptables.

Just following along, for the record...  On our side, iptables is
completely disabled.  We do that sort of thing upstream on dedicated
firewalls.  Just now getting time to reply to Cathy...more detail on that
there.



Re: file descriptor exceeds limit

2015-06-18 Thread Mike Hoskins (michoski)
Inline...

On 6/18/15, 9:22 AM, "Cathy Almond"  wrote:


>Unless you're running a build with --with-tuning=large (for which there
>are a number of caveats around the capacity of the machine, etc.), then
>you don't really want to have a backlog of recursive clients that
>exceeds 3000-3500.  If you're getting that many in your backlog, then as
>already highlighted to you, there is Something Wrong going on.


We're running --with-tuning=large, but I think we are OK (128GB RAM, 32
cores).  If there are other caveats to be aware of, please share.

For years I kept recursive clients conservatively set (based on some of
your docs, and community comments).  I finally raised it much higher just
to see what would happen (after having to repeatedly explain why blindly
increasing that number wasn't a good thing), and it had no effect one way
or another.  Still got the servfails.

We are in a somewhat unique situation, because we have batch type jobs
generating rules/etc which often purposefully crawl the "bad" parts of the
'Net and in turn generate DNS requests for things which legitimately
return servfail.  However, we were getting increasingly consistent
complaints from users about seeing servfails where they weren't expected.
The biggest thing which helped for us was increasing
ISC_SOCKET_MAXEVENTS.  We're still digging to see if the remaining
servfail reports are genuinely something we can tune around, or a symptom
of the use case.


>You're probably running into other resource limits, and those are
>causing the SERVFAIL responses you're still seeing despite increasing
>the maximum number of sockets that named can use.  I would tune the
>recursive-clients limit down to 3000 and allow named to drop the oldest
>outstanding client queries when new ones need to be processed.


I'm going to crank this back down in our environments.


>There is another logging category you can use (query-errors) that can
>tell you more, but it's probably not worth it in this instance.
>
>And I have another suggestion for what might be causing your backlog
>(apart from problems in the network path between your servers and the
>Internet authoritative servers), for which we have some
>soon-to-be-released new mitigation features (in 9.10.3):
>
>https://kb.isc.org/article/AA-01178
>
>(this will be updated to reflect the features we will actually include
>in the upcoming release - but they're essentially going to be
>fetches-per-server and fetches-per-zone along with improved
>logging/stats for both of those)
>
>There's going to be a webinar about both the problem and the mitigations
>on July 8th:
>
>https://www.facebook.com/events/100311766979499/
>
>http://goo.gl/Z8idQf


Looking forward to this.  We've been sticking to 9.9.x (currently running
9.9.7) as an ESV release, but maybe 9.10 makes sense.  Not sure how the
community feels about that?

For the record I've spent a lot of time with our network team looking at
firewall logs, getting packet traces, etc and not found any smoking guns.
We have a perhaps not so unique setup where the caches are in a DMZ, so
clients talk through a firewall, and the DNS servers talk through a
firewall.  I've identified and fixed a number of issues along the
way...enumerating here in case it helps anyone else.

The internal firewall was oversubscribed, and at peak times would reset
connections causing clients to retry which quickly wound up recursive
clients.  Replaced those firewalls, and that specific behavior got a lot
better.

The external firewall was sharing a PAT for all caches, which eventually
exhausted 65k ports.  Can't drop these direct on the 'Net for security
reasons, but now have 1-to-1 NAT per cache and haven't seen this exact
b

dnssec validation issue

2015-06-18 Thread Carl Byington
-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

I have multiple centos6 boxes running 9.10.2-P1, and almost everything
looks good. However, one box seems to not be doing dnssec validation. It
is possible that this behavior predates the latest updates and I just
never noticed it.

A and B have essentially identical configuration, except that A is the
master for some zones, and B is the slave pulling from A. Other than
that, the /etc/named.conf is identical. A also has ipv6 connectivity,
and B does not. The authoritative side works nicely on both. The
recursive resolver is where the difference shows up.

On A:

dig www.dnssec-failed.org  @localhost
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 19813
;; flags: qr rd ra; QUERY: 1, ANSWER: 2, AUTHORITY: 5, ADDITIONAL: 11
;; ANSWER SECTION:
www.dnssec-failed.org.  7178  IN  A  68.87.109.242
www.dnssec-failed.org.  7178  IN  A  69.252.193.191



On B:
dig www.dnssec-failed.org  @localhost
;; ->>HEADER<<- opcode: QUERY, status: SERVFAIL, id: 4969
;; flags: qr rd ra; QUERY: 1, ANSWER: 0, AUTHORITY: 0, ADDITIONAL: 1


/etc/named.conf:

options {
    directory "/var/named";
    allow-recursion { "friends"; };
    dnssec-enable yes;
    dnssec-validation yes;
    bindkeys-file "/etc/named.iscdlv.key";
    managed-keys-directory "/var/named/dynamic";
    listen-on-v6 { any; };
    ixfr-from-differences yes;
    max-journal-size 2m;
    notify yes;
    response-policy { zone "rpz.five-ten-sg.com"; }
        qname-wait-recurse no;
    filter-aaaa-on-v4 yes;
    filter-aaaa { "brokenv6"; };
    rate-limit {
        responses-per-second 5;
        errors-per-second    5;
        nxdomains-per-second 40;
        qps-scale            300;
        exempt-clients { "friends"; };
    };
};


A is neither master nor slave for dnssec-failed.org, and that domain is
not mentioned in the rpz zone.




-----BEGIN PGP SIGNATURE-----
Version: GnuPG v2.0.14 (GNU/Linux)

iEYEARECAAYFAlWDYtAACgkQL6j7milTFsHClQCeLKkTuQYlM4liB0UECG5Z4pui
ujMAnj4wnUWqJj258pIlUFo0IONtkkEP
=/QDW
-----END PGP SIGNATURE-----




Re: dnssec validation issue

2015-06-18 Thread Mark Andrews

In message <1434674101.18744.119.ca...@ns.five-ten-sg.com>, Carl Byington writes:
> -----BEGIN PGP SIGNED MESSAGE-----
> Hash: SHA1
> 
> I have multiple centos6 boxes running 9.10.2-P1, and almost everything
> looks good. However, one box seems to not be doing dnssec validation. It
> is possible that this behavior predates the latest updates and I just
> never noticed it.
> 
> A and B have essentially identical configuration, except that A is the
> master for some zones, and B is the slave pulling from A. Other than
> that, the /etc/named.conf is identical. A also has ipv6 connectivity,
> and B does not. The authoritative side works nicely on both. The
> recursive resolver is where the difference shows up.
> 
> On A:
> 
> dig www.dnssec-failed.org  @localhost
> ;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 19813
> ;; flags: qr rd ra; QUERY: 1, ANSWER: 2, AUTHORITY: 5, ADDITIONAL: 11
> ;; ANSWER SECTION:
> www.dnssec-failed.org.  7178  IN  A  68.87.109.242
> www.dnssec-failed.org.  7178  IN  A  69.252.193.191
> 
> 
> 
> On B:
> dig www.dnssec-failed.org  @localhost
> ;; ->>HEADER<<- opcode: QUERY, status: SERVFAIL, id: 4969
> ;; flags: qr rd ra; QUERY: 1, ANSWER: 0, AUTHORITY: 0, ADDITIONAL: 1
> 

You don't have any trust anchors active.

To use the keys in "/etc/named.iscdlv.key" set "dnssec-validation auto;"
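
Applied to the configuration from the original post, that would be a
one-line change.  A sketch, with the unrelated options omitted:

```conf
options {
    dnssec-enable yes;
    // was "yes", which relies on trust anchors being configured explicitly
    dnssec-validation auto;
    bindkeys-file "/etc/named.iscdlv.key";
    managed-keys-directory "/var/named/dynamic";
};
```

Once validation is active, "dig www.dnssec-failed.org @localhost" would be
expected to return SERVFAIL, since that domain is deliberately signed
incorrectly.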

> /etc/named.conf:
> 
> options {
>     directory "/var/named";
>     allow-recursion { "friends"; };
>     dnssec-enable yes;
>     dnssec-validation yes;
>     bindkeys-file "/etc/named.iscdlv.key";
>     managed-keys-directory "/var/named/dynamic";
>     listen-on-v6 { any; };
>     ixfr-from-differences yes;
>     max-journal-size 2m;
>     notify yes;
>     response-policy { zone "rpz.five-ten-sg.com"; }
>         qname-wait-recurse no;
>     filter-aaaa-on-v4 yes;
>     filter-aaaa { "brokenv6"; };
>     rate-limit {
>         responses-per-second 5;
>         errors-per-second    5;
>         nxdomains-per-second 40;
>         qps-scale            300;
>         exempt-clients { "friends"; };
>     };
> };
> 
> 
> A is neither master nor slave for dnssec-failed.org, and that domain is
> not mentioned in the rpz zone.
-- 
Mark Andrews, ISC
1 Seymour St., Dundas Valley, NSW 2117, Australia
PHONE: +61 2 9871 4742 INTERNET: ma...@isc.org


Re: dnssec validation issue

2015-06-18 Thread Carl Byington
-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

On Fri, 2015-06-19 at 11:10 +1000, Mark Andrews wrote:

> You don't have any trust anchors active.

> To use the keys in "/etc/named.iscdlv.key" set "dnssec-validation
> auto;"

Thanks!!

New centos rpms at http://www.five-ten-sg.com/mapper/bind with a default
named.conf that should actually work.


-----BEGIN PGP SIGNATURE-----
Version: GnuPG v2.0.14 (GNU/Linux)

iEYEARECAAYFAlWDfboACgkQL6j7milTFsEsYgCcDCJgzbdD4quzkp8tI+hFIsfq
oQAAnRTCvYt4K9t98AjGnruiJqTxAj5y
=DOlX
-----END PGP SIGNATURE-----




Re: dnssec validation issue

2015-06-18 Thread Eray Aslan
On Thu, Jun 18, 2015 at 07:26:28PM -0700, Carl Byington wrote:
> On Fri, 2015-06-19 at 11:10 +1000, Mark Andrews wrote:
> > To use the keys in "/etc/named.iscdlv.key" set "dnssec-validation
> > auto;"
> New centos rpms at http://www.five-ten-sg.com/mapper/bind with a default
> named.conf that should actually work.

With the root zone and most TLDs signed, I do not think it makes sense
to use DLV anymore.  While a typical DNSSEC resolver configuration has
DLV enabled, I personally make the effort to disable it.
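
One way to express that in named.conf, as a sketch: validation stays on via
the root trust anchor while lookaside validation is switched off.  Simply
leaving dnssec-lookaside out should have the same effect if the packaged
configuration does not enable it; the explicit form is shown only to make
the intent visible:

```conf
options {
    dnssec-validation auto;   // validate from the root trust anchor
    dnssec-lookaside no;      // do not consult a DLV registry
};
```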

-- 
Eray