Recursive bind becomes unresponsive with high load

2016-03-31 Thread Michael Brunnbauer

hi all,

I am using bind on a server that does massive crawling with a multithreaded 
Java app. This server occasionally has to do lookups for hosts in our local
zone netestate.de - for which it is not authoritative - and those lookups tend
to fail when the load is high (e.g. >1000 recursing clients). This suggests 
some kind of congestion.

I have verified that the authoritative name servers for our local zone are not
hammered with requests from the bind instance in question (adding . to every
hostname is important :-) I also have verified that lookups from the crawlers
for the local zone on the lo interface are not excessive. The problem occurs
even before max-cache-size is reached.
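
The trailing-dot remark above can be illustrated with a small sketch. It assumes a glibc-style stub resolver, which treats a name ending in "." as absolute and does not retry it with the domain/search suffix from /etc/resolv.conf (so a failed lookup for some crawled host is not retried under netestate.de). The helper name make_absolute is hypothetical and not part of the crawler.

```python
def make_absolute(hostname: str) -> str:
    """Return the hostname with exactly one trailing dot.

    Assumption: the stub resolver skips the resolv.conf search list
    for names that end in a dot.
    """
    return hostname.rstrip(".") + "."


# Names with and without a trailing dot end up in the same absolute form.
for name in ("www.netestate.de", "fnnd0u.ciptdd.cn."):
    print(make_absolute(name))
```

Normalizing every hostname this way before the lookup keeps the decision in one place instead of relying on each caller to remember the dot.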

Here is my setup:

max-cache-size 1610612736;
recursive-clients 6000;
minimal-responses yes;

Mar 31 14:04:51 bardolino named[1506]: starting BIND 9.10.3-P2 -t /etc/namedroot -u named
Mar 31 14:04:51 bardolino named[1506]: built with '--prefix=/usr/local/bind' '--with-openssl=/usr/lib/ssl' '--enable-threads' '--with-tuning=large'
Mar 31 14:04:51 bardolino named[1506]:
Mar 31 14:04:51 bardolino named[1506]: BIND 9 is maintained by Internet Systems Consortium,
Mar 31 14:04:51 bardolino named[1506]: Inc. (ISC), a non-profit 501(c)(3) public-benefit
Mar 31 14:04:51 bardolino named[1506]: corporation.  Support and training for BIND 9 are
Mar 31 14:04:51 bardolino named[1506]: available at https://www.isc.org/support
Mar 31 14:04:51 bardolino named[1506]:
Mar 31 14:04:51 bardolino named[1506]: adjusted limit on open files from 65536 to 1048576
Mar 31 14:04:51 bardolino named[1506]: found 4 CPUs, using 4 worker threads
Mar 31 14:04:51 bardolino named[1506]: using 2 UDP listeners per interface
Mar 31 14:04:51 bardolino named[1506]: using up to 21000 sockets

/etc/resolv.conf:

 domain netestate.de
 nameserver 127.0.0.1
 options timeout:10 attempts:1

The problem also occurs with unchanged options (timeout:5 attempts:2).

I can control the number of DNS-threads of my crawling app and have tested it
with up to ca. 3500 recursing clients which results in a number of queries/s
of the same magnitude. With that setup, lookup errors for the local zone 
occur very often (the TTL for the local zone is 10 minutes).

I would be grateful for advice on where to search or what to adjust.

Here is a statistics dump while running with ca. 1000 recursing clients. A
high number of failing queries may be natural - we have a high number of
Chinese link farms in our database.

+++ Statistics Dump +++ (1459439461)
++ Incoming Requests ++
 7329332 QUERY
++ Incoming Queries ++
 7261964 A
1357 NS
   4 CNAME
 635 PTR
   7 MX
   65365 
++ Outgoing Queries ++
[View: default]
15552970 A
2022 NS
  78 CNAME
  30 PTR
   7 MX
   28796 
[View: _bind]
++ Name Server Statistics ++
 7329332 IPv4 requests received
  192360 requests with EDNS(0) received
   4 TCP requests received
 605 auth queries rejected
   1 recursive queries rejected
 7327981 responses sent
   5 truncated responses sent
  192358 responses with EDNS(0) sent
 6063138 queries resulted in successful answer
 6386951 queries resulted in non authoritative answer
  115630 queries resulted in nxrrset
  940424 queries resulted in SERVFAIL
  208183 queries resulted in NXDOMAIN
 6756330 queries caused recursion
   3 duplicate queries received
 348 queries dropped
 606 other query failures
1000 recursing clients
 7328722 UDP queries received
   4 TCP queries received
++ Zone Maintenance Statistics ++
++ Resolver Statistics ++
[Common]
  33 mismatch responses received
 999 UDP queries in progress
   1 TCP queries in progress
[View: default]
15583903 IPv4 queries sent
 6182728 IPv4 responses received
  201626 NXDOMAIN received
   14456 SERVFAIL received
  46 FORMERR received
 138 EDNS(0) query failures
   19648 truncated responses received
 379 lame delegations received
 8550865 query retries
 9401889 query timeouts
   15859 IPv4 NS address fetches
 581 IPv4 NS address fetch failed
  242332 queries with RTT < 10ms
  307416 queries with RTT 10-100ms
 5575709 queries with RTT 100-500ms
   46819 queries with RTT 500-800ms
1560 queries with RTT 800-1600ms
872
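
As a rough aid for reading a dump like the one above, the following sketch turns "count description" lines into a dictionary and derives two ratios from the resolver counters. The function name parse_named_stats is hypothetical, and the sample only repeats four lines from the dump. In a full dump the same description can appear in several sections or views, so a real parser would also need to track the section headers; this sketch does not.

```python
import re


def parse_named_stats(text):
    """Map counter description -> value for lines like ' 9401889 query timeouts'.

    Lines without a description (a bare number) and section headers are skipped.
    Later duplicates of a description overwrite earlier ones.
    """
    counters = {}
    for line in text.splitlines():
        m = re.match(r"^\s*(\d+)\s+(\S.*?)\s*$", line)
        if m:
            counters[m.group(2)] = int(m.group(1))
    return counters


# Four resolver counters copied from the statistics dump in this message.
SAMPLE = """
15583903 IPv4 queries sent
 6182728 IPv4 responses received
 8550865 query retries
 9401889 query timeouts
"""

stats = parse_named_stats(SAMPLE)
sent = stats["IPv4 queries sent"]
timeout_ratio = stats["query timeouts"] / sent
answer_ratio = stats["IPv4 responses received"] / sent
print(f"timeouts per query sent:  {timeout_ratio:.1%}")
print(f"responses per query sent: {answer_ratio:.1%}")
```

With these numbers roughly six in ten outgoing queries time out, which fits the description of a workload dominated by slow or unresponsive authoritative servers.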

Re: Recursive bind becomes unresponsive with high load

2016-03-31 Thread Mike Hoskins (michoski)
If you are crawling lots of new names, the cache size won't have much
impact.  Each new query will require recursing vs hitting the cache.  Try
"rndc recursing" and look at what you have sitting around waiting for
answers.  Hopefully that provides some clues.  This can be all sorts of
things like unresponsive auth servers, network issues, firewalls munging
EDNS, etc causing the recursive client backlog.



Re: Recursive bind becomes unresponsive with high load

2016-03-31 Thread Tony Finch
Michael Brunnbauer  wrote:
>
> I am using bind on a server that does massive crawling with a multithreaded
> Java app. This server occasionally has to do lookups for hosts in our local
> zone netestate.de - for which it is not authoritative - and those lookups tend
> to fail when the load is high (e.g. >1000 recursing clients). This suggests
> some kind of congestion.

Have you tried adjusting the clients-per-query and max-clients-per-query
options?

Tony.
-- 
f.anthony.n.finch  http://dotat.at/  -  I xn--zr8h punycode
Trafalgar: Northerly 5 to 7, occasionally gale 8 at first in northeast.
Moderate or rough, occasionally very rough at first in northeast. Showers.
Good.
___
Please visit https://lists.isc.org/mailman/listinfo/bind-users to unsubscribe 
from this list

bind-users mailing list
bind-users@lists.isc.org
https://lists.isc.org/mailman/listinfo/bind-users


Re: Recursive bind becomes unresponsive with high load

2016-03-31 Thread Michael Brunnbauer

Hello Mike,

On Thu, Mar 31, 2016 at 04:05:39PM +, Mike Hoskins (michoski) wrote:
> If you are crawling lots of new names, the cache size won't have much
> impact.  Each new query will require recursing vs hitting the cache.  Try
> "rndc recursing" and look at what you have sitting around waiting for
> answers.  Hopefully that provides some clues.  This can be all sorts of
> things like unresponsive auth servers, network issues, firewalls munging
> EDNS, etc causing the recursive client backlog.

Can a "recursive client backlog" be a problem if recursing clients is ca. 1000
while recursive-clients is 6000? If yes, where is the backlog? I can see it
in the syslog when recursive-clients is reached - this does not happen here.

Here are the first 10 lines. The other 995 lines all look like this.

;
; Recursing Queries
;
; client 127.0.0.1#40278: id 13156 'fnnd0u.ciptdd.cn/A/IN' requesttime 1459440503
; client 127.0.0.1#43457: id 30082 '6aj344.iqr8aop.cn/A/IN' requesttime 1459440503
; client 127.0.0.1#55751: id 58170 'g1zdo7.02fucag.cn/A/IN' requesttime 1459440503
; client 127.0.0.1#38696: id 62912 'v6mzb.566095.top/A/IN' requesttime 1459440504
; client 127.0.0.1#38585: id 17254 'l3ay0.688903.top/A/IN' requesttime 1459440504
; client 127.0.0.1#47576: id 24940 '0h8xi.866099.top/A/IN' requesttime 1459440504
; client 127.0.0.1#38195: id 25054 'oipy2.spwgm89.com/A/IN' requesttime 1459440504

There are only 2 requests for .de domains in the queue so the failing requests
for netestate.de cannot be explained by a rate limiting of the .de nameservers.
What are the current rate limits for TLD nameservers anyway? I wonder how fast
a single bind instance should hammer them.
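
Counting the queue by TLD, as done by hand above, can be sketched as follows. This assumes each entry of the "rndc recursing" dump sits on a single line in the format shown in this message; the function name count_by_tld is hypothetical, and the sample is just the seven entries quoted here.

```python
import re
from collections import Counter

# Matches the quoted 'name/type/class' triple in each "; client ..." line.
QUERY_RE = re.compile(r"'([^'/]+)/([A-Za-z0-9]+)/([A-Za-z0-9]+)'")


def count_by_tld(dump_text):
    """Return a Counter of top-level domains among the recursing queries."""
    counts = Counter()
    for line in dump_text.splitlines():
        m = QUERY_RE.search(line)
        if not m:
            continue  # comment lines and headers carry no query triple
        name = m.group(1).rstrip(".")
        tld = name.rsplit(".", 1)[-1].lower() if name else "."
        counts[tld] += 1
    return counts


SAMPLE = """\
;
; Recursing Queries
;
; client 127.0.0.1#40278: id 13156 'fnnd0u.ciptdd.cn/A/IN' requesttime 1459440503
; client 127.0.0.1#43457: id 30082 '6aj344.iqr8aop.cn/A/IN' requesttime 1459440503
; client 127.0.0.1#55751: id 58170 'g1zdo7.02fucag.cn/A/IN' requesttime 1459440503
; client 127.0.0.1#38696: id 62912 'v6mzb.566095.top/A/IN' requesttime 1459440504
; client 127.0.0.1#38585: id 17254 'l3ay0.688903.top/A/IN' requesttime 1459440504
; client 127.0.0.1#47576: id 24940 '0h8xi.866099.top/A/IN' requesttime 1459440504
; client 127.0.0.1#38195: id 25054 'oipy2.spwgm89.com/A/IN' requesttime 1459440504
"""

tld_counts = count_by_tld(SAMPLE)
for tld, n in tld_counts.most_common():
    print(f"{n:6d} {tld}")
```

Run against the whole dump file, the same count shows at a glance which TLDs dominate the backlog.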

Our database is cluttered with Chinese link farms and the DNS queries for them
tend to fail early and often or take a long time. I may be able to address
this in some way so that those queries are reduced but I would also like to
have a DNS server that can handle high load and it seems my current setup is
lacking. 

Regards,

Michael Brunnbauer

-- 
++  Michael Brunnbauer
++  netEstate GmbH
++  Geisenhausener Straße 11a
++  81379 München
++  Tel +49 89 32 19 77 80
++  Fax +49 89 32 19 77 89 
++  E-Mail bru...@netestate.de
++  http://www.netestate.de/
++
++  Sitz: München, HRB Nr.142452 (Handelsregister B München)
++  USt-IdNr. DE221033342
++  Geschäftsführer: Michael Brunnbauer, Franz Brunnbauer
++  Prokurist: Dipl. Kfm. (Univ.) Markus Hendel



Re: Recursive bind becomes unresponsive with high load

2016-03-31 Thread Michael Brunnbauer

Hello Tony,

On Thu, Mar 31, 2016 at 05:08:43PM +0100, Tony Finch wrote:
> Michael Brunnbauer  wrote:
> >
> > I am using bind on a server that does massive crawling with a multithreaded
> > Java app. This server occasionally has to do lookups for hosts in our local
> > zone netestate.de - for which it is not authoritative - and those lookups 
> > tend
> > to fail when the load is high (e.g. >1000 recursing clients). This suggests
> > some kind of congestion.
> 
> Have you tried adjusting the clients-per-query and max-clients-per-query
> options?

I just tried 

clients-per-query 250;
max-clients-per-query 1000;

and it did not help. I noticed a drop in the number of recursing clients and
the number of queries/s when a lookup failed for the local zone but named
logged no special messages in the syslog.

Is it possible that this is connected to rndc stats? I will stop doing
rndc stats for a while to test (it currently runs every minute).

BTW: I also excluded packet loss as cause of the problem.

Regards,

Michael Brunnbauer


Re: Recursive bind becomes unresponsive with high load

2016-03-31 Thread sthaug
> > If you are crawling lots of new names, the cache size won't have much
> > impact.  Each new query will require recursing vs hitting the cache.  Try
> > "rndc recursing" and look at what you have sitting around waiting for
> > answers.  Hopefully that provides some clues.  This can be all sorts of
> > things like unresponsive auth servers, network issues, firewalls munging
> > EDNS, etc causing the recursive client backlog.
> 
> Can a "recursive client backlog" be a problem if recursing clients is ca. 1000
> while recursive-clients is 6000? If yes, where is the backlog? I can see it
> in the syslog when recursive-clients is reached - this does not happen here.

Have you checked your operating system limits? One recursive client
often means one open socket (waiting for response from authoritative
server), i.e. one open file descriptor. If you have thousands of
simultaneous recursive clients, you will need a correspondingly large
file descriptor limit for the named process.

Remember that a (presumed) authoritative server which is slow to
answer means that the socket may be held open correspondingly long.
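
One way to check the limit described above on a Linux host is to compare the "Max open files" row of /proc/<pid>/limits with the number of entries in /proc/<pid>/fd. The sketch below assumes Linux and enough privileges to read those files for the named process; the function names and the sample limits text are illustrative, not taken from the server in this thread.

```python
import os


def _limit_value(field):
    """Convert one limits field to int; 'unlimited' becomes None."""
    return None if field == "unlimited" else int(field)


def parse_max_open_files(limits_text):
    """Return (soft, hard) from the 'Max open files' row of /proc/<pid>/limits."""
    prefix = "Max open files"
    for line in limits_text.splitlines():
        if line.startswith(prefix):
            soft, hard = line[len(prefix):].split()[:2]
            return _limit_value(soft), _limit_value(hard)
    raise ValueError("no 'Max open files' row found")


def fd_usage(pid):
    """Return (open_fds, soft_limit) for a process. Linux only."""
    with open(f"/proc/{pid}/limits") as f:
        soft, _hard = parse_max_open_files(f.read())
    return len(os.listdir(f"/proc/{pid}/fd")), soft


# Made-up sample in the layout of /proc/<pid>/limits.
SAMPLE_LIMITS = """\
Limit                     Soft Limit           Hard Limit           Units
Max cpu time              unlimited            unlimited            seconds
Max open files            65536                65536                files
"""

print(parse_max_open_files(SAMPLE_LIMITS))
```

Calling fd_usage with the pid of named while the crawler is running shows how close the process actually gets to its soft limit.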

Steinar Haug, Nethelp consulting, sth...@nethelp.no


Re: BIND started replying to queries for .com with .COM

2016-03-31 Thread Robert Edmonds
Tony Finch wrote:
> Phil Mayers  wrote:
> >
> > What is considered the source of the ownername for, say, "com."?
> 
> It should be the root zone master file.

Why not the com zone master file?

-- 
Robert Edmonds


Re: Recursive bind becomes unresponsive with high load

2016-03-31 Thread Michael Brunnbauer

Hello Steinar,

On Thu, Mar 31, 2016 at 07:35:39PM +0200, sth...@nethelp.no wrote:
> Have you checked your operating system limits? One recursive client
> often means one open socket (waiting for response from authoritative
> server), i.e. one open file descriptor. If you have thousands of
> simultaneous recursive clients, you will need a correspondingly large
> file descriptor limit for the named process.
> 
> Remember that a (presumed) authoritative server which is slow to
> answer means that the socket may be held open correspondingly long.

The process can use up to 65536 file descriptors and the failed lookups
also occur in other processes that only do one lookup.

Regards,

Michael Brunnbauer


Re: Recursive bind becomes unresponsive with high load

2016-03-31 Thread Michael Brunnbauer

hi all,

On Thu, Mar 31, 2016 at 07:32:21PM +0200, Michael Brunnbauer wrote:
> Is it possible that this is connected to rndc stats? I will stop doing
> rndc stats for a while to test (it currently runs every minute).

Not doing rndc stats did not prevent the problem. Any other ideas?

Regards,

Michael Brunnbauer


Re: Recursive bind becomes unresponsive with high load

2016-03-31 Thread John Miller
On Thu, Mar 31, 2016 at 2:00 PM, Michael Brunnbauer  wrote:
>
> hi all,
>
> On Thu, Mar 31, 2016 at 07:32:21PM +0200, Michael Brunnbauer wrote:
>> Is it possible that this is connected to rndc stats? I will stop doing
>> rndc stats for a while to test (it currently runs every minute).
>
> Not doing rndc stats did not prevent the problem. Any other ideas?

Hi Michael,

Are you doing query logging on this box? If so, you might check for
messages such as:

named[743]: clients-per-query decreased to 17

I know you tried setting max-clients-per-query earlier, but since this
is for a locally-hosted zone, query volume could be quite high
somewhere along the way.  Likewise, you might run

rndc status

and see what you get.

Something else you might try: if you don't already, turn on
server-statistics/statistics-channels:

https://kb.isc.org/article/AA-00769/0/Using-BINDs-XML-statistics-channels.html

You may get what you're looking for; you may not.
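
Checking the logs for the clients-per-query messages mentioned above can be sketched like this. The "decreased to" wording is taken from the example line in this message; the "increased to" variant is an assumption about the corresponding message for the opposite adjustment. The function name and the sample lines are illustrative.

```python
import re

# "decreased" is the form quoted above; "increased" is assumed to exist as well.
CPQ_RE = re.compile(
    r"named\[\d+\]: clients-per-query (decreased|increased) to (\d+)"
)


def clients_per_query_changes(log_lines):
    """Return a list of (direction, new_value) for each matching log line."""
    changes = []
    for line in log_lines:
        m = CPQ_RE.search(line)
        if m:
            changes.append((m.group(1), int(m.group(2))))
    return changes


SAMPLE_LOG = [
    "Mar 31 14:04:51 bardolino named[1506]: using up to 21000 sockets",
    "named[743]: clients-per-query decreased to 17",
]

changes = clients_per_query_changes(SAMPLE_LOG)
print(changes)
```

Feeding it the lines of the syslog file instead of SAMPLE_LOG gives a quick history of how named adjusted the value over time.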

John


Re: BIND started replying to queries for .com with .COM

2016-03-31 Thread Daniel Stirnimann
Hi Mike,

When BIND first introduced Case-Insensitive Response Compression
(see https://kb.isc.org/article/AA-01113), I found out that the case of
the zone name in a zone statement takes precedence over the case of the
name in the zone itself.

So, you can get a google.COM answer because the zone statement at the
authoritative com name server you were talking to is "COM.".
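
Since an answer can legitimately come back as google.COM, a probe like the one discussed in this thread would need to compare names without regard to case. A minimal sketch, assuming the names are plain ASCII (IDNs already in their xn-- form); the function name dns_names_equal is hypothetical and not part of any monitoring tool mentioned here.

```python
def dns_names_equal(a: str, b: str) -> bool:
    """Compare two domain names ignoring ASCII case and one trailing dot.

    Non-ASCII input raises UnicodeEncodeError instead of being guessed at.
    """
    def normalize(name: str) -> bytes:
        # bytes.lower() only changes ASCII letters, as DNS comparison does.
        return name.encode("ascii").rstrip(b".").lower()

    return normalize(a) == normalize(b)


print(dns_names_equal("google.com", "google.COM"))
print(dns_names_equal("google.com", "google.org"))
```

A probe that compares the owner name in the answer with the name it asked for through a helper like this would not have raised an alarm in the case described.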

Daniel

On 30.03.16 23:21, Mike Bernhardt wrote:
> I think you misunderstood me. I was getting back google.COM even when I
> queried the server using nslookup from a command prompt on my Windows
> desktop. The probe was failing because it is case-sensitive, but that was
> the symptom, not the problem.
> 
> For example:
>> google.com
> Server:  athena.bart.gov
> Address:  148.165.30.30
> 
> Non-authoritative answer:
> Name:    google.COM
> Addresses:  2607:f8b0:4005:801::200e
>   172.217.3.46
> 
> Given that the problem cleared after restarting BIND on its CentOS host, I'd
> say the problem was BIND.
> 
> -----Original Message-----
> From: Mark Andrews [mailto:ma...@isc.org] 
> Sent: Tuesday, March 29, 2016 5:19 PM
> To: Mike Bernhardt
> Cc: bind-us...@isc.org
> Subject: Re: BIND started replying to queries for .com with .COM
> 
> 
> Your monitoring probe is broken.
> 
> STD 13 says that the DNS is case preserving.  The problem is that lots
> of servers aren't case preserving; instead they echo back the query case in
> the owner names of the records returned, which named then records.
> 
> In message <030101d18a06$fa21c8d0$ee655a70$@bart.gov>, "Mike Bernhardt"
> writes:
>> I rebooted one of our BIND VMs this morning. It's running BIND 
>> 9.10.3-P3. We noticed that queries for domains with domain.com were 
>> answered with domain.COM with the .COM in capital letters. Other 
>> high-levels like .org were not changed. It caused a monitoring probe 
>> to complain because it wasn't getting back what it asked for.
>>
>> Restarting bind on this server fixed the problem. Any ideas on what 
>> happened, or where to look?
>>
> --
> Mark Andrews, ISC
> 1 Seymour St., Dundas Valley, NSW 2117, Australia
> PHONE: +61 2 9871 4742 INTERNET: ma...@isc.org
> 