Re: Frequent timeout

2018-09-11 Thread Alex
Hi,

Here is a much more reasonable network capture during the period where
there are numerous SERVFAIL errors from bind over a short period of
high utilization.
https://drive.google.com/file/d/1UrzvB-pumVjPvlmd6ZSnHi-XVynI8y3y/view?usp=sharing

This is when our 20 Mb/s cable upstream link was saturated, which resulted
in DNS query timeouts and, in turn, these SERVFAIL messages.

The packet trace shows multiple TCP out-of-order and TCP Dup ACK
packets. Would these retransmits cause enough of a delay for the
queries to fail?

Would someone more knowledgeable look into these packet errors for me?

It might seem obvious that we should increase the bandwidth of our
link, since the failures occur during periods of high utilization, but
they don't occur on our other 10 Mb/s DIA links in the datacenter when
those links are saturated.

11-Sep-2018 11:53:25.692 query-errors: info: client @0x7fc7ef343740
127.0.0.1#50821
(8cb54bfffc54eee06342d5619246d67166abc6cf.ebl.msbl.org): query failed
(SERVFAIL) for 8cb54bfffc54eee06342d5619246d67166abc6cf.ebl.msbl.org/IN/A
at ../../../bin/named/query.c:8580

11-Sep-2018 11:53:25.687 query-errors: debug 2: fetch completed at
../../../lib/dns/resolver.c:3927 for
ac949d5d947f8f5cad13e98c68bac6f284c367fd.ebl.msbl.org/A in 30.84:
timed out/success
[domain:ebl.msbl.org,referral:0,restart:6,qrysent:11,timeout:10,lame:0,quota:0,neterr:0,badresp:0,adberr:0,findfail:0,valfail:0]
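The bracketed counters at the end of that "fetch completed" line are easier to compare across many log entries once they are parsed. The following is only a minimal sketch, assuming the log line has been joined back into a single string; the function name parse_fetch_stats is made up for illustration and is not part of BIND.

```python
import re

# Matches the bracketed counter list at the end of a "fetch completed" line.
_STATS_RE = re.compile(r"\[([^\]]+)\]")


def parse_fetch_stats(line):
    """Return the key:value counters from a 'fetch completed' log line."""
    match = _STATS_RE.search(line)
    if match is None:
        return {}
    stats = {}
    for field in match.group(1).split(","):
        key, _, value = field.partition(":")
        # Counters are integers; the domain stays a string.
        stats[key] = int(value) if value.isdigit() else value
    return stats


# The log entry from this message, joined into one line.
line = ("fetch completed at ../../../lib/dns/resolver.c:3927 for "
        "ac949d5d947f8f5cad13e98c68bac6f284c367fd.ebl.msbl.org/A in 30.84: "
        "timed out/success [domain:ebl.msbl.org,referral:0,restart:6,"
        "qrysent:11,timeout:10,lame:0,quota:0,neterr:0,badresp:0,adberr:0,"
        "findfail:0,valfail:0]")
stats = parse_fetch_stats(line)
print(stats["domain"], stats["qrysent"], stats["timeout"])
```

For this entry the counters read qrysent:11 and timeout:10, so all but one of the upstream queries for this name timed out.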

Thanks,
Alex

On Mon, Sep 10, 2018 at 12:11 PM Alex wrote:
>
> Hi,
>
> > >> tcpdump -s0 -n -i eth0 port domain -w /tmp/domaincapture.pcap
> > >>
> > >> You don't need all of the extra stuff because -s0 captures the full 
> > >> packet.
> >
> > On 06.09.18 18:42, Alex wrote:
> > >This is the command I ran to produce the pcap file I sent:
> > >
> > ># tcpdump -s0 -vv -i eth0 -nn -w domain-capture-eth0-090518.pcap udp
> > >dst port domain
> >
> > and that is the problem. "dst port domain" captures packets going to DNS
> > servers, not responses coming back.
> >
> > "-vv" and "-nn" are useless when producing a packet capture, and "-s0" has
> > been the default for some time. I often add "-U" so the file is flushed
> > with each packet.
> >
> > you can strip incoming queries by using filter
> >
> > "(src host 68.195.XXX.45 and dst port domain) or (src port domain and dst 
> > host 68.195.XXX.45)"
>
> I've generated a new tcpdump file using these criteria and uploaded it here:
> https://drive.google.com/file/d/1F0VML8yPZJbcDZTys2hXDhjzv1UaBHuV/view?usp=sharing
>
> The SERVFAIL errors didn't really occur over the weekend. I believe it
> has something to do with mail volume, link congestion/bandwidth
> utilization.
>
> Thanks,
> Alex
>
>
>
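The capture filter quoted above can be generated for any resolver address so that the same capture is easy to repeat on other hosts. This is a rough sketch only: the helper names are hypothetical, 192.0.2.45 is a documentation address standing in for the masked 68.195.XXX.45, and the interface and output path are simply the ones quoted in this thread.

```python
import shlex


def dns_capture_filter(resolver_ip):
    """BPF expression: queries sent by the resolver plus the replies to it."""
    return (f"(src host {resolver_ip} and dst port domain) or "
            f"(src port domain and dst host {resolver_ip})")


def tcpdump_command(resolver_ip, interface, outfile):
    """Argument list for tcpdump: -n skips name lookups, -U flushes per packet."""
    return ["tcpdump", "-n", "-U", "-i", interface, "-w", outfile,
            dns_capture_filter(resolver_ip)]


# Placeholder address; substitute the resolver's real outbound IP.
argv = tcpdump_command("192.0.2.45", "eth0", "/tmp/domaincapture.pcap")
print(shlex.join(argv))
```

Passing the filter as a single argument keeps the parentheses away from the shell, which is why the command is built as a list rather than as one string.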
___
Please visit https://lists.isc.org/mailman/listinfo/bind-users to unsubscribe 
from this list

bind-users mailing list
bind-users@lists.isc.org
https://lists.isc.org/mailman/listinfo/bind-users


Re: Frequent timeout

2018-09-11 Thread Carl Byington
-BEGIN PGP SIGNED MESSAGE-
Hash: SHA512

On Tue, 2018-09-11 at 14:19 -0400, Alex wrote:
> This is when our 20 Mb/s cable upstream link was saturated, which resulted
> in DNS query timeouts and, in turn, these SERVFAIL messages.

Not specific to DNS, but this looks like a bufferbloat problem, which is
common with cable modems. When the upstream link is saturated, the
buffers in the interface device (cable modem or possibly a standalone
router) fill up. If there is a lot of buffer space, the latency
becomes very large, and that causes many problems, including issues
with DNS. A partial fix is to prioritize small packets such as DNS queries
and TCP ACKs, so they don't wait behind a long queue of full-size
packets. A more complete fix is switching to the fq_codel queue discipline.

Search for "bufferbloat" for more details.
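To put rough numbers on this: a packet that arrives behind a full FIFO buffer waits for the whole buffer to drain at the link rate. The sketch below is only back-of-the-envelope arithmetic, reading the "20mbs" upstream figure from this thread as 20 Mb/s; the buffer sizes are hypothetical and were not measured on any device.

```python
def queue_delay_seconds(buffer_bytes, link_bits_per_second):
    """Time for a packet to drain through a full FIFO buffer ahead of it."""
    return buffer_bytes * 8 / link_bits_per_second


LINK_BPS = 20_000_000  # assumed: the upstream rate mentioned in this thread

# Hypothetical buffer sizes, chosen only to show how the delay scales.
for buffer_bytes in (64_000, 256_000, 1_000_000, 2_500_000):
    delay_ms = queue_delay_seconds(buffer_bytes, LINK_BPS) * 1000
    print(f"{buffer_bytes:>9} bytes queued -> {delay_ms:.1f} ms added latency")
```

The delay grows linearly with the amount of queued data, so a device with a few megabytes of buffer can add on the order of a second to every DNS round trip while the link is saturated.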


-BEGIN PGP SIGNATURE-
Version: GnuPG v2.0.14 (GNU/Linux)

iEYEAREKAAYFAluYDHMACgkQL6j7milTFsEqXwCffaR+fwcqpoEHPisw86Q49+Kw
o0cAn0Q5LV1FXk2r1fiTqYZIlsa9xH3s
=yp3H
-END PGP SIGNATURE-




RE: Frequent timeout

2018-09-11 Thread John W. Blue
If you use Wireshark to slice and dice the pcap, "dns.flags.rcode == 2" shows
that all of your SERVFAILs happen on localhost.

If you switch to "dns.qry.name == storage.pardot.com", every single query is on
localhost.

Unless you have another NIC that you are sending traffic over, this does not
look like a bandwidth issue at this particular point in time.
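For anyone scripting this instead of using the Wireshark GUI, the rcode filter corresponds to the low four bits of the flags word in the DNS header, where 2 means SERVFAIL. The following is a minimal sketch that decodes a raw DNS message with the standard library only; the header bytes are hand-built for illustration and do not come from the capture.

```python
import struct

SERVFAIL = 2  # RCODE value matched by "dns.flags.rcode == 2"


def dns_rcode(message):
    """Return (is_response, rcode) from the 12-byte DNS header."""
    if len(message) < 12:
        raise ValueError("DNS message shorter than its 12-byte header")
    _msg_id, flags = struct.unpack("!HH", message[:4])
    is_response = bool(flags & 0x8000)  # QR bit
    rcode = flags & 0x000F              # low four bits of the flags word
    return is_response, rcode


# Hand-built header: ID 0x1234, flags 0x8182 (QR=1, RD=1, RA=1, RCODE=2),
# one question, no answer/authority/additional records.
header = struct.pack("!HHHHHH", 0x1234, 0x8182, 1, 0, 0, 0)
is_response, rcode = dns_rcode(header)
print(is_response, rcode == SERVFAIL)
```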

John



Re: Frequent timeout

2018-09-11 Thread Alex
Hi,

On Tue, Sep 11, 2018 at 2:47 PM John W. Blue wrote:
>
> If you use Wireshark to slice and dice the pcap, "dns.flags.rcode == 2" shows
> that all of your SERVFAILs happen on localhost.
>
> If you switch to "dns.qry.name == storage.pardot.com", every single query is on
> localhost.
>
> Unless you have another NIC that you are sending traffic over, this does not
> look like a bandwidth issue at this particular point in time.

Thanks so much. I think I may also have confused things by suggesting
it was related to bandwidth or utilization. I now see it happening
more regularly, too.

Can you ascertain why it is reporting these SERVFAILs?

The queries are on localhost because /etc/resolv.conf lists localhost
as the nameserver. Is that why we can't diagnose this? This most
recent packet trace was started with "-i any". Why would the ones on
localhost be the ones which are failing? I'm assuming postfix and/or
some other process is querying bind on localhost to cause these
errors?


RE: Frequent timeout

2018-09-11 Thread John W. Blue
I will walk back my previous comments and just say that bandwidth may be in
play, because any time you soak a circuit it is not good.

Take a look at this query sequence:

dns.qry.type == 28 && dns.qry.name == concured.co

Packet 42356 shows an AAAA query for concured.co.
Packets 42357/8 show 68.195.193.45 relaying the query to 62.138.132.21.
Packets 43015/16 show 62.138.132.21 replying with its query response to 
68.195.193.45.

And that's it.  Nothing is seen being sent back to 127.0.0.1, at least on the
wire.  By way of comparison, packet 161 shows 127.0.0.1 answering itself, so I
would consider the missing response here a clue.

Moving on:

Packet 48874 shows 127.0.0.1 asking for an AAAA record again.
This time we don’t see any external communication.
Packet 87174 shows 127.0.0.1 replying with server failure.

It took nearly 25 seconds to decide upon a SERVFAIL and that is another clue.
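The kind of check described above (which localhost queries got no reply, and which took a long time to end in SERVFAIL) can be automated once the packets are reduced to simple records. This is a sketch under stated assumptions: the record layout, the function name and the sample events are all hypothetical, loosely modelled on the sequence described here rather than extracted from the pcap.

```python
def unanswered_or_slow(events, slow_after=5.0):
    """Pair queries with responses and report the problem cases.

    events: iterable of (timestamp, kind, client, dns_id, qname, rcode),
            where kind is "query" or "response", sorted by timestamp.
    Returns a list of (qname, status, elapsed_seconds_or_None).
    """
    pending = {}
    report = []
    for ts, kind, client, dns_id, qname, rcode in events:
        key = (client, dns_id, qname)
        if kind == "query":
            pending[key] = ts
        elif key in pending:
            elapsed = ts - pending.pop(key)
            if rcode == 2:                 # SERVFAIL
                report.append((qname, "SERVFAIL", elapsed))
            elif elapsed > slow_after:
                report.append((qname, "slow", elapsed))
    # Anything still pending never saw a response in the capture.
    report.extend((key[2], "no response", None) for key in pending)
    return report


# Hypothetical events: one query never answered, one answered with
# SERVFAIL after roughly 25 seconds.
events = [
    (0.0, "query", "127.0.0.1#40001", 1, "concured.co", None),
    (1.0, "query", "127.0.0.1#40002", 2, "concured.co", None),
    (25.8, "response", "127.0.0.1#40002", 2, "concured.co", 2),
]
for row in unanswered_or_slow(events):
    print(row)
```

Keying on client address and port as well as the DNS ID avoids pairing a response with an unrelated query that happens to reuse the same ID.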

That said, there are heaps of queries where DNS worked as expected.  I really had
to dig for the above examples because it seems like the vast majority of the
server failure messages either do not get a reply on localhost or we don’t
see the routable adapter on the server attempting to reach out to get the
answer.  concured.co is unique in that we see both cases: one query where the
server reaches out and one where it makes no attempt.

If the traffic that 127.0.0.1 is putting on the wire does not go out, I am
thinking firewall, but you may be dealing with bandwidth exhaustion exclusively
and it is presenting itself in this manner.  Or you may have a server
configuration issue or a server that is underpowered.

Sometimes pcaps are black and white and give you a "here is your problem"
answer, and other times it is like this, where the capture does not give us
anything conclusive to work with.  Since this server is sputtering, I would set
about first stabilizing traffic from 127.0.0.1 going out.  You need to see
each outbound query hit 127.0.0.1 and then hit your external adapter without fail.
Boom, boom, boom on down the line.

Hopefully others have better, more insightful suggestions.

Good hunting!

John
