Re: Frequent timeout
Hi,

Here is a much more reasonable network capture, taken during a short period of high utilization when bind logged numerous SERVFAIL errors:

https://drive.google.com/file/d/1UrzvB-pumVjPvlmd6ZSnHi-XVynI8y3y/view?usp=sharing

This is when our 20 Mbps cable upstream link was saturated, which resulted in DNS query timeouts and, in turn, these SERVFAIL messages.

The packet trace shows multiple TCP out-of-order and TCP Dup ACK packets. Would these retransmits cause enough of a delay for the queries to fail? Would someone more knowledgeable look into these packet errors for me?

It might seem obvious that we should increase the bandwidth of our link, since the errors occur during periods of high utilization, but they don't occur on our other 10 Mbps DIA links in the datacenter when those links are saturated.

11-Sep-2018 11:53:25.692 query-errors: info: client @0x7fc7ef343740 127.0.0.1#50821 (8cb54bfffc54eee06342d5619246d67166abc6cf.ebl.msbl.org): query failed (SERVFAIL) for 8cb54bfffc54eee06342d5619246d67166abc6cf.ebl.msbl.org/IN/A at ../../../bin/named/query.c:8580

11-Sep-2018 11:53:25.687 query-errors: debug 2: fetch completed at ../../../lib/dns/resolver.c:3927 for ac949d5d947f8f5cad13e98c68bac6f284c367fd.ebl.msbl.org/A in 30.84: timed out/success [domain:ebl.msbl.org,referral:0,restart:6,qrysent:11,timeout:10,lame:0,quota:0,neterr:0,badresp:0,adberr:0,findfail:0,valfail:0]

Thanks,
Alex

On Mon, Sep 10, 2018 at 12:11 PM Alex wrote:
> Hi,
>
> > >> tcpdump -s0 -n -i eth0 port domain -w /tmp/domaincapture.pcap
> > >>
> > >> You don't need all of the extra stuff because -s0 captures the full
> > >> packet.
> >
> > On 06.09.18 18:42, Alex wrote:
> > >This is the command I ran to produce the pcap file I sent:
> > >
> > ># tcpdump -s0 -vv -i eth0 -nn -w domain-capture-eth0-090518.pcap udp dst port domain
> >
> > and that is the problem. "dst port domain" captures packets going to DNS
> > servers, not responses coming back.
> >
> > "-vv" and "-nn" are useless when producing packet capture and "-s0" is
> > default for some time. I often add "-U" so file is flushed with each packet.
> >
> > you can strip incoming queries by using filter
> >
> > "(src host 68.195.XXX.45 and dst port domain) or (src port domain and dst
> > host 68.195.XXX.45)"
>
> I've generated a new tcpdump file using these criteria and uploaded it here:
> https://drive.google.com/file/d/1F0VML8yPZJbcDZTys2hXDhjzv1UaBHuV/view?usp=sharing
>
> The SERVFAIL errors didn't really occur over the weekend. I believe it
> has something to do with mail volume, link congestion/bandwidth
> utilization.
>
> Thanks,
> Alex

___
Please visit https://lists.isc.org/mailman/listinfo/bind-users to unsubscribe from this list

bind-users mailing list
bind-users@lists.isc.org
https://lists.isc.org/mailman/listinfo/bind-users
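For anyone reading along, the bracketed counters on the "fetch completed" log line are easier to compare across many failures once they are pulled into a structure. The following is only a sketch: the regular expression is written against the single line quoted in this message, and the helper name parse_fetch is made up for illustration, so other bind versions or log formats may need adjustments.

```python
import re

# The "fetch completed" line quoted in the message, copied verbatim.
LINE = ("11-Sep-2018 11:53:25.687 query-errors: debug 2: fetch completed at "
        "../../../lib/dns/resolver.c:3927 for "
        "ac949d5d947f8f5cad13e98c68bac6f284c367fd.ebl.msbl.org/A in 30.84: "
        "timed out/success [domain:ebl.msbl.org,referral:0,restart:6,qrysent:11,"
        "timeout:10,lame:0,quota:0,neterr:0,badresp:0,adberr:0,findfail:0,valfail:0]")

# Assumption: this pattern only mirrors the layout of the line above.
FETCH_RE = re.compile(
    r"fetch completed at \S+ for (?P<name>\S+)/(?P<rtype>\w+) "
    r"in (?P<seconds>[\d.]+): (?P<result>[^\[]+)\[(?P<stats>[^\]]*)\]"
)

def parse_fetch(line):
    """Return the name, duration and counters of a fetch line, or None."""
    m = FETCH_RE.search(line)
    if m is None:
        return None
    stats = {}
    for item in m.group("stats").split(","):
        key, _, value = item.partition(":")
        # Counters are integers; "domain" stays a string.
        stats[key] = int(value) if value.isdigit() else value
    return {
        "name": m.group("name"),
        "rtype": m.group("rtype"),
        "seconds": float(m.group("seconds")),
        "result": m.group("result").strip(),
        "stats": stats,
    }

info = parse_fetch(LINE)
print(info["seconds"], info["stats"]["qrysent"], info["stats"]["timeout"])
```

Read at face value, the counters suggest that the resolver sent 11 queries toward ebl.msbl.org and counted 10 timeouts over roughly 30.84 seconds, with no network errors (neterr:0) or bad responses (badresp:0) recorded.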
Re: Frequent timeout
On Tue, 2018-09-11 at 14:19 -0400, Alex wrote:
> This is when our 20mbs cable upstream link was saturated and resulted
> in DNS query timeout errors. resulting in these SERVFAIL messages.

Not specific to DNS, but this looks like a bufferbloat problem, which is common with cable modems. When the upstream link is saturated, the buffers in the interface device (cable modem or possibly a standalone router) fill up. If there is a lot of buffer space, latency becomes very large, and that causes many problems, including issues with DNS.

A partial fix is to prioritize small packets such as DNS queries and TCP ACKs, so they don't wait behind a long queue of full-size packets. A more complete fix is switching to the fq_codel queue discipline. Search for "bufferbloat" for more details.
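A rough sketch of the arithmetic behind the bufferbloat explanation: the time a packet waits behind a full buffer is the buffer size divided by the link rate. The 20 Mbit/s rate is a reading of the "20mbs" upstream link mentioned in the thread, and the buffer sizes are purely hypothetical, since nothing in the thread says how much buffering the modem actually has.

```python
def queue_delay_seconds(buffer_bytes, link_bits_per_second):
    """Time for a full buffer to drain at the given link rate."""
    return buffer_bytes * 8 / link_bits_per_second

UPSTREAM_BPS = 20_000_000  # assumption: "20mbs" means 20 Mbit/s

# Hypothetical buffer sizes, for illustration only.
for label, size in [("64 KiB", 64 * 1024),
                    ("256 KiB", 256 * 1024),
                    ("1 MiB", 1024 * 1024),
                    ("4 MiB", 4 * 1024 * 1024)]:
    delay_ms = queue_delay_seconds(size, UPSTREAM_BPS) * 1000
    print(f"{label:>8} buffer adds about {delay_ms:.0f} ms when full")
```

The point of the sketch is only that the added delay grows linearly with buffer size, so a small DNS query stuck behind a large, full queue waits as long as every other packet unless it is prioritized or the queue is managed.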
RE: Frequent timeout
If you use Wireshark to slice and dice the pcap, "dns.flags.rcode == 2" shows that all of your SERVFAILs happen on localhost.

If you switch to "dns.qry.name == storage.pardot.com", every single query is on localhost.

Unless you have another NIC that you are sending traffic over, this does not look like a bandwidth issue at this particular point in time.

John
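The same check can be scripted once the responses have been exported from the capture. This is a minimal sketch over hand-written rows, not over the real pcap: the tuples below are hypothetical stand-ins for decoded DNS responses, and the function name is invented for illustration. RCODE 2 is the SERVFAIL value that the "dns.flags.rcode == 2" filter matches.

```python
from collections import Counter

SERVFAIL = 2  # DNS RCODE 2, as matched by "dns.flags.rcode == 2"

# Hypothetical decoded responses: (source address, destination address, rcode)
responses = [
    ("127.0.0.1", "127.0.0.1", 2),
    ("62.138.132.21", "68.195.193.45", 0),
    ("127.0.0.1", "127.0.0.1", 2),
    ("127.0.0.1", "127.0.0.1", 0),
]

def servfail_by_destination(rows):
    """Count SERVFAIL responses per destination (client) address."""
    return Counter(dst for _src, dst, rcode in rows if rcode == SERVFAIL)

print(servfail_by_destination(responses))
```

If every SERVFAIL lands on 127.0.0.1, as in the capture discussed here, the failures are being reported by the local resolver to local clients rather than arriving from upstream servers.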
Re: Frequent timeout
Hi,

On Tue, Sep 11, 2018 at 2:47 PM John W. Blue wrote:
> If you use wireshark to slice n dice the pcap .. "dns.flags.rcode == 2" shows
> all of your SERVFAIL happens on localhost.
>
> If you switch to "dns.qry.name == storage.pardot.com" every single query is
> localhost.
>
> Unless you have another NIC that you are sending traffic over this does not
> look like a bandwidth issue at this particular point in time.

Thanks so much. I think I may also have confused things by suggesting it was related to bandwidth or utilization. I now see it happening more regularly, too.

Can you ascertain why it is reporting these SERVFAILs? The queries are on localhost because /etc/resolv.conf lists localhost as the nameserver. Is that why we can't diagnose this? This most recent packet trace was started with "-i any".

Why would the ones on localhost be the ones that are failing? I'm assuming postfix and/or some other process is querying bind on localhost and triggering these errors?
RE: Frequent timeout
I will walk back my previous comments and just say that bandwidth may be in play, because any time you soak a circuit it is not good.

Take a look at this query sequence:

dns.qry.type == 28 && dns.qry.name == concured.co

Packet 42356 shows a query for concured.co.

Packets 42357/8 show 68.195.193.45 relaying the query to 62.138.132.21.

Packets 43015/16 show 62.138.132.21 replying with its query response to 68.195.193.45.

And that's it. Nothing is seen being sent back to 127.0.0.1, at least on the wire. By way of comparison, packet 161 shows 127.0.0.1 answering itself, so I would consider the previous missing response a clue.

Moving on:

Packet 48874 shows 127.0.0.1 asking for a record again. This time we don't see any external communication.

Packet 87174 shows 127.0.0.1 replying with server failure.

It took nearly 25 seconds to decide upon a SERVFAIL, and that is another clue.

That said, there are heaps of queries where DNS worked as expected. I really had to dig for the above examples, because for the vast majority of the server failure messages either no reply is seen on localhost or we don't see the routable adapter on the server attempting to reach out to get the answer. concured.co is unique in that we see the attempt to reach out but no answer passed back to 127.0.0.1.

If the traffic that 127.0.0.1 is putting on the wire does not go out, I am thinking firewall, but you may be dealing with bandwidth exhaustion exclusively, and it is presenting itself in this manner. Or you may have a server configuration issue or a server that is underpowered.

Sometimes pcaps are black and white and give you a "here is your problem" answer, and other times it is like this, where the capture does not give us anything conclusive to work with.

Since this server is sputtering around, I would set about first stabilizing traffic from 127.0.0.1 going out. You need to see outbound traffic hit 127.0.0.1 and then hit your external adapter without missing. Boom, boom, boom on down the line.
Hopefully others have better, more insightful suggestions.

Good hunting!

John
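The packet-by-packet walk above (query on localhost, relay to an upstream server, upstream reply, then a late or missing answer on localhost) can be approximated in a script by pairing each query with its response and looking at the gap. This is a sketch on synthetic data only: the Packet class, the match_queries helper, the names example.test and other.test, and the 192.0.2.x / 198.51.100.x addresses are all hypothetical, and a real trace would also need ports in the matching key.

```python
from dataclasses import dataclass

@dataclass
class Packet:
    time: float        # seconds since start of capture
    src: str
    dst: str
    txid: int          # DNS transaction ID
    qname: str
    is_response: bool

def match_queries(packets):
    """Pair queries with responses; return (latencies, unanswered)."""
    pending = {}
    latencies = []
    for p in sorted(packets, key=lambda pkt: pkt.time):
        if p.is_response:
            # A response travels from server to client, so swap the addresses.
            sent = pending.pop((p.dst, p.src, p.txid, p.qname), None)
            if sent is not None:
                latencies.append((p.qname, p.dst, p.time - sent))
        else:
            pending[(p.src, p.dst, p.txid, p.qname)] = p.time
    unanswered = [(key[3], key[0]) for key in pending]
    return latencies, unanswered

# Synthetic trace: a fast upstream exchange, a 25-second local answer,
# and a local query that never gets a response.
trace = [
    Packet(0.00, "127.0.0.1", "127.0.0.1", 1, "example.test", False),
    Packet(0.01, "192.0.2.10", "198.51.100.53", 77, "example.test", False),
    Packet(0.05, "198.51.100.53", "192.0.2.10", 77, "example.test", True),
    Packet(25.00, "127.0.0.1", "127.0.0.1", 1, "example.test", True),
    Packet(30.00, "127.0.0.1", "127.0.0.1", 2, "other.test", False),
]

latencies, unanswered = match_queries(trace)
for qname, client, seconds in latencies:
    flag = "SLOW" if seconds > 5.0 else "ok"
    print(f"{qname} asked by {client}: {seconds:.2f}s {flag}")
for qname, client in unanswered:
    print(f"{qname} asked by {client}: no response seen")
```

Sorting the output by client address separates delays on the localhost side from delays on the routable adapter, which is the distinction the walk-through above is trying to draw by hand.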