servfail only for a zone

2015-07-13 Thread Lucio Crusca

Hello,

I have two nameservers, the master and its slave, and they work ok for 
several zones. However for one of the zones (aquilacorde.com), the slave 
replies with SERVFAIL, and I don't understand why.


The master is ns1.virtualbit.it, the slave is ns2.virtualbit.it.

I've tried enabling debug output for ns2, but in the logs I get nothing 
useful (or nothing I can understand). Here is what I get in the logs 
when I try


$ dig @ns2.virtualbit.it aquilacorde.com

8x---

# tail -f /var/log/syslog | grep named
...
Jul 13 19:10:46 apps named[25270]: client 74.125.187.80#33867 (aquilacorde.com): query: aquilacorde.com IN MX -EDC (136.243.232.143)
Jul 13 19:10:46 apps named[25270]: client 74.125.187.80#33867 (aquilacorde.com): query failed (SERVFAIL) for aquilacorde.com/IN/MX at query.c:5813
Jul 13 19:10:46 apps named[25270]: client 74.125.187.80#42506 (aquilacorde.com): query: aquilacorde.com IN MX -EDC (136.243.232.143)
Jul 13 19:10:46 apps named[25270]: client 74.125.187.80#42506 (aquilacorde.com): query failed (SERVFAIL) for aquilacorde.com/IN/MX at query.c:5813
Jul 13 19:10:46 apps named[25270]: client 74.125.187.82#59535 (aquilacorde.com): query: aquilacorde.com IN MX -C (136.243.232.143)
Jul 13 19:10:46 apps named[25270]: client 74.125.187.82#59535 (aquilacorde.com): query failed (SERVFAIL) for aquilacorde.com/IN/MX at query.c:5813

...

And here is the aquilacorde.com zonefile at the master ns1:

$TTL 3600
@   IN  SOA ns1.virtualbit.it. info.aquilacorde.com. (
            2015070601 ; Serial
            1200       ; Refresh
            180        ; Retry
            3600       ; Expire
            3600 )     ; Default TTL
;
@                       IN  NS     ns1.virtualbit.it.
@                       IN  NS     ns2.virtualbit.it.
aquilacorde.com.        IN  MX     1 aspmx.l.google.com.
aquilacorde.com.        IN  MX     5 alt1.aspmx.l.google.com.
aquilacorde.com.        IN  MX     5 alt2.aspmx.l.google.com.
aquilacorde.com.        IN  MX     10 alt3.aspmx.l.google.com.
aquilacorde.com.        IN  MX     10 alt4.aspmx.l.google.com.
aquilacorde.com.        IN  A      136.243.232.141
new                     IN  A      136.243.232.141
www                     IN  A      136.243.232.141
www2                    IN  A      136.243.232.141
ftp                     IN  A      136.243.232.141
beta                    IN  A      136.243.232.141
shop                    IN  A      136.243.232.141
ricerche                IN  A      136.243.232.141
old                     IN  A      195.138.240.116
start                   IN  CNAME  ghs.google.com.
sites                   IN  CNAME  ghs.google.com.
mail                    IN  CNAME  ghs.google.com.
calendar                IN  CNAME  ghs.google.com.
docs                    IN  CNAME  ghs.google.com.
googlec4e941738b3160fb  IN  CNAME  google.com.

What's wrong?
___
Please visit https://lists.isc.org/mailman/listinfo/bind-users to unsubscribe 
from this list

bind-users mailing list
bind-users@lists.isc.org
https://lists.isc.org/mailman/listinfo/bind-users


Re: servfail only for a zone

2015-07-13 Thread Reindl Harald


On 13.07.2015 at 19:19, Lucio Crusca wrote:

> I have two nameservers, the master and its slave, and they work ok for
> several zones. However for one of the zones (aquilacorde.com), the slave
> replies with SERVFAIL, and I don't understand why

Check whether the zone failed to update from the master and has expired. 
I have been there: a Cisco router with "DNS ALG" enabled caused just a 
few large zones to fail to transfer.





Re: servfail only for a zone

2015-07-13 Thread John Miller
Something I'm noticing is that your SOA record fields are quite small:

aquilacorde.com. 3600 IN SOA ns1.virtualbit.it. info.aquilacorde.com. 2015070601 1200 180 3600 3600

Specifically, your expiration time (first of the 3600s) is set to one
hour.  This means that if ns2 hasn't contacted ns1 in an hour, the zone
will be invalid on ns2.  If you're making a whole ton of updates, then the
small times make sense, but for the zone you posted, that doesn't seem to
be the case.  Normally it's not a problem, but if you can't respond to a
communication outage between the two nameservers within an hour, the second
will stop working.

This is just a guess, but network communication/failed zone transfer seems
the most likely culprit for something like this (entire zone returns
SERVFAIL).
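
To make the timer fields easier to read, here is a minimal sketch that splits the SOA data quoted above and prints each timer in days, hours, minutes and seconds. It needs no network access. The function names `human` and `soa_timers` are made up for this illustration and are not part of BIND or dig.

```shell
#!/bin/sh
# Hypothetical helper: print a number of seconds as days/hours/minutes/seconds.
human() {
    s=$1
    printf '%dd %dh %dm %ds\n' \
        $((s / 86400)) $((s % 86400 / 3600)) $((s % 3600 / 60)) $((s % 60))
}

# Hypothetical helper: label the numeric SOA fields.
# Input (one argument): "mname rname serial refresh retry expire minimum"
soa_timers() {
    # Intentional word splitting of the single argument into $1..$7.
    set -- $1
    echo "serial:  $3"
    echo "refresh: $(human "$4")"
    echo "retry:   $(human "$5")"
    echo "expire:  $(human "$6")"
    echo "minimum: $(human "$7")"
}

# SOA data as posted in this thread.
soa_timers "ns1.virtualbit.it. info.aquilacorde.com. 2015070601 1200 180 3600 3600"
```

With the posted values, the expire line reads one hour, so a slave that cannot refresh for an hour stops answering for the zone.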

John
-- 
John Miller
Systems Engineer
Brandeis University
johnm...@brandeis.edu


Re: servfail only for a zone

2015-07-13 Thread Lucio Crusca



On 13/07/2015 19:21, Reindl Harald wrote:

> Check whether the zone failed to update from the master and has expired.
> I have been there: a Cisco router with "DNS ALG" enabled caused just a
> few large zones to fail to transfer.

Yes, the zone failed to update. I know because if I raise the serial 
number on ns1, ns2 tries to update and keeps failing. I don't understand 
why it fails. I doubt a Cisco router is to blame here, because ns1 and 
ns2 are two guests on the same host, with no routers between them.





Re: servfail only for a zone

2015-07-13 Thread Charles Swiger
On Jul 13, 2015, at 10:34 AM, Lucio Crusca  wrote:
[ ... ]
> Yes the zone failed to update, I know because if I raise the seqno @ns1, it 
> tries to update and it keeps failing. I don't understand why it fails. I 
> doubt a Cisco router is to blame here because ns1 and ns2 are two guests of 
> the same host, no routers between them.

Note that you should put your nameservers on different physical hardware to 
ensure high availability.

(Different HW located in different datacenters using different upstream network 
providers, ideally)

Regards,
-- 
-Chuck


RE: servfail only for a zone

2015-07-13 Thread Darcy Kevin (FCA)
Expiration values should be set long enough to detect zone-transfer 
problems and react to them, but not so long that, when a zone finally 
expires after being deliberately removed from the master (but not from the 
slaves), everyone is sitting around, scratching their heads, going “what zone 
was that again? I don’t even remember that being in our config. Maybe it was 
something my predecessor added and never documented…”

A week is pretty much the bare minimum I’d want to see an EXPIRE set to, but 
typically I’ve set it to 1000 hours (3,600,000 seconds), which is more than 41 
days.
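
The arithmetic above can be checked with a short sketch. It assumes the one-week floor and the 1000-hour value mentioned in this message; the helper name `check_expire` and the variable names are hypothetical.

```shell
#!/bin/sh
WEEK=$((7 * 24 * 3600))         # 604800 seconds, the suggested bare minimum
LONG_EXPIRE=$((1000 * 3600))    # 1000 hours = 3600000 seconds

# Hypothetical helper: report whether an expire value (seconds) clears one week.
check_expire() {
    if [ "$1" -lt "$WEEK" ]; then
        echo "expire ${1}s: below the one-week minimum (${WEEK}s)"
    else
        echo "expire ${1}s: ok, $(( $1 / 86400 )) full days"
    fi
}

check_expire 3600            # the value in the posted zone
check_expire "$LONG_EXPIRE"  # the 1000-hour value
```

The second call reports 41 full days, which matches the "more than 41 days" figure.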

One hour is ridiculous, to be honest. Unless you have 24x7x365 
eyes-on-glass looking for zone transfer failures constantly and ready and able 
to instantly pounce on any such problems and fix them within minutes.


- Kevin


Re: servfail only for a zone

2015-07-13 Thread Lucio Crusca


On 13/07/2015 19:51, Darcy Kevin (FCA) wrote:
> One hour is ridiculous, to be honest. Unless you have 24x7x365
> eyes-on-glass looking for zone transfer failures *constantly* and
> ready and able to *instantly* pounce on any such problems and fix them
> within minutes.


You have been persuasive enough, I'm definitely going to raise the 
expire value, but now the question is: are the SERVFAIL replies a 
consequence of the low expire value?



RE: servfail only for a zone

2015-07-13 Thread Darcy Kevin (FCA)
They are a consequence of a slave not being able to refresh from the master 
for a while. This may be because the zone was removed from the master (but 
someone overlooked removing it from the slave(s) as well), because of bad 
permissions on the master, or because of a communications error, to name a few 
causes. If your EXPIRE is too low, you don’t give yourself enough time to 
catch and correct those errors.


- Kevin


Re: servfail only for a zone

2015-07-13 Thread Reindl Harald


On 13.07.2015 at 20:15, Lucio Crusca wrote:

> On 13/07/2015 19:51, Darcy Kevin (FCA) wrote:
>
>> One hour is ridiculous, to be honest. Unless you have 24x7x365
>> eyes-on-glass looking for zone transfer failures *constantly* and
>> ready and able to *instantly* pounce on any such problems and fix them
>> within minutes.
>
> You have been persuasive enough, I'm definitely going to raise the
> expire value, but now the question is: are the SERVFAIL replies a
> consequence of the low expire value?


Most likely, yes.

Zone transfers are retried often, but that doesn't help with such a low 
expire time. The question remains why they are failing on the same 
host, but that's not a BIND problem.

As somebody else said: you should not run both nameservers on the same 
host with the same internet connection. Virtualization is fine, but you 
need at least a real HA cluster and independent lines for both, to 
minimize the chance that both nameservers go down at the same time.

I would recommend running http://www.intodns.com/ regularly for your domains.

http://www.intodns.com/aquilacorde.com is a horrible result!






Re: servfail only for a zone

2015-07-13 Thread Lucio Crusca



On 13/07/2015 20:21, Reindl Harald wrote:

> Zone transfers are retried often, but that doesn't help with such a low
> expire time. The question remains why they are failing on the same
> host, but that's not a BIND problem.

I'm pretty sure it's not a BIND problem (I'm not claiming it's a BIND 
bug), but I need help to spot the problem, because the BIND logs do not 
make it clear to me what it is. Network connectivity between ns1 and ns2 
is obviously 100% reliable, since they are on the same host, and other 
zones do update correctly. So what does the aquilacorde.com zone have 
that blocks updates?

> As somebody else said: you should not run both nameservers on the same
> host with the same internet connection. Virtualization is fine, but you
> need at least a real HA cluster and independent lines for both, to
> minimize the chance that both nameservers go down at the same time.

I got it, thanks, and I already have another server at another 
provider in a different country that I can use for the job. However, for 
the time being, I'm more interested in understanding what's going on than 
in adding random variables to the system, such as a possibly unreliable 
internet connection.




Re: servfail only for a zone

2015-07-13 Thread John Miller
On Mon, Jul 13, 2015 at 2:15 PM, Lucio Crusca  wrote:

>
> You have been persuasive enough, I'm definitely going to raise the expire
> value, but now the question is: are the SERVFAIL replies a consequence of
> the low expire value?
>

It doesn't help your cause _at_all_.  There could be a few reasons why
you're getting SERVFAIL responses from your second nameserver, but the zone
being expired is the most likely.  Check everything:

- physical connectivity between ns2 and ns1
- zone transfer settings (allow-transfer, allow-notify, TSIG settings and
keys, etc.)

A sample troubleshooting sequence run from ns2 might look something like:

- Can you ping ns1 from ns2?
- Can you query ns1 (dig @ns1) from ns2?
- Can you do a manual zone transfer from ns1 to ns2: dig @ns1
aquilacorde.com AXFR
- If you're using TSIG for your zone transfers, you'll need to set the
appropriate options in dig.
- On ns2, can you run "rndc reload" on aquilacorde.com?  What do your logs
say when you do this?
- What happens when you increment the zone's serial number on ns1?  Does
ns1 automatically send a NOTIFY?
- If you're able (there aren't other zones to worry about), what happens
when you restart BIND on ns2?  What do the logs say?

If you've done most of these troubleshooting steps, you'll know whether you
have:
- basic network connectivity
- basic DNS connectivity (UDP port 53)
- DNS zone transfer connectivity (TCP port 53; AXFR uses TCP)
- DNS zone transfer ability
- useful logging
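
The sequence above could be scripted roughly as follows. This is a sketch under assumptions, not a definitive procedure: it is meant to run on ns2, the variable names and the `run` and `check_master` helpers are hypothetical, and by default it only prints the commands (set DRY_RUN=0 to execute them).

```shell
#!/bin/sh
# Assumed names taken from this thread; adjust for your setup.
MASTER=ns1.virtualbit.it
ZONE=aquilacorde.com
DRY_RUN=${DRY_RUN:-1}   # 1 = only print the commands, 0 = execute them

# Hypothetical wrapper: show each command, run it only when DRY_RUN is not 1.
run() {
    echo "+ $*"
    [ "$DRY_RUN" = 1 ] || "$@"
}

check_master() {
    run ping -c 3 "$MASTER"                          # basic network connectivity
    run dig "@$MASTER" "$ZONE" SOA +norecurse        # DNS query over UDP port 53
    run dig "@$MASTER" "$ZONE" SOA +norecurse +tcp   # same query over TCP port 53
    run dig "@$MASTER" "$ZONE" AXFR                  # manual zone transfer
    run rndc reload "$ZONE"                          # then watch the named log
}

check_master
```

If transfers are signed with TSIG, the dig calls would also need a key, for example through dig's -k (key file) or -y option. A step that fails while the earlier ones succeed narrows the problem to that layer.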

and... CHANGE YOUR EXPIRE VALUE NOW!!

John

Re: BIND slave server ignoring responses to all UDP-based SOA queries (zone refresh) for hours at a time

2015-07-13 Thread Irwin Tillman
At 07 Jul 2015 13:47:45 +0100, Cathy Almond  wrote:

>What can happen (and this is really really subtle) is that if there are
>some source ports that named could randomly select, but where
>intermediate firewalls or filters are just dropping, either the SOA
>refresh queries, or the responses, then named can 'get stuck' on using
>and re-using the same refresh source port.
>...

Thank you, that was exactly the cause, and the fix.

Some years ago I'd updated a host-based firewall running on my BIND slave
server to block traffic to an additional inbound UDP port that falls
into the range BIND may use for ephemeral ports. At that time I neglected to
add that port to BIND's config (avoid-v4-udp-ports and avoid-v6-udp-ports).

When BIND picked that src port for its UDP SOA queries, the incoming SOA replies
were blocked by that firewall.  

For some reason BIND wasn't picking that port often (or wasn't getting stuck on
that port for long enough for me to notice) until I recently made an apparently
unrelated config change (expanding the use of request-ixfr) to my BIND slave
server.  Once I made that change, BIND got stuck on that port
(for all the SOA queries for all the zones it pulled from various unrelated
masters) for hours at a time every 1-3 days (until picking another port),
exposing my latent configuration problem.

Irwin Tillman
OIT Networking & Monitoring Systems, Princeton University



Zone refresh error: refresh: retry limit for master a.b.c.d#53 exceeded

2015-07-13 Thread Anand Buddhdev
Dear BIND users and developers,

I have 2 BIND 9.10.2-P2 servers, on the same OS and OS version, on
different networks, configured as slaves for many zones.

On one server, everything works well, and there isn't even a single
error in the log. But on the other, I see lots of errors like this:

13-Jul-2015 17:06:33.356 general: zone Z/IN/main: refresh: retry limit
for master a.b.c.d#53 exceeded (source 0.0.0.0#0)
13-Jul-2015 17:07:03.681 general: zone Z/IN/main: refresh: retry limit
for master a.b.c.e#53 exceeded (source 0.0.0.0#0)
13-Jul-2015 17:07:34.517 general: zone Z/IN/main: refresh: retry limit
for master a.b.c.f#53 exceeded (source 0.0.0.0#0)

My understanding of this error is that a SOA query over UDP for the zone
failed. However, if I use dig on this server where I see errors in the
log, to query for the SOA record of the zone, it succeeds, against each
master. There are no errors, and no timeouts.

On both servers, "try-tcp-refresh" is set to "no", because I don't want
the servers wasting time with TCP and timing out, if the UDP SOA query
has failed. Both servers have 4826 zones on them.

If I run tcpdump, I see queries for SOA records originating from the
server to the masters, and responses arriving from the masters, from the
correct source address and port, so the master servers are certainly okay.

The effect of these errors on the server is that zones on it are
frequently late with updates, whereas the other server updates promptly.

So what could cause these SOA lookup failures in BIND on one server, but
not another? Could the developers tell me how BIND does SOA queries over
UDP, and is there any way to mimic this with dig?

Regards,

Anand Buddhdev
RIPE NCC


Re: Zone refresh error: refresh: retry limit for master a.b.c.d#53 exceeded

2015-07-13 Thread Anand Buddhdev
On 13/07/15 21:31, Anand Buddhdev wrote:

> So what could cause these SOA lookup failures in BIND on one server, but
> not another? Could the developers tell me how BIND does SOA queries over
> UDP, and is there any way to mimic this with dig?

Oops. I just noticed Cathy Almond's response to Irwin Tillman, and
recognised the symptom. It turns out that our network guys are blocking
outbound UDP queries with a source port of 2049, and BIND is getting
stuck on this. Now that I know the problem, I know whom to chase for a
solution.
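
One way to approximate such a query with dig is to fix the source port with dig's -b option, which accepts address#port. The sketch below only builds and prints the command. MASTER and ZONE are placeholders, since the thread anonymises them as a.b.c.d and Z, and the helper name `soa_probe_cmd` is hypothetical. It is an approximation and not a description of how named itself sends refresh queries.

```shell
#!/bin/sh
# Placeholders: the thread anonymises the master as a.b.c.d and the zone as Z.
MASTER=${MASTER:-a.b.c.d}
ZONE=${ZONE:-Z}
SRC_PORT=${SRC_PORT:-2049}   # the source port reported as filtered

# Hypothetical helper: print (do not run) a dig command that sends one
# non-recursive SOA query over UDP from the given local source port.
soa_probe_cmd() {
    echo "dig -b 0.0.0.0#$1 @$MASTER $ZONE SOA +norecurse +notcp +tries=1 +time=5"
}

soa_probe_cmd "$SRC_PORT"
```

Running the printed command with real values, once from the suspect port and once from another port, would show whether only the suspect port times out. The chosen port has to be free on the local host for dig to bind to it.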

Apologies for wasting everyone's time with my rather long post. I should
have read the archives of the list first!

Regards,
Anand


Re: Zone refresh error: refresh: retry limit for master a.b.c.d#53 exceeded

2015-07-13 Thread Reindl Harald



On 13.07.2015 at 21:46, Anand Buddhdev wrote:

> On 13/07/15 21:31, Anand Buddhdev wrote:
>
>> So what could cause these SOA lookup failures in BIND on one server, but
>> not another? Could the developers tell me how BIND does SOA queries over
>> UDP, and is there any way to mimic this with dig?
>
> Oops. I just noticed Cathy Almond's response to Irwin Tillman, and
> recognised the symptom. It turns out that our network guys are blocking
> outbound UDP queries with a source port of 2049, and BIND is getting
> stuck on this. Now that I know the problem, I know whom to chase for a
> solution.
>
> Apologies for wasting everyone's time with my rather long post. I should
> have read the archives of the list first!


greetings to the firewall admins

* they should monitor their logs
* additional:  -m conntrack --ctstate NEW may help in general




Re: servfail only for a zone

2015-07-13 Thread Lucio Crusca



On 13/07/2015 20:47, John Miller wrote:

> the zone being expired is the most likely.  Check everything:
>
> - physical connectivity between ns2 and ns1


That was the problem. I recently changed iptables rules on ns1 and 
forgot to test this little thing. The other zones weren't failing 
because they had not been changed recently, so no updates were needed 
for them.


Thanks a lot.



> and... CHANGE YOUR EXPIRE VALUE NOW!!



I did. intodns.com is not happy nevertheless. But I did.