I've been trying to track down a weird problem with our mail system
suddenly believing that a host does not exist, or timing out in DNS.
I tracked it down to the DNS server, but I am not entirely sure what is
going on. What appears to be happening is that the glue A record
for a domain's name server is getting lost while the NS record
remains. Once named gets into this state, it doesn't seem to be able
to recover: it sees the NS record but can't resolve it because
the glue record is gone, and it doesn't try to fetch it again after that.
If you look at the cache dumps and dig output below, you can clearly
see that the timeout (TTL) for fuji.jamcracker.com is less than the timeout
for jamcracker.com AFTER we've looked up other records for fuji,
which means that fuji's A record will expire first.
But that A record is the IP address of the name server. So when it times
out, the jamcracker entry is left with an NS record and no address for it.
I believe what is happening is that something is looking up other
records for fuji, and this is replacing the original glue record with
the real A record, but also changing the timeouts somehow and
causing fuji's record to time out early.
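
To make the window concrete, here is a small Python sketch of the failure as I understand it. It is only a toy model of a cache, not how named actually stores or expires records; the two TTL values are taken from the third cache dump further down, and the function names are made up for illustration.

```python
# Toy model of the cache state, NOT named's real data structures.
# Remaining TTLs (seconds) copied from the third cache dump in this message.
NS_KEY = ("jamcracker.com.", "NS")
GLUE_KEY = ("fuji.jamcracker.com.", "A")

cache = {
    NS_KEY: 2016,    # NS record pointing at fuji
    GLUE_KEY: 1846,  # address of fuji, the only name server
}

def live_records(cache, elapsed):
    """Records whose TTL has not run out after `elapsed` seconds."""
    return {key for key, ttl in cache.items() if ttl > elapsed}

def ns_without_glue(cache, elapsed):
    """True when the NS record is still cached but its address is gone."""
    live = live_records(cache, elapsed)
    return NS_KEY in live and GLUE_KEY not in live

# Length of the window in which the NS record outlives its address.
window = cache[NS_KEY] - cache[GLUE_KEY]
print(window)                        # 170
print(ns_without_glue(cache, 1800))  # False: both records still cached
print(ns_without_glue(cache, 1900))  # True: NS cached, address expired
```

If the hypothesis is right, any lookup that lands inside that window would find an NS record it cannot follow.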
As far as I can tell, this is an extremely serious bug in named. I am
running 8.2.3.
This has occurred with several mail destinations, not just jamcracker.
I went through jamcracker's whole DNS hierarchy and everything is set up
properly, including all the timeouts (they are all set to 3600 seconds).
Has anyone else seen this? Anyone know what is going on here?
-Matt
---
Here is a cache dump of a case where 'nslookup -query=mx jamcracker.com'
no longer works. Everything with jamcracker in it is being dumped:
jamcracker 2436 IN SOA fuji.jamcracker.com. hostmaster.jamcracker.com. (
2001062900 10800 3600 1728000 3600 ) ;Cr=auth [216.32.126.150]
; 2436 IN AAAA fuji.jamcracker.com. hostmaster.jamcracker.com. (
; 2001062900 10800 3600 1728000 3600 );jamcracker.com.;NODATA ;-$
;Cr=auth [216.32.126.150]
2436 IN NS fuji.jamcracker.com. ;Cr=auth [216.32.126.150]
2436 IN A 66.35.217.100 ;Cr=auth [216.32.126.150]
And here is a cache dump after I restart named and do the same nslookup:
jamcracker 3591 IN NS fuji.jamcracker.com. ;Cr=auth [216.32.126.150]
3591 IN MX 5 va2mc.ummailbox.net. ;Cr=auth [216.32.126.150]
$ORIGIN jamcracker.com.
fuji 3591 IN A 66.35.220.151 ;Cr=addtnl [216.32.126.150]
And here is a dump after named has been running a while:
jamcracker 2016 IN NS fuji.jamcracker.com. ;Cr=auth [216.32.126.150]
2016 IN MX 5 va2mc.ummailbox.net. ;Cr=auth [216.32.126.150]
2206 IN SOA fuji.jamcracker.com. hostmaster.jamcracker.com. (
2001062900 10800 3600 1728000 3600 ) ;Cr=auth [66.35.220.151]
; 2206 IN AAAA fuji.jamcracker.com. hostmaster.jamcracker.com. (
; 2001062900 10800 3600 1728000 3600 );jamcracker.com.;NODATA ;-$
;Cr=auth [66.35.220.151]
3140 IN A 66.35.217.100 ;Cr=auth [66.35.220.151]
$ORIGIN jamcracker.com.
fuji 1846 IN A 66.35.220.151 ;NT=13 Cr=addtnl [216.32.126.150]
; 2213 IN AAAA fuji.jamcracker.com. hostmaster.jamcracker.com. (
; 2001062900 10800 3600 1728000 3600 );jamcracker.com.;NODATA ;-$
;
And here is the dig output.
earth:/etc/namedb# dig jamcracker.com
; <<>> DiG 8.3 <<>> jamcracker.com
;; res options: init recurs defnam dnsrch
;; got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 4
;; flags: qr rd ra; QUERY: 1, ANSWER: 1, AUTHORITY: 1, ADDITIONAL: 1
;; QUERY SECTION:
;; jamcracker.com, type = A, class = IN
;; ANSWER SECTION:
jamcracker.com. 50m27s IN A 66.35.217.100
;; AUTHORITY SECTION:
jamcracker.com. 31m43s IN NS fuji.jamcracker.com.
;; ADDITIONAL SECTION:
fuji.jamcracker.com. 28m53s IN A 66.35.220.151
;; Total query time: 1 msec
;; FROM: earth.backplane.com to SERVER: default -- 127.0.0.1
;; WHEN: Mon Jul 16 17:36:13 2001
;; MSG SIZE sent: 32 rcvd: 83
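
For anyone who wants to compare the TTLs in the dig output above without doing the arithmetic by hand, here is a rough Python sketch. The helper name ttl_seconds is made up, and the set of unit letters (w, d, h, m, s) is my assumption about what this version of dig prints; only m and s appear in the output above.

```python
import re

# Assumed unit suffixes; only "m" and "s" are confirmed by the output above.
UNITS = {"w": 604800, "d": 86400, "h": 3600, "m": 60, "s": 1}

def ttl_seconds(text):
    """Convert a TTL such as '28m53s' (or a plain '3600') to seconds."""
    parts = re.findall(r"(\d+)([wdhms])", text)
    if not parts:
        return int(text)  # already a plain number of seconds
    return sum(int(number) * UNITS[unit] for number, unit in parts)

# TTLs copied from the AUTHORITY and ADDITIONAL sections above.
ns_ttl = ttl_seconds("31m43s")    # jamcracker.com. NS
glue_ttl = ttl_seconds("28m53s")  # fuji.jamcracker.com. A

print(ns_ttl, glue_ttl)   # 1903 1733
print(ns_ttl - glue_ttl)  # 170: seconds the NS record outlives its address
```

The 170 second gap matches the difference between the 2016 and 1846 values in the last cache dump.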