Hi,

I wondered for a while whether this would be more appropriate to
post here or as an issue in ISC's GitLab, but concluded that for
now the best place is here, because the "how to reproduce the
problem" part is quite fuzzy.

If someone from ISC wants this reported as a GitLab issue as
well, I can do that, of course.

Context: we are running 4 nodes in an anycast setup, providing
our users with DNS recursor service, and RPZ service to a subset
of these users.

We have been using BIND 9.20 for a while, and have applied ISC's
releases shortly after they were published, so until recently we
were running 9.20.6 for this service.

Recently we started receiving reports from some of our users that
"DNS lookups are unreliable".  Here is an example which I managed
to catch and reproduce (prompted by a report for one of the other
3 nodes):

$ dig @osl-res.uninett.no. freebsd.org. a

; <<>> DiG 9.14.7 <<>> @osl-res.uninett.no. freebsd.org. a
; (2 servers found)
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: SERVFAIL, id: 51745
;; flags: qr rd ra; QUERY: 1, ANSWER: 0, AUTHORITY: 0, ADDITIONAL: 1

;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags:; udp: 1232
; COOKIE: 16c89ea584a0a45c0100000067d2ad42211e91f71ee4fdcc (good)
;; QUESTION SECTION:
;freebsd.org.                   IN      A

;; Query time: 27 msec
;; SERVER: 2001:700:0:102::ca53#53(2001:700:0:102::ca53)
;; WHEN: Thu Mar 13 11:02:42 CET 2025
;; MSG SIZE  rcvd: 68

$ dig @osl-res.uninett.no. freebsd.org. a

; <<>> DiG 9.14.7 <<>> @osl-res.uninett.no. freebsd.org. a
; (2 servers found)
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 2380
;; flags: qr rd ra ad; QUERY: 1, ANSWER: 1, AUTHORITY: 0, ADDITIONAL: 1

;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags:; udp: 1232
; COOKIE: 893098498db1b2330100000067d2ad4b2f511f5ac2cf4c48 (good)
;; QUESTION SECTION:
;freebsd.org.                   IN      A

;; ANSWER SECTION:
freebsd.org.            3600    IN      A       96.47.72.84

;; Query time: 30 msec
;; SERVER: 2001:700:0:102::ca53#53(2001:700:0:102::ca53)
;; WHEN: Thu Mar 13 11:02:51 CET 2025
;; MSG SIZE  rcvd: 84

$

The name server in question does not have any connectivity issues
that I'm aware of, and it really doesn't make a whole lot of
sense to me that it would reply with SERVFAIL at one instant,
only to respond with a DNSSEC-validated OK reply seconds later.
I've looked in the logs for the SERVFAIL for this domain without
success; apparently our logging does not catch those.
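
One way to catch these on a busy resolver might be to poll it and
log every answer that is not NOERROR, with a timestamp that can be
matched against the server logs.  A minimal sketch follows; the
function names, the 5-second default interval and the choice of
dig options are assumptions for illustration, not something we
run today.  Note that a cached answer will mask the problem, so a
name with a short TTL is probably a better probe.

```shell
#!/bin/sh
# Hypothetical helper: read dig output on stdin and print the rcode
# found in the ";; ->>HEADER<<-" line (e.g. SERVFAIL, NOERROR).
extract_status() {
    sed -n 's/.*status: \([A-Z0-9]*\),.*/\1/p' | head -n 1
}

# Hypothetical polling loop: query one name repeatedly and print a
# timestamped line whenever the rcode is anything other than NOERROR.
#   $1 = server, $2 = name, $3 = seconds between queries (default 5)
poll_resolver() {
    server=$1
    name=$2
    interval=${3:-5}
    while :; do
        status=$(dig "@$server" "$name" a +tries=1 +time=3 | extract_status)
        if [ "$status" != "NOERROR" ]; then
            # An empty status means no header line was seen at all,
            # for example because the query timed out.
            printf '%s %s %s\n' \
                "$(date -u +%Y-%m-%dT%H:%M:%SZ)" \
                "$name" "${status:-NO-RESPONSE}"
        fi
        sleep "$interval"
    done
}
```

It could be run as, say,
"poll_resolver osl-res.uninett.no. freebsd.org. 10 >> servfail.log".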

At the time when this was done, the name server had been running
for weeks:

osl-res: {1} ps axu | egrep 'PID|named'
USER       PID %CPU %MEM     VSZ    RSS TTY     STAT STARTED        TIME COMMAND
named     6739  114  2.6 1363112 866384 ?       Osl  27Feb25 14435:20.10 /usr/p
osl-res: {2} 

This node peaks at around 3000 qps, and rarely if ever serves
less than 700 qps during a 24-hour cycle.  This makes it
somewhere between difficult and impossible to provide the precise
reproducer description which a proper bug report obviously calls
for.

It also implements RFC 9462, "Discovery of Designated Resolvers",
pointing clients to the DoT and DoH endpoints this instance
serves by publishing _dns.resolver.arpa SVCB records in the DNS
view for the clients.  As a consequence, a fair number of queries
(20%? 30%?) arrive over those transports.
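
For anyone wanting to see what such a resolver hands out to
clients, the records can be fetched with dig.  This is only a
sketch: the function name and the DIG override variable are
hypothetical, and TYPE64 is used instead of the SVCB mnemonic so
that older dig versions also accept it.

```shell
#!/bin/sh
# Hypothetical helper: ask a resolver for its RFC 9462 discovery
# records.  SVCB is RR type 64; the generic TYPE64 form is used so
# the query also works with dig versions that predate SVCB support.
# Set DIG=echo for a dry run that only prints the arguments.
ddr_query() {
    ${DIG:-dig} "@$1" _dns.resolver.arpa. TYPE64 +short
}
```

For example: "ddr_query osl-res.uninett.no."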

For now we have downgraded BIND to 9.18.34 on the two nodes where
similar trouble has been reported, and we will in all probability
do the same for the remaining two nodes in the cluster.  That is
a shame, really, but having this sort of issue pop up at
unpredictable times, and exposing our users to it, is not exactly
ideal.

So...  What I guess I'm doing with this message is asking whether
anyone else has been experiencing anything resembling this
problem, or whether anyone has any more clues to share to guide
further debugging of it.

FWIW, we're running BIND on NetBSD/amd64 10.0 on these nodes.

Best regards,

- Håvard