Hi Ondrej,

thanks for the fast answer :)

On Mon, 10 Mar 2025, Ondřej Surý wrote:

bind crashes with assertion, maybe due to many ephemeral network devices?

Looking at the symptoms and your description, I actually think this is a problem
of interfaces appearing during the network interface scan and then disappearing
before named can process them.

I would suggest disabling automatic-interface-scan and setting up named to
listen on fixed addresses, so it doesn't have to deal with the mayhem Docker
is creating.

Yes, indeed: that fixes the issue for me! BIND has now been running stably for more than 8 hours.
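
For reference, a minimal sketch of what such a change could look like in the options block of /etc/named.conf. The addresses are assumptions for illustration: 192.168.188.2 is a hypothetical placeholder for the host's address in the 192.168.188.0/24 network from the original config, and any address the containers use to reach the resolver would need to be listed as well.

```
options {
   // ... existing options stay as they are ...

   // Do not rescan interfaces whenever one appears or disappears.
   automatic-interface-scan no;

   // Listen only on fixed addresses instead of the default "any".
   // 192.168.188.2 is a placeholder, not taken from the original config.
   listen-on { 127.0.0.1; 192.168.188.2; };
   listen-on-v6 { ::1; };
};
```

With listen-on-v6 left at its default of "any", named tries to open a socket on every link-local veth address, which is what the "listening on IPv6 interface veth..." lines in the log show. Setting interface-interval to 0 should additionally turn off the periodic rescan, if that turns out to be needed.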


I've unblocked and "trusted" your account, so it should not get blocked again.
If you set up 2FA on the account, it also acts as a permanent marker that this
is not a spam account.

Thanks a lot, I added 2FA. I think it will be some time, though, before I come back and actively participate in the bug tracker (due to BIND's stability :D).


Feel free to file the issue, but I can't promise it will be looked at soon,
as this is in "doctor, it hurts when I do this" territory.

Yeah, makes sense: you probably have more important things to do. I'll see whether the config change has any negative side effects for me, and only open a bug report if I see any problems with the current solution.


Ondrej

Cheers!
Erich

--
Ondřej Surý (He/Him)
ond...@isc.org

My working hours and your working hours may be different. Please do not feel 
obligated to reply outside your normal working hours.

On 10. 3. 2025, at 21:19, Erich Eckner <b...@eckner.net> wrote:

Hi,

I'm running BIND version 9.20.6 on Artix Linux (an Arch Linux derivative without 
systemd) with a pretty standard config:

# named -V
BIND 9.20.6 (Stable Release) <id:72cbad0>
running on Linux x86_64 6.13.5-artix1-1 #1 SMP PREEMPT_DYNAMIC Fri, 28 Feb 2025 
10:18:15 +0000
built by make with  '--prefix=/usr' '--sysconfdir=/etc' '--sbindir=/usr/bin' 
'--localstatedir=/var' '--disable-static' '--enable-fixed-rrset' 
'--enable-full-report' '--with-maxminddb' '--with-openssl' '--with-libidn2' 
'--with-json-c' '--with-libxml2' '--with-lmdb' 'CFLAGS=-march=x86-64 
-mtune=generic -O2 -pipe -fno-plt -fexceptions         -Wp,-D_FORTIFY_SOURCE=3 
-Wformat -Werror=format-security         -fstack-clash-protection 
-fcf-protection -flto=auto -DDIG_SIGCHASE' 'LDFLAGS=-Wl,-O1 -Wl,--sort-common 
-Wl,--as-needed -Wl,-z,relro -Wl,-z,now          -Wl,-z,pack-relative-relocs 
-flto=auto'
compiled by GCC 14.2.1 20250207
compiled with OpenSSL version: OpenSSL 3.4.1 11 Feb 2025
linked to OpenSSL version: OpenSSL 3.4.1 11 Feb 2025
compiled with libuv version: 1.50.0
linked to libuv version: 1.50.0
compiled with liburcu version: 0.15.0
compiled with jemalloc version: 5.3.0
compiled with libnghttp2 version: 1.64.0
linked to libnghttp2 version: 1.65.0
compiled with libxml2 version: 2.13.5
linked to libxml2 version: 21306-GITv2.13.6
compiled with json-c version: 0.18
linked to json-c version: 0.18
compiled with zlib version: 1.3.1
linked to zlib version: 1.3.1
linked to maxminddb version: 1.12.2
threads support is enabled
DNSSEC algorithms: RSASHA1 NSEC3RSASHA1 RSASHA256 RSASHA512 ECDSAP256SHA256 
ECDSAP384SHA384 ED25519 ED448
DS algorithms: SHA-1 SHA-256 SHA-384
HMAC algorithms: HMAC-MD5 HMAC-SHA1 HMAC-SHA224 HMAC-SHA256 HMAC-SHA384 
HMAC-SHA512
TKEY mode 2 support (Diffie-Hellman): no
TKEY mode 3 support (GSS-API): yes

default paths:
 named configuration:  /etc/named.conf
 rndc configuration:   /etc/rndc.conf
 nsupdate session key: /var/run/named/session.key
 named PID file:       /var/run/named/named.pid
 geoip-directory:      /usr/share/GeoIP


# grep '^\s*[^[:space:]#/]' /etc/named.conf
options {
   directory "/var/named";
   pid-file "/run/named/named.pid";
   allow-recursion { 127.0.0.1; 192.168.188.0/24; };
   allow-transfer { none; };
   allow-update { none; };
   version none;
   hostname none;
   server-id none;
};
zone "localhost" IN {
   type master;
   file "localhost.zone";
};
zone "0.0.127.in-addr.arpa" IN {
   type master;
   file "127.0.0.zone";
};
zone "1.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.ip6.arpa" 
{
   type master;
   file "localhost.ip6.zone";
};

# pgrep -af named
22958 /usr/bin/named -u named -L /var/log/named.log

For a few days (or weeks?) now, it has been acting up. Every few tens of 
minutes, it crashes with:

10-Mar-2025 20:33:36.995 general: error: uv.c:95:isc__uverr2result(): 
unexpected error:
10-Mar-2025 20:33:36.995 general: error: unable to convert libuv error code in 
start_udp_child_job (netmgr/udp.c:172) to isc_result: -19: no such device
10-Mar-2025 20:33:36.995 general: error: uv.c:95:isc__uverr2result(): 
unexpected error:
10-Mar-2025 20:33:36.995 general: error: unable to convert libuv error code in 
start_udp_child_job (netmgr/udp.c:172) to isc_result: -19: no such device
10-Mar-2025 20:33:36.995 general: error: uv.c:95:isc__uverr2result(): 
unexpected error:
10-Mar-2025 20:33:36.995 general: error: unable to convert libuv error code in 
start_udp_child_job (netmgr/udp.c:172) to isc_result: -19: no such device
10-Mar-2025 20:33:36.995 general: error: uv.c:95:isc__uverr2result(): 
unexpected error:
10-Mar-2025 20:33:36.995 general: error: unable to convert libuv error code in 
start_udp_child_job (netmgr/udp.c:172) to isc_result: -19: no such device
10-Mar-2025 20:33:36.995 general: error: uv.c:95:isc__uverr2result(): 
unexpected error:
10-Mar-2025 20:33:36.995 general: error: unable to convert libuv error code in 
start_udp_child_job (netmgr/udp.c:172) to isc_result: -19: no such device
10-Mar-2025 20:33:36.995 general: error: uv.c:95:isc__uverr2result(): 
unexpected error:
10-Mar-2025 20:33:36.995 general: error: unable to convert libuv error code in 
start_udp_child_job (netmgr/udp.c:172) to isc_result: -19: no such device
10-Mar-2025 20:33:36.995 general: error: uv.c:95:isc__uverr2result(): 
unexpected error:
10-Mar-2025 20:33:36.995 general: error: unable to convert libuv error code in 
start_udp_child_job (netmgr/udp.c:172) to isc_result: -19: no such device
10-Mar-2025 20:33:36.995 general: error: uv.c:95:isc__uverr2result(): 
unexpected error:
10-Mar-2025 20:33:36.995 general: error: unable to convert libuv error code in 
start_udp_child_job (netmgr/udp.c:172) to isc_result: -19: no such device
10-Mar-2025 20:33:36.995 general: error: uv.c:95:isc__uverr2result(): 
unexpected error:
10-Mar-2025 20:33:36.995 general: error: unable to convert libuv error code in 
start_udp_child_job (netmgr/udp.c:172) to isc_result: -19: no such device
10-Mar-2025 20:33:36.995 general: error: uv.c:95:isc__uverr2result(): 
unexpected error:
10-Mar-2025 20:33:36.995 general: error: unable to convert libuv error code in 
start_udp_child_job (netmgr/udp.c:172) to isc_result: -19: no such device
10-Mar-2025 20:33:36.995 general: error: uv.c:95:isc__uverr2result(): 
unexpected error:
10-Mar-2025 20:33:36.995 general: error: unable to convert libuv error code in 
start_udp_child_job (netmgr/udp.c:172) to isc_result: -19: no such device
10-Mar-2025 20:33:36.996 general: error: uv.c:95:isc__uverr2result(): 
unexpected error:
10-Mar-2025 20:33:36.996 general: error: unable to convert libuv error code in 
start_udp_child_job (netmgr/udp.c:172) to isc_result: -19: no such device
10-Mar-2025 20:33:36.996 general: error: uv.c:95:isc__uverr2result(): 
unexpected error:
10-Mar-2025 20:33:36.996 general: error: unable to convert libuv error code in 
start_udp_child_job (netmgr/udp.c:172) to isc_result: -19: no such device
10-Mar-2025 20:33:36.996 general: error: unable to convert libuv error code in 
start_udp_child_job (netmgr/udp.c:172) to isc_result: -19: no such device
10-Mar-2025 20:33:36.996 general: error: uv.c:95:isc__uverr2result(): 
unexpected error:
10-Mar-2025 20:33:36.996 general: error: unable to convert libuv error code in 
start_udp_child_job (netmgr/udp.c:172) to isc_result: -19: no such device
10-Mar-2025 20:33:36.996 general: error: uv.c:95:isc__uverr2result(): 
unexpected error:
10-Mar-2025 20:33:36.996 general: error: unable to convert libuv error code in 
start_udp_child_job (netmgr/udp.c:172) to isc_result: -19: no such device
10-Mar-2025 20:33:36.996 general: error: uv.c:95:isc__uverr2result(): 
unexpected error:
10-Mar-2025 20:33:36.996 general: error: unable to convert libuv error code in 
start_udp_child_job (netmgr/udp.c:172) to isc_result: -19: no such device
10-Mar-2025 20:33:36.996 general: error: unable to convert libuv error code in 
start_udp_child_job (netmgr/udp.c:172) to isc_result: -19: no such device
10-Mar-2025 20:33:36.996 general: error: uv.c:95:isc__uverr2result(): 
unexpected error:
10-Mar-2025 20:33:36.996 general: error: unable to convert libuv error code in 
start_udp_child_job (netmgr/udp.c:172) to isc_result: -19: no such device
10-Mar-2025 20:33:36.996 general: error: uv.c:95:isc__uverr2result(): 
unexpected error:
10-Mar-2025 20:33:36.996 general: error: unable to convert libuv error code in 
start_udp_child_job (netmgr/udp.c:172) to isc_result: -19: no such device
10-Mar-2025 20:33:36.996 general: error: uv.c:95:isc__uverr2result(): 
unexpected error:
10-Mar-2025 20:33:36.996 general: error: unable to convert libuv error code in 
start_udp_child_job (netmgr/udp.c:172) to isc_result: -19: no such device
10-Mar-2025 20:33:36.996 general: error: uv.c:95:isc__uverr2result(): 
unexpected error:
10-Mar-2025 20:33:36.996 general: error: unable to convert libuv error code in 
start_udp_child_job (netmgr/udp.c:172) to isc_result: -19: no such device
10-Mar-2025 20:33:36.996 general: error: uv.c:95:isc__uverr2result(): 
unexpected error:
10-Mar-2025 20:33:36.996 general: error: unable to convert libuv error code in 
start_udp_child_job (netmgr/udp.c:172) to isc_result: -19: no such device
10-Mar-2025 20:33:36.996 network: error: creating IPv6 interface veth731351f 
failed; interface ignored
10-Mar-2025 20:33:36.996 network: info: listening on IPv6 interface 
vetha808625, fe80::d0cf:5fff:fe3a:1e50%954915#53
10-Mar-2025 20:33:36.998 network: info: listening on IPv6 interface 
veth92035bc, fe80::58f0:c5ff:fecf:4a8d%954971#53
10-Mar-2025 20:33:37.000 network: info: listening on IPv6 interface 
vethb1ef26b, fe80::58e2:d2ff:fe3f:c77f%955141#53
10-Mar-2025 20:33:37.003 network: info: listening on IPv6 interface 
veth0ee3ea4, fe80::44be:c7ff:fefd:83fb%955153#53
10-Mar-2025 20:33:37.005 network: info: listening on IPv6 interface 
veth39e879e, fe80::34fb:98ff:fe9e:d49f%955162#53
10-Mar-2025 20:33:37.007 network: info: listening on IPv6 interface 
veth2f2d6df, fe80::2c2b:e8ff:fe8e:2339%955167#53
10-Mar-2025 20:33:37.010 network: info: listening on IPv6 interface 
vetha0e2b2b, fe80::84fd:7aff:fe72:9c82%955207#53
10-Mar-2025 20:33:37.012 network: info: listening on IPv6 interface 
vethb633142, fe80::58a5:32ff:feaf:bdb2%955208#53
10-Mar-2025 20:33:37.014 network: info: listening on IPv6 interface 
veth232d291, fe80::f442:a2ff:fe0d:18f8%955383#53
10-Mar-2025 20:33:37.017 network: info: listening on IPv6 interface 
vetha87c0e9, fe80::2431:26ff:fe1e:adac%955384#53
10-Mar-2025 20:33:37.021 network: info: listening on IPv6 interface 
vethadab24f, fe80::7d:44ff:fe11:7284%955606#53
10-Mar-2025 20:33:37.024 network: info: listening on IPv6 interface 
vethe9c8381, fe80::1847:42ff:fe98:cd5c%955655#53
10-Mar-2025 20:33:37.026 network: info: listening on IPv6 interface 
veth5f5869a, fe80::ec06:66ff:fe5d:ef74%955668#53
10-Mar-2025 20:33:37.029 network: info: listening on IPv6 interface 
vethe46d2e1, fe80::f48e:14ff:fe94:2efd%955683#53
10-Mar-2025 20:33:37.032 network: info: listening on IPv6 interface 
vethf87bbe4, fe80::6c0b:47ff:fed2:404d%955686#53
10-Mar-2025 20:33:37.035 network: info: listening on IPv6 interface 
veth207c7ca, fe80::f019:b8ff:feda:517d%955692#53
10-Mar-2025 20:33:37.038 network: info: listening on IPv6 interface 
veth1654fa8, fe80::fc83:fcff:fe79:8f01%955718#53
10-Mar-2025 20:33:37.041 network: info: listening on IPv6 interface 
vethe4e528f, fe80::901d:7fff:fe58:ed2%955719#53
10-Mar-2025 20:33:37.041 general: critical: 
netmgr/udp.c:77:isc__nm_udp_lb_socket(): fatal error:
10-Mar-2025 20:33:37.041 general: critical: RUNTIME_CHECK(result == 
ISC_R_SUCCESS) failed
10-Mar-2025 20:33:37.041 general: critical: exiting (due to fatal error in 
library)

As first aid, I added a script that simply restarts the nameserver if it 
crashes. This showed me two things:

1. If the server has crashed, a restart will also fail for the next one or two 
minutes.

2. The crashes seem to correlate with the other main load that I have on this 
machine: a couple of hundred docker containers (each of which apparently sets 
up a network device on the host system) that are started every ten minutes and 
run for a few minutes (in rare cases longer). Counting the assertion log 
entries by the last digit of the minute shows a clear emphasis on the minutes 
when many containers start(?)/run/stop:

$ grep -F 'RUNTIME_CHECK(result == ISC_R_SUCCESS)' /var/log/named.log \
    | cut -d' ' -f2 | cut -d: -f2 | cut -c2 | sort | uniq -c
  5976 0
 14767 1
 42850 2
 31292 3
   693 4
   204 5
   199 6
   211 7
   226 8
   198 9

The containers are started via a cronjob:
*/10 * * * *  /home/erich/git/archlinuxewe/build-all-with-docker

In between the crashes, the nameserver seems to run as expected. Also, the 
docker containers (which require working name resolution on the host system) do 
not always fail, so at least some of the time named seems to successfully 
process the requests of the containers.

I hope someone has an idea where I should look. It feels strange that such a 
"reference" product as BIND should be crashable simply by having a large number 
of fluctuating network devices.

Some side notes, maybe less related to the issue at hand, but I still want to 
write them down here in case they are relevant:

The system seems to be somewhat under load while the containers run, but I 
would be astonished if this caused BIND to crash: RAM usage goes up to 16GB of 
the 128GB available, though CPU usage does go up to 100%.

I have a second, similar machine (same distribution, similar setup regarding BIND), but 
without the "pulsed" load of docker containers, where named has been running for 
*looks*up*the*numbers* more than 8 days without crashes (which matches the uptime of that 
machine).

I wanted to open a bug at gitlab.isc.org, but my account ("deep42thought", under 
which I reported something a few years ago) got blocked again after being reactivated, 
because I did not notice the big warning on the login page stating exactly this behaviour 
and took more than a day to gather the information for the bug. :-( Maybe someone can 
unblock me, so that I can add 2FA to keep the account active?

Some time ago I tried to get the stats channel working through

options {
   zone-statistics full;
};
statistics-channels {
   inet 127.0.0.1 port 8053;
};

but this seemed to crash the server back then. Since it was just a toy 
project, I didn't pursue it any further and removed it from the config 
quite some time ago.

regards,
Erich
--
Visit https://lists.isc.org/mailman/listinfo/bind-users to unsubscribe from 
this list

ISC funds the development of this software with paid support subscriptions. 
Contact us at https://www.isc.org/contact/ for more information.


bind-users mailing list
bind-users@lists.isc.org
https://lists.isc.org/mailman/listinfo/bind-users
