> bind crashes with assertion, maybe due to many ephemeral network devices?
Looking at the symptoms and your description, I actually think this is a problem of interfaces appearing during the network interface scan and then disappearing before named can process them. I would suggest to disable the automatic-interface-scan and setup named to listen of fixed addresses so it doesn't have to deal with the mayhem the docker is creating. I've unblocked and "trusted" your account, so it should not get blocked again. If you setup 2fa on the account it also acts as a permanent marked this not a spam account. Feel free to fill the issue, but I can't promise this will be looked at quite soon as this is in the "doctor it hurts when I do this" territory. Ondrej -- Ondřej Surý (He/Him) ond...@isc.org My working hours and your working hours may be different. Please do not feel obligated to reply outside your normal working hours. > On 10. 3. 2025, at 21:19, Erich Eckner <b...@eckner.net> wrote: > > Hi, > > I'm running bind version 9.20.6 on artix linux (an arch linux derivate > without systemd) with a pretty standard config: > > # named -V > BIND 9.20.6 (Stable Release) <id:72cbad0> > running on Linux x86_64 6.13.5-artix1-1 #1 SMP PREEMPT_DYNAMIC Fri, 28 Feb > 2025 10:18:15 +0000 > built by make with '--prefix=/usr' '--sysconfdir=/etc' '--sbindir=/usr/bin' > '--localstatedir=/var' '--disable-static' '--enable-fixed-rrset' > '--enable-full-report' '--with-maxminddb' '--with-openssl' '--with-libidn2' > '--with-json-c' '--with-libxml2' '--with-lmdb' 'CFLAGS=-march=x86-64 > -mtune=generic -O2 -pipe -fno-plt -fexceptions > -Wp,-D_FORTIFY_SOURCE=3 -Wformat -Werror=format-security > -fstack-clash-protection -fcf-protection -flto=auto -DDIG_SIGCHASE' > 'LDFLAGS=-Wl,-O1 -Wl,--sort-common -Wl,--as-needed -Wl,-z,relro -Wl,-z,now > -Wl,-z,pack-relative-relocs -flto=auto' > compiled by GCC 14.2.1 20250207 > compiled with OpenSSL version: OpenSSL 3.4.1 11 Feb 2025 > linked to OpenSSL version: OpenSSL 3.4.1 11 Feb 2025 > compiled with libuv version: 1.50.0 > linked to libuv version: 1.50.0 > compiled with liburcu version: 0.15.0 > compiled with jemalloc version: 5.3.0 > compiled with libnghttp2 version: 1.64.0 > linked to libnghttp2 version: 1.65.0 > compiled with libxml2 version: 2.13.5 > linked to libxml2 version: 21306-GITv2.13.6 > compiled with json-c version: 0.18 > linked to json-c version: 0.18 > compiled with zlib version: 1.3.1 > linked to zlib version: 1.3.1 > linked to maxminddb version: 1.12.2 > threads support is enabled > DNSSEC algorithms: RSASHA1 NSEC3RSASHA1 RSASHA256 RSASHA512 ECDSAP256SHA256 > ECDSAP384SHA384 ED25519 ED448 > DS algorithms: SHA-1 SHA-256 SHA-384 > HMAC algorithms: HMAC-MD5 HMAC-SHA1 HMAC-SHA224 HMAC-SHA256 HMAC-SHA384 > HMAC-SHA512 > TKEY mode 2 support (Diffie-Hellman): no > TKEY mode 3 support (GSS-API): yes > > default paths: > named configuration: /etc/named.conf > rndc configuration: /etc/rndc.conf > nsupdate session key: /var/run/named/session.key > named PID file: /var/run/named/named.pid > geoip-directory: /usr/share/GeoIP > > > # grep '^\s*[^[:space:]#/]' /etc/named.conf > options { > directory "/var/named"; > pid-file "/run/named/named.pid"; > allow-recursion { 127.0.0.1; 192.168.188.0/24; }; > allow-transfer { none; }; > allow-update { none; }; > version none; > hostname none; > server-id none; > }; > zone "localhost" IN { > type master; > file "localhost.zone"; > }; > zone "0.0.127.in-addr.arpa" IN { > type master; > file "127.0.0.zone"; > }; > zone > "1.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.ip6.arpa" { > type master; > file "localhost.ip6.zone"; > }; > > # pgrep -af named > 22958 /usr/bin/named -u named -L /var/log/named.log > > Since a few days (or weeks?) now, it started to act up. Every few ten > minutes, it crashes with: > > 10-Mar-2025 20:33:36.995 general: error: uv.c:95:isc__uverr2result(): > unexpected error: > 10-Mar-2025 20:33:36.995 general: error: unable to convert libuv error code > in start_udp_child_job (netmgr/udp.c:172) to isc_result: -19: no such device > 10-Mar-2025 20:33:36.995 general: error: uv.c:95:isc__uverr2result(): > unexpected error: > 10-Mar-2025 20:33:36.995 general: error: unable to convert libuv error code > in start_udp_child_job (netmgr/udp.c:172) to isc_result: -19: no such device > 10-Mar-2025 20:33:36.995 general: error: uv.c:95:isc__uverr2result(): > unexpected error: > 10-Mar-2025 20:33:36.995 general: error: unable to convert libuv error code > in start_udp_child_job (netmgr/udp.c:172) to isc_result: -19: no such device > 10-Mar-2025 20:33:36.995 general: error: uv.c:95:isc__uverr2result(): > unexpected error: > 10-Mar-2025 20:33:36.995 general: error: unable to convert libuv error code > in start_udp_child_job (netmgr/udp.c:172) to isc_result: -19: no such device > 10-Mar-2025 20:33:36.995 general: error: uv.c:95:isc__uverr2result(): > unexpected error: > 10-Mar-2025 20:33:36.995 general: error: unable to convert libuv error code > in start_udp_child_job (netmgr/udp.c:172) to isc_result: -19: no such device > 10-Mar-2025 20:33:36.995 general: error: uv.c:95:isc__uverr2result(): > unexpected error: > 10-Mar-2025 20:33:36.995 general: error: unable to convert libuv error code > in start_udp_child_job (netmgr/udp.c:172) to isc_result: -19: no such device > 10-Mar-2025 20:33:36.995 general: error: uv.c:95:isc__uverr2result(): > unexpected error: > 10-Mar-2025 20:33:36.995 general: error: unable to convert libuv error code > in start_udp_child_job (netmgr/udp.c:172) to isc_result: -19: no such device > 10-Mar-2025 20:33:36.995 general: error: uv.c:95:isc__uverr2result(): > unexpected error: > 10-Mar-2025 20:33:36.995 general: error: unable to convert libuv error code > in start_udp_child_job (netmgr/udp.c:172) to isc_result: -19: no such device > 10-Mar-2025 20:33:36.995 general: error: uv.c:95:isc__uverr2result(): > unexpected error: > 10-Mar-2025 20:33:36.995 general: error: unable to convert libuv error code > in start_udp_child_job (netmgr/udp.c:172) to isc_result: -19: no such device > 10-Mar-2025 20:33:36.995 general: error: uv.c:95:isc__uverr2result(): > unexpected error: > 10-Mar-2025 20:33:36.995 general: error: unable to convert libuv error code > in start_udp_child_job (netmgr/udp.c:172) to isc_result: -19: no such device > 10-Mar-2025 20:33:36.995 general: error: uv.c:95:isc__uverr2result(): > unexpected error: > 10-Mar-2025 20:33:36.995 general: error: unable to convert libuv error code > in start_udp_child_job (netmgr/udp.c:172) to isc_result: -19: no such device > 10-Mar-2025 20:33:36.996 general: error: uv.c:95:isc__uverr2result(): > unexpected error: > 10-Mar-2025 20:33:36.996 general: error: unable to convert libuv error code > in start_udp_child_job (netmgr/udp.c:172) to isc_result: -19: no such device > 10-Mar-2025 20:33:36.996 general: error: uv.c:95:isc__uverr2result(): > unexpected error: > 10-Mar-2025 20:33:36.996 general: error: unable to convert libuv error code > in start_udp_child_job (netmgr/udp.c:172) to isc_result: -19: no such device > 10-Mar-2025 20:33:36.996 general: error: unable to convert libuv error code > in start_udp_child_job (netmgr/udp.c:172) to isc_result: -19: no such device > 10-Mar-2025 20:33:36.996 general: error: uv.c:95:isc__uverr2result(): > unexpected error: > 10-Mar-2025 20:33:36.996 general: error: unable to convert libuv error code > in start_udp_child_job (netmgr/udp.c:172) to isc_result: -19: no such device > 10-Mar-2025 20:33:36.996 general: error: uv.c:95:isc__uverr2result(): > unexpected error: > 10-Mar-2025 20:33:36.996 general: error: unable to convert libuv error code > in start_udp_child_job (netmgr/udp.c:172) to isc_result: -19: no such device > 10-Mar-2025 20:33:36.996 general: error: uv.c:95:isc__uverr2result(): > unexpected error: > 10-Mar-2025 20:33:36.996 general: error: unable to convert libuv error code > in start_udp_child_job (netmgr/udp.c:172) to isc_result: -19: no such device > 10-Mar-2025 20:33:36.996 general: error: unable to convert libuv error code > in start_udp_child_job (netmgr/udp.c:172) to isc_result: -19: no such device > 10-Mar-2025 20:33:36.996 general: error: uv.c:95:isc__uverr2result(): > unexpected error: > 10-Mar-2025 20:33:36.996 general: error: unable to convert libuv error code > in start_udp_child_job (netmgr/udp.c:172) to isc_result: -19: no such device > 10-Mar-2025 20:33:36.996 general: error: uv.c:95:isc__uverr2result(): > unexpected error: > 10-Mar-2025 20:33:36.996 general: error: unable to convert libuv error code > in start_udp_child_job (netmgr/udp.c:172) to isc_result: -19: no such device > 10-Mar-2025 20:33:36.996 general: error: uv.c:95:isc__uverr2result(): > unexpected error: > 10-Mar-2025 20:33:36.996 general: error: unable to convert libuv error code > in start_udp_child_job (netmgr/udp.c:172) to isc_result: -19: no such device > 10-Mar-2025 20:33:36.996 general: error: uv.c:95:isc__uverr2result(): > unexpected error: > 10-Mar-2025 20:33:36.996 general: error: unable to convert libuv error code > in start_udp_child_job (netmgr/udp.c:172) to isc_result: -19: no such device > 10-Mar-2025 20:33:36.996 general: error: uv.c:95:isc__uverr2result(): > unexpected error: > 10-Mar-2025 20:33:36.996 general: error: unable to convert libuv error code > in start_udp_child_job (netmgr/udp.c:172) to isc_result: -19: no such device > 10-Mar-2025 20:33:36.996 network: error: creating IPv6 interface veth731351f > failed; interface ignored > 10-Mar-2025 20:33:36.996 network: info: listening on IPv6 interface > vetha808625, fe80::d0cf:5fff:fe3a:1e50%954915#53 > 10-Mar-2025 20:33:36.998 network: info: listening on IPv6 interface > veth92035bc, fe80::58f0:c5ff:fecf:4a8d%954971#53 > 10-Mar-2025 20:33:37.000 network: info: listening on IPv6 interface > vethb1ef26b, fe80::58e2:d2ff:fe3f:c77f%955141#53 > 10-Mar-2025 20:33:37.003 network: info: listening on IPv6 interface > veth0ee3ea4, fe80::44be:c7ff:fefd:83fb%955153#53 > 10-Mar-2025 20:33:37.005 network: info: listening on IPv6 interface > veth39e879e, fe80::34fb:98ff:fe9e:d49f%955162#53 > 10-Mar-2025 20:33:37.007 network: info: listening on IPv6 interface > veth2f2d6df, fe80::2c2b:e8ff:fe8e:2339%955167#53 > 10-Mar-2025 20:33:37.010 network: info: listening on IPv6 interface > vetha0e2b2b, fe80::84fd:7aff:fe72:9c82%955207#53 > 10-Mar-2025 20:33:37.012 network: info: listening on IPv6 interface > vethb633142, fe80::58a5:32ff:feaf:bdb2%955208#53 > 10-Mar-2025 20:33:37.014 network: info: listening on IPv6 interface > veth232d291, fe80::f442:a2ff:fe0d:18f8%955383#53 > 10-Mar-2025 20:33:37.017 network: info: listening on IPv6 interface > vetha87c0e9, fe80::2431:26ff:fe1e:adac%955384#53 > 10-Mar-2025 20:33:37.021 network: info: listening on IPv6 interface > vethadab24f, fe80::7d:44ff:fe11:7284%955606#53 > 10-Mar-2025 20:33:37.024 network: info: listening on IPv6 interface > vethe9c8381, fe80::1847:42ff:fe98:cd5c%955655#53 > 10-Mar-2025 20:33:37.026 network: info: listening on IPv6 interface > veth5f5869a, fe80::ec06:66ff:fe5d:ef74%955668#53 > 10-Mar-2025 20:33:37.029 network: info: listening on IPv6 interface > vethe46d2e1, fe80::f48e:14ff:fe94:2efd%955683#53 > 10-Mar-2025 20:33:37.032 network: info: listening on IPv6 interface > vethf87bbe4, fe80::6c0b:47ff:fed2:404d%955686#53 > 10-Mar-2025 20:33:37.035 network: info: listening on IPv6 interface > veth207c7ca, fe80::f019:b8ff:feda:517d%955692#53 > 10-Mar-2025 20:33:37.038 network: info: listening on IPv6 interface > veth1654fa8, fe80::fc83:fcff:fe79:8f01%955718#53 > 10-Mar-2025 20:33:37.041 network: info: listening on IPv6 interface > vethe4e528f, fe80::901d:7fff:fe58:ed2%955719#53 > 10-Mar-2025 20:33:37.041 general: critical: > netmgr/udp.c:77:isc__nm_udp_lb_socket(): fatal error: > 10-Mar-2025 20:33:37.041 general: critical: RUNTIME_CHECK(result == > ISC_R_SUCCESS) failed > 10-Mar-2025 20:33:37.041 general: critical: exiting (due to fatal error in > library) > > As a first-aid, I added a script to simply restart the nameserver, if it > crashes. This showed me two things: > > 1. If the server crashed, a restart will fail for the next one or two > minutes, too. > > 2. The crashes seem to correlate with the other main load, that I have on > this machine: A couple hundred docker containers (each of which apparently > setting up a network device on the host system), that are started every ten > minutes and run for a few minutes (in rare cases longer). Looking at the > minutes of the assertion-logs, there is a clear emphasis on minutes when many > containers start(?)/run/stop: > > $ grep -F 'RUNTIME_CHECK(result == ISC_R_SUCCESS)' /var/log/named.log | cut > -d' ' -f2 | cut -d: -f2 | cut -c2 | sort | uniq -c > 5976 0 > 14767 1 > 42850 2 > 31292 3 > 693 4 > 204 5 > 199 6 > 211 7 > 226 8 > 198 9 > > The containers are started via a cronjob: > */10 * * * * /home/erich/git/archlinuxewe/build-all-with-docker > > In between the crashes, the nameserver seems to run as-expected. Also, the > docker containers (which require working name resolution on the host system) > do not always fail, so at least sometime / somewhen, named seems to > successfully process the requests of the containers. > > I hope, someone has an idea, where I should look at. It feels strange, that > such a "reference" product as bind should be crashable simply by having a big > number of fluctuating network devices. > > Some side notes, maybe less related to the issue at hand, but I still want to > write them here for the case, that they are relevant: > > The system seems to be somewhat under load during the run of the containers, > but I would be astonished, if this would cause bind to crash: RAM usage goes > up to 16GB of 128GB possible, CPU goes up to 100%, though. > > I have a second, similar machine (same distribution, similar setup regarding > bind), but without the "pulsed" load of docker containers, where named is > running since *looks*up*the*numbers* more than 8 days without crashes (which > matches the uptime of that machine). > > I wanted to open a bug at gitlab.isc.org, but my account ("deep42thought" > under which I reported something a few years ago) got blocked after getting > reactivated again, because I did not notice the big warning on the login page > stating exactly this behaviour and took >1 day to gather the information for > the bug. :-( Maybe someone can unblock me, then I could add 2FA to persist > the account? > > Some time ago I tried to get the stats channel working through > > options { > zone-statistics full; > } > statistics-channels { > inet 127.0.0.1 port 8053; > }; > > but this seemed to crash the server back then. And since it was just a toy > project, I didn't pursue it any further and have removed it from the config > since quite some time. > > regards, > Erich > -- > Visit https://lists.isc.org/mailman/listinfo/bind-users to unsubscribe from > this list > > ISC funds the development of this software with paid support subscriptions. > Contact us at https://www.isc.org/contact/ for more information. > > > bind-users mailing list > bind-users@lists.isc.org > https://lists.isc.org/mailman/listinfo/bind-users
signature.asc
Description: Message signed with OpenPGP
-- Visit https://lists.isc.org/mailman/listinfo/bind-users to unsubscribe from this list ISC funds the development of this software with paid support subscriptions. Contact us at https://www.isc.org/contact/ for more information. bind-users mailing list bind-users@lists.isc.org https://lists.isc.org/mailman/listinfo/bind-users