Hello everyone,

I have found an issue in dnsmasq v2.90 that is causing problems in our OpenStack environments. When our Neutron agents rewrite the configs and send a SIGHUP to trigger a reload, dnsmasq will (usually) crash with a SIGABRT signal. This only seems to happen in our busiest OpenStack regions, where VMs are coming and going constantly, causing dnsmasq to reload many times per minute. In other regions where no new VMs are being created, the reloads work fine with no crashes.
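For anyone who wants to try reproducing this outside production, the rewrite-and-SIGHUP pattern described above could be approximated with something like the sketch below. This is not the Neutron agent's code. The function names (host_entry, write_hosts, reload_loop), the generated MAC/tag/IP values, the entry counts, and the write-then-rename step are all assumptions made for illustration. The generated addresses are not matched to any real dhcp-range.

```python
import os
import signal
import tempfile
import time
from pathlib import Path


def host_entry(index: int) -> str:
    """Build one hosts-file line shaped like MAC,set:TAG,HOSTNAME,IP."""
    # Spread the index over two octets so every entry is unique.
    hi, lo = divmod(index, 250)
    mac = f"fa:16:3e:00:{hi:02x}:{lo:02x}"
    tag = f"{index:032x}"  # 32 hex characters, like the tags in the report
    ip = f"10.1.{hi}.{lo + 1}"
    return f"{mac},set:{tag},host-{ip.replace('.', '-')},{ip}"


def write_hosts(path: Path, count: int) -> None:
    """Replace the hosts file with `count` generated entries."""
    lines = [host_entry(i) for i in range(count)]
    # Assumption: write a temp file in the same directory and rename it
    # over the target, so a reader never sees a half-written file.
    fd, tmp = tempfile.mkstemp(dir=path.parent)
    with os.fdopen(fd, "w") as handle:
        handle.write("\n".join(lines) + "\n")
    os.replace(tmp, path)


def reload_loop(pid_file: Path, hosts_file: Path, rounds: int, delay: float) -> int:
    """Rewrite the hosts file and send SIGHUP repeatedly.

    Returns the number of SIGHUPs delivered before the process vanished
    (or `rounds` if it survived every reload).
    """
    pid = int(pid_file.read_text().strip())
    for round_no in range(rounds):
        # Vary the entry count between rounds, roughly 800 to 3000.
        write_hosts(hosts_file, 800 + (round_no * 37) % 2200)
        try:
            os.kill(pid, signal.SIGHUP)
        except ProcessLookupError:
            print(f"process {pid} is gone after {round_no} reloads")
            return round_no
        time.sleep(delay)
    return rounds
```

To use it, point reload_loop at the pid file and --dhcp-hostsfile path of a test dnsmasq instance, for example reload_loop(Path("/tmp/test/pid"), Path("/tmp/test/host"), 1000, 0.5), where both paths are placeholders. Whether this synthetic load is enough to trigger the same abort is untested.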
I investigated in a very busy region where I see dozens of crashes per minute. It is only using dnsmasq for DHCP, it is not receiving DNS queries. This is a production environment, but I rebuilt dnsmasq with debug symbols and managed to capture this with gdb when it crashes. I tried it a few times and the crash always has the same stack trace.

################################################################################
Reading symbols from /usr/lib/debug/usr/sbin/dnsmasq-2.90-1.el9.x86_64.debug...
Attaching to program: /usr/lib/debug/usr/sbin/dnsmasq-2.90-1.el9.x86_64.debug, process 3075598
<snip loading messages>
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib64/libthread_db.so.1".
0x00007ff1c8c62ac7 in poll () from target:/lib64/libc.so.6
(gdb) c
Continuing.

Program received signal SIGHUP, Hangup.
0x00007ff1c8c62ac7 in poll () from target:/lib64/libc.so.6
(gdb) c
Continuing.

Program received signal SIGABRT, Aborted.
0x00007ff1c8beca6c in __pthread_kill_implementation () from target:/lib64/libc.so.6
(gdb) where
#0  0x00007ff1c8beca6c in __pthread_kill_implementation () from target:/lib64/libc.so.6
#1  0x00007ff1c8b9f686 in raise () from target:/lib64/libc.so.6
#2  0x00007ff1c8b89833 in abort () from target:/lib64/libc.so.6
#3  0x00007ff1c8b8a170 in __libc_message.cold () from target:/lib64/libc.so.6
#4  0x00007ff1c8bf6b17 in malloc_printerr () from target:/lib64/libc.so.6
#5  0x00007ff1c8bf8800 in _int_free () from target:/lib64/libc.so.6
#6  0x00007ff1c8bfae55 in free () from target:/lib64/libc.so.6
#7  0x000055f6521e0c18 in dhcp_netid_free (nid=0x7ff1c8bfae55 <free+85>) at /usr/src/debug/dnsmasq-2.90-1.el9.x86_64/src/option.c:1333
#8  dhcp_netid_list_free (netid=0x0) at /usr/src/debug/dnsmasq-2.90-1.el9.x86_64/src/option.c:1363
#9  dhcp_config_free (config=0x55f652b51a60) at /usr/src/debug/dnsmasq-2.90-1.el9.x86_64/src/option.c:1381
#10 0x000055f652b51930 in ?? ()
#11 0x000055f6529eb1f8 in ?? ()
#12 0x0000000000000fa4 in ?? ()
#13 0x000055f6529eaf60 in ?? ()
#14 0x000055f6529eaf60 in ?? ()
#15 0x000055f6521f5259 in clear_dynamic_conf () at /usr/src/debug/dnsmasq-2.90-1.el9.x86_64/src/option.c:5777
#16 reread_dhcp () at /usr/src/debug/dnsmasq-2.90-1.el9.x86_64/src/option.c:5818
#17 clear_cache_and_reload (now=94516438056960) at /usr/src/debug/dnsmasq-2.90-1.el9.x86_64/src/dnsmasq.c:1742
#18 0x4141414141414141 in ?? ()
#19 0x0000000067ae1dbd in ?? ()
#20 0x0000000000000000 in ?? ()
(gdb)
################################################################################

The dnsmasq command line looks like this (lightly redacted):

dnsmasq --no-hosts --no-resolv \
  --pid-file=/var/lib/neutron/dhcp/xxx/pid \
  --dhcp-hostsfile=/var/lib/neutron/dhcp/xxx/host \
  --addn-hosts=/var/lib/neutron/dhcp/xxx/addn_hosts \
  --dhcp-optsfile=/var/lib/neutron/dhcp/xxx/opts \
  --dhcp-leasefile=/var/lib/neutron/dhcp/xxx/leases \
  --dhcp-match=set:ipxe,175 \
  --dhcp-userclass=set:ipxe6,iPXE \
  --local-service \
  --bind-dynamic \
  --dhcp-range=set:subnet-yyy,10.1.1.0,static,255.255.248.0,86400s \
  --dhcp-range=set:subnet-zzz,10.2.1.0,static,255.255.252.0,86400s \
  --dhcp-option-force=option:mtu,1500 \
  --dhcp-lease-max=3072 \
  --conf-file=/etc/neutron/dnsmasq-neutron.conf

The /etc/neutron/dnsmasq-neutron.conf file only sets these options (lightly redacted):

dhcp-boot=smsboot\pxelinux.com,boothost,10.0.1.2
dhcp-option=option:ntp-server,10.0.0.1,10.0.1.1,10.0.2.1

The /var/lib/neutron/dhcp/xxx/host file contains between 800-3000 entries, depending on the time of day. They each look something like this (lightly redacted):

fa:16:3e:3b:ad:b9,set:16a8f84b90f640f7a2c9a133d844985e,host-10-1-2-3,10.1.2.3

The /var/lib/neutron/dhcp/xxx/addn_hosts file contains between 800-3000 entries, depending on the time of day. They each look something like this (lightly redacted):

10.1.9.9 np0006812233.subdomain.subdomain.mycorp.com. np0006812233

The /var/lib/neutron/dhcp/xxx/opts file contains about 190 entries. The top of the file looks like this, the rest of the entries are just like the last two lines, defining more domain-name and domain-search values for additional subdomains (lightly redacted):

tag:subnet-xxx,option:dns-server,10.0.0.10,10.0.0.11
tag:subnet-xxx,option:classless-static-route,10.1.1.0/22,0.0.0.0,169.254.169.254/32,10.2.1.30,0.0.0.0/0,10.2.1.1
tag:subnet-xxx,249,10.1.1.0/22,0.0.0.0,169.254.169.254/32,10.2.1.30,0.0.0.0/0,10.2.1.1
tag:subnet-xxx,option:router,10.2.1.1
tag:subnet-yyy,option:dns-server,0.0.10,10.0.0.11
tag:subnet-yyy,option:classless-static-route,10.2.1.0/21,0.0.0.0,169.254.169.254/32,10.1.1.30,0.0.0.0/0,10.1.1.1
tag:subnet-yyy,249,10.2.1.0/21,0.0.0.0,169.254.169.254/32,10.1.1.30,0.0.0.0/0,10.1.1.1
tag:subnet-yyy,option:router,10.1.1.1
tag:16a8f84b90f640f7a2c9a133d844985e,option:domain-name,subdomain.subdomain.mycorp.com
tag:16a8f84b90f640f7a2c9a133d844985e,option:domain-search,subdomain.subdomain. mycorp.com,subdomain. mycorp.com, mycorp.com

The /var/lib/neutron/xxx/leases file contains between 800-3000 entries, depending on the time of day. They each look something like this (lightly redacted):

1739552375 fa:16:3e:3b:ad:b9 10.1.2.3 np0006812233 *

What can I do to help troubleshoot this? I know C but I’m not familiar with the dnsmasq code.

Thanks in advance!

--
Sam Clippinger
_______________________________________________
Dnsmasq-discuss mailing list
Dnsmasq-discuss@lists.thekelleys.org.uk
https://lists.thekelleys.org.uk/cgi-bin/mailman/listinfo/dnsmasq-discuss