Hello everyone,

I have found an issue in dnsmasq v2.90 that is causing problems in our 
OpenStack environments.  When our Neutron agents rewrite the configs and send a 
SIGHUP to trigger a reload, dnsmasq will (usually) crash with a SIGABRT signal.  
This only seems to happen in our busiest OpenStack regions, where VMs are 
coming and going constantly, causing dnsmasq to reload many times per minute.  
In other regions where no new VMs are being created, the reloads work 
fine with no crashes.

I investigated in a very busy region where I see dozens of crashes per minute.  
That region only uses dnsmasq for DHCP; it is not receiving DNS queries.  This 
is a production environment, but I rebuilt dnsmasq with debug symbols and 
managed to capture the following with gdb when it crashed.  I tried it a few 
times and the crash always has the same stack trace.
################################################################################
Reading symbols from /usr/lib/debug/usr/sbin/dnsmasq-2.90-1.el9.x86_64.debug...
Attaching to program: /usr/lib/debug/usr/sbin/dnsmasq-2.90-1.el9.x86_64.debug, 
process 3075598
<snip loading messages>
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib64/libthread_db.so.1".
0x00007ff1c8c62ac7 in poll () from target:/lib64/libc.so.6
(gdb) c
Continuing.

Program received signal SIGHUP, Hangup.
0x00007ff1c8c62ac7 in poll () from target:/lib64/libc.so.6
(gdb) c
Continuing.

Program received signal SIGABRT, Aborted.
0x00007ff1c8beca6c in __pthread_kill_implementation () from 
target:/lib64/libc.so.6
(gdb) where
#0  0x00007ff1c8beca6c in __pthread_kill_implementation () from 
target:/lib64/libc.so.6
#1  0x00007ff1c8b9f686 in raise () from target:/lib64/libc.so.6
#2  0x00007ff1c8b89833 in abort () from target:/lib64/libc.so.6
#3  0x00007ff1c8b8a170 in __libc_message.cold () from target:/lib64/libc.so.6
#4  0x00007ff1c8bf6b17 in malloc_printerr () from target:/lib64/libc.so.6
#5  0x00007ff1c8bf8800 in _int_free () from target:/lib64/libc.so.6
#6  0x00007ff1c8bfae55 in free () from target:/lib64/libc.so.6
#7  0x000055f6521e0c18 in dhcp_netid_free (nid=0x7ff1c8bfae55 <free+85>) at 
/usr/src/debug/dnsmasq-2.90-1.el9.x86_64/src/option.c:1333
#8  dhcp_netid_list_free (netid=0x0) at 
/usr/src/debug/dnsmasq-2.90-1.el9.x86_64/src/option.c:1363
#9  dhcp_config_free (config=0x55f652b51a60) at 
/usr/src/debug/dnsmasq-2.90-1.el9.x86_64/src/option.c:1381
#10 0x000055f652b51930 in ?? ()
#11 0x000055f6529eb1f8 in ?? ()
#12 0x0000000000000fa4 in ?? ()
#13 0x000055f6529eaf60 in ?? ()
#14 0x000055f6529eaf60 in ?? ()
#15 0x000055f6521f5259 in clear_dynamic_conf () at 
/usr/src/debug/dnsmasq-2.90-1.el9.x86_64/src/option.c:5777
#16 reread_dhcp () at /usr/src/debug/dnsmasq-2.90-1.el9.x86_64/src/option.c:5818
#17 clear_cache_and_reload (now=94516438056960) at 
/usr/src/debug/dnsmasq-2.90-1.el9.x86_64/src/dnsmasq.c:1742
#18 0x4141414141414141 in ?? ()
#19 0x0000000067ae1dbd in ?? ()
#20 0x0000000000000000 in ?? ()
(gdb)
################################################################################
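
In case it helps anyone else capture the same thing without sitting at the gdb 
prompt, the interactive session above could probably be scripted roughly like 
this.  This is only a sketch: the function names are my own, and it assumes gdb 
is on the PATH and that a backtrace with locals ("bt full") is what is wanted.  
The gdb options used (-p, -batch, -ex) and the "handle" command are standard.

```python
import subprocess
import sys


def build_gdb_command(pid, gdb="gdb"):
    """Return an argv list that attaches gdb to `pid` and dumps a backtrace at the next stop."""
    return [
        gdb,
        "-p", str(pid),
        "-batch",                                     # exit once the commands below have run
        "-ex", "handle SIGHUP nostop noprint pass",   # pass reload signals through without stopping
        "-ex", "continue",                            # run until the next stopping signal (e.g. SIGABRT)
        "-ex", "bt full",                             # backtrace including local variables
    ]


def capture_backtrace(pid, out_path):
    """Run gdb and write everything it prints to `out_path`; returns gdb's exit code."""
    with open(out_path, "w") as out:
        result = subprocess.run(
            build_gdb_command(pid), stdout=out, stderr=subprocess.STDOUT
        )
    return result.returncode


if __name__ == "__main__":
    # Usage (hypothetical file name): python3 capture_bt.py <dnsmasq-pid> <output-file>
    capture_backtrace(int(sys.argv[1]), sys.argv[2])
```

Unlike my session above, this does not stop on SIGHUP, so the process keeps 
running through reloads until something else (such as the SIGABRT) stops it.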

The dnsmasq command line looks like this (lightly redacted):
dnsmasq --no-hosts --no-resolv \
    --pid-file=/var/lib/neutron/dhcp/xxx/pid \
    --dhcp-hostsfile=/var/lib/neutron/dhcp/xxx/host \
    --addn-hosts=/var/lib/neutron/dhcp/xxx/addn_hosts \
    --dhcp-optsfile=/var/lib/neutron/dhcp/xxx/opts \
    --dhcp-leasefile=/var/lib/neutron/dhcp/xxx/leases \
    --dhcp-match=set:ipxe,175 \
    --dhcp-userclass=set:ipxe6,iPXE \
    --local-service \
    --bind-dynamic \
    --dhcp-range=set:subnet-yyy,10.1.1.0,static,255.255.248.0,86400s \
    --dhcp-range=set:subnet-zzz,10.2.1.0,static,255.255.252.0,86400s \
    --dhcp-option-force=option:mtu,1500 \
    --dhcp-lease-max=3072 \
    --conf-file=/etc/neutron/dnsmasq-neutron.conf

The /etc/neutron/dnsmasq-neutron.conf file only sets these options (lightly 
redacted):
dhcp-boot=smsboot\pxelinux.com,boothost,10.0.1.2
dhcp-option=option:ntp-server,10.0.0.1,10.0.1.1,10.0.2.1

The /var/lib/neutron/dhcp/xxx/host file contains between 800 and 3000 entries, 
depending on the time of day.  They each look something like this (lightly 
redacted):
fa:16:3e:3b:ad:b9,set:16a8f84b90f640f7a2c9a133d844985e,host-10-1-2-3,10.1.2.3

The /var/lib/neutron/dhcp/xxx/addn_hosts file contains between 800 and 3000 
entries, depending on the time of day.  They each look something like this 
(lightly redacted):
10.1.9.9     np0006812233.subdomain.subdomain.mycorp.com. np0006812233

The /var/lib/neutron/dhcp/xxx/opts file contains about 190 entries.  The top of 
the file looks like this; the rest of the entries are just like the last two 
lines, defining more domain-name and domain-search values for additional 
subdomains (lightly redacted):
tag:subnet-xxx,option:dns-server,10.0.0.10,10.0.0.11
tag:subnet-xxx,option:classless-static-route,10.1.1.0/22,0.0.0.0,169.254.169.254/32,10.2.1.30,0.0.0.0/0,10.2.1.1
tag:subnet-xxx,249,10.1.1.0/22,0.0.0.0,169.254.169.254/32,10.2.1.30,0.0.0.0/0,10.2.1.1
tag:subnet-xxx,option:router,10.2.1.1
tag:subnet-yyy,option:dns-server,10.0.0.10,10.0.0.11
tag:subnet-yyy,option:classless-static-route,10.2.1.0/21,0.0.0.0,169.254.169.254/32,10.1.1.30,0.0.0.0/0,10.1.1.1
tag:subnet-yyy,249,10.2.1.0/21,0.0.0.0,169.254.169.254/32,10.1.1.30,0.0.0.0/0,10.1.1.1
tag:subnet-yyy,option:router,10.1.1.1
tag:16a8f84b90f640f7a2c9a133d844985e,option:domain-name,subdomain.subdomain.mycorp.com
tag:16a8f84b90f640f7a2c9a133d844985e,option:domain-search,subdomain.subdomain.mycorp.com,subdomain.mycorp.com,mycorp.com

The /var/lib/neutron/dhcp/xxx/leases file contains between 800 and 3000 
entries, depending on the time of day.  They each look something like this 
(lightly redacted):
1739552375 fa:16:3e:3b:ad:b9 10.1.2.3 np0006812233 *
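
If someone wants to try reproducing this outside production, the reload 
pattern described above (rewrite the hosts file, then SIGHUP) could be 
approximated with something like the following.  This is an untested sketch 
under several assumptions: the function names, the synthetic MAC/IP/tag values 
and the timing are all made up, the addresses would need adapting to the 
dhcp-range in use, and I have not checked whether the Neutron agent really 
replaces its files by rename the way this does.

```python
import os
import signal
import sys
import tempfile
import time


def host_line(i):
    """Build one synthetic dhcp-hostsfile entry shaped like the example entry above."""
    a, b = i // 254, i % 254 + 1          # avoid .0 and .255 in the last octet
    mac = "fa:16:3e:00:%02x:%02x" % (a, b)
    ip = "10.1.%d.%d" % (a, b)
    return "%s,set:%032x,host-%s,%s" % (mac, i, ip.replace(".", "-"), ip)


def write_atomically(path, lines):
    """Write `lines` to a temp file in the same directory, then rename it over `path`."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "w") as f:
        f.write("\n".join(lines) + "\n")
    os.chmod(tmp, 0o644)                  # mkstemp creates 0600; dnsmasq may re-read unprivileged
    os.replace(tmp, path)


def stress(pid, hosts_path, rounds=100, base=800, delay=0.5):
    """Rewrite the hosts file with a varying entry count and SIGHUP dnsmasq each round."""
    for r in range(rounds):
        count = base + (r * 37) % 200     # change the number of entries between reloads
        write_atomically(hosts_path, [host_line(i) for i in range(count)])
        os.kill(pid, signal.SIGHUP)       # raises ProcessLookupError once the process is gone
        time.sleep(delay)


if __name__ == "__main__":
    # Usage (hypothetical file name): python3 reload_stress.py <dnsmasq-pid> <hostsfile-path>
    stress(int(sys.argv[1]), sys.argv[2])
```

It only touches the hosts file; the addn_hosts and opts files would presumably 
need the same treatment to match what the agents do.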

What can I do to help troubleshoot this?  I know C but I’m not familiar with 
the dnsmasq code.  Thanks in advance!

-- Sam Clippinger
