Indeed, and unfortunately it seems like I've been able to identify the root cause which is still present in 2.90. What changed, we're no longer seeing a segfault which was the tell-tale sign we were looking for, but instead we just see it quietly exit with no trace. This led me to getting a setup where I could reproduce the issue, which we've seen trigger on Ironic's "standalone" jobs as they exercise a number of different scenarios involving port/address updates quite a bit, and I just sat with strace attached to the dnsmasq process.
alarm(77377) = 77377 poll([{fd=4, events=POLLIN}, {fd=5, events=POLLIN}, {fd=6, events=POLLIN}, {fd=7, events=POLLIN}, {fd=8, events=POLLIN}, {fd=9, events=POLLIN}, {fd=10, events=POLLIN}, {fd=11, events=POLLIN}, {fd=12, events=POLLIN}, {fd=13, events=POLLIN}, {fd=14, events=POLLIN}, {fd=15, events=POLLIN}, {fd=16, events=POLLIN}, {fd=17, events=POLLIN}], 14, -1) = ? ERESTART_RESTARTBLOCK (Interrupted by signal) --- SIGHUP {si_signo=SIGHUP, si_code=SI_USER, si_pid=236696, si_uid=0} --- getpid() = 235993 writev(18, [{iov_base="\1\0\0\0\0\0\0\0\0\0\0\0", iov_len=12}], 1) = 12 rt_sigreturn({mask=[]}) = -1 EINTR (Interrupted system call) poll([{fd=4, events=POLLIN}, {fd=5, events=POLLIN}, {fd=6, events=POLLIN}, {fd=7, events=POLLIN}, {fd=8, events=POLLIN}, {fd=9, events=POLLIN}, {fd=10, events=POLLIN}, {fd=11, events=POLLIN}, {fd=12, events=POLLIN}, {fd=13, events=POLLIN}, {fd=14, events=POLLIN}, {fd=15, events=POLLIN}, {fd=16, events=POLLIN}, {fd=17, events=POLLIN}], 14, -1) = 1 ([{fd=17, revents=POLLIN}]) read(17, "\1\0\0\0\0\0\0\0\0\0\0\0", 12) = 12 newfstatat(AT_FDCWD, "/opt/stack/data/neutron/dhcp/77cabae0-26bf-4374-997a-781947f2e5b2/addn_hosts", {st_mode=S_IFREG|0644, st_size=1268, ...}, 0) = 0 openat(AT_FDCWD, "/opt/stack/data/neutron/dhcp/77cabae0-26bf-4374-997a-781947f2e5b2/addn_hosts", O_RDONLY) = 20 newfstatat(20, "", {st_mode=S_IFREG|0644, st_size=1268, ...}, AT_EMPTY_PATH) = 0 read(20, "10.1.0.1\thost-10-1-0-1.openstack"..., 4096) = 1268 read(20, "", 4096) = 0 close(20) = 0 getpid() = 235993 newfstatat(AT_FDCWD, "/etc/localtime", {st_mode=S_IFREG|0644, st_size=114, ...}, 0) = 0 getpid() = 235993 write(19, "<30>Feb 29 22:35:30 dnsmasq[2359"..., 133) = 133 writev(2, [{iov_base="free(): invalid pointer", iov_len=23}, {iov_base="\n", iov_len=1}], 2) = 24 mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7fec31710000 rt_sigprocmask(SIG_UNBLOCK, [ABRT], NULL, 8) = 0 gettid() = 235993 getpid() = 235993 tgkill(235993, 235993, SIGABRT) = 0 --- SIGABRT {si_signo=SIGABRT, si_code=SI_TKILL, si_pid=235993, si_uid=65534} --- +++ killed by SIGABRT +++ The above was captured from a dnsmasq install with https://thekelleys.org.uk/gitweb/?p=dnsmasq.git;a=commit;h=10d8b5f001a34ff46b3a72575f3af64b065f8637 where as running the commit before it, I don't crash out dnsmasq. The result with 2.90 is basically identical (https://paste.openstack.org/show/823338/). In terms of the actual logging which results, all we get is along the lines of https://paste.opendev.org/show/823337/, at least with the install on jammy, where neutron is directly launching the dnsmasq processes. Looking at the contents of the commit where this starts (https://thekelleys.org.uk/gitweb/?p=dnsmasq.git;a=blobdiff;f=src/option.c;h=23cf058e8ad9fdcffe621ce0b92d626578d31f82;hp=e8926a49f46050eea9cba7cc90e12d67821f3a2f;hb=10d8b5f001a34ff46b3a72575f3af64b065f8637;hpb=ffd3ceb856eb3e1ff923cae95a2ddb889beab7c1), it is clear that the logic has been changed into a different which explains the error we are able to capture with stracce. I think ironic is going to have to pin to dnsmasq 2.85 for the time being, until we're able to get this fixed and fully understood in upstream dnsmasq since this is impacting numerous CI branches and test jobs. -- You received this bug notification because you are a member of Ubuntu Touch seeded packages, which is subscribed to dnsmasq in Ubuntu. https://bugs.launchpad.net/bugs/2026757 Title: dnsmasq on Ubuntu Jammy crashes on neutron-dhcp-agent updates Status in Ironic: Triaged Status in neutron: New Status in dnsmasq package in Ubuntu: Invalid Status in dnsmasq source package in Jammy: Incomplete Status in dnsmasq source package in Kinetic: Won't Fix Status in dnsmasq source package in Lunar: Invalid Status in dnsmasq source package in Mantic: Invalid Bug description: The Ironic project's CI has been having major blocking issues moving to utilizing Ubuntu Jammy and with some investigation we were able to isolate the issues down to the dhcp updates causing dnsmasq to crash on Ubuntu Jammy, which ships with dnsmasq 2.86. This issue sounds similar to an issue known about to the dnsmasq maintainers, where dnsmasq would crash with updates occurring due to configuration refresh[0]. This resulted in us upgrading dnsmasq to the version which ships with Ubuntu Lunar. Which was no better. Dnsmasq still crashed upon record updates for addresses and ports getting configuration added/changed/removed. We later downgraded to the version of dnsmasq shipped in Ubuntu Focal, and dnsmasq stopped crashing and appeared stable enough to utilize for CI purposes. ** Kernel log from Ubuntu Jammy Package ** [229798.876726] dnsmasq[81586]: segfault at 7c28 ip 00007f6e8313147e sp 00007fffb3d6f830 error 4 in libc.so.6[7f6e830b4000+195000] [229798.876745] Code: 98 13 00 e8 04 b9 ff ff 0f 1f 40 00 f3 0f 1e fa 48 85 ff 0f 84 bb 00 00 00 55 48 8d 77 f0 53 48 83 ec 18 48 8b 1d 92 39 17 00 <48> 8b 47 f8 64 8b 2b a8 02 75 57 48 8b 15 18 39 17 00 64 48 83 3a [229805.444912] dnsmasq[401428]: segfault at dce8 ip 00007fe63bf6a47e sp 00007ffdb105b440 error 4 in libc.so.6[7fe63beed000+195000] [229805.444933] Code: 98 13 00 e8 04 b9 ff ff 0f 1f 40 00 f3 0f 1e fa 48 85 ff 0f 84 bb 00 00 00 55 48 8d 77 f0 53 48 83 ec 18 48 8b 1d 92 39 17 00 <48> 8b 47 f8 64 8b 2b a8 02 75 57 48 8b 15 18 39 17 00 64 48 83 3a [230414.213448] dnsmasq[401538]: segfault at 78b8 ip 00007f12160e447e sp 00007ffed6ef2190 error 4 in libc.so.6[7f1216067000+195000] [230414.213467] Code: 98 13 00 e8 04 b9 ff ff 0f 1f 40 00 f3 0f 1e fa 48 85 ff 0f 84 bb 00 00 00 55 48 8d 77 f0 53 48 83 ec 18 48 8b 1d 92 39 17 00 <48> 8b 47 f8 64 8b 2b a8 02 75 57 48 8b 15 18 39 17 00 64 48 83 3a [230465.098989] dnsmasq[402665]: segfault at c378 ip 00007f81458f047e sp 00007fff0db334a0 error 4 in libc.so.6[7f8145873000+195000] [230465.099005] Code: 98 13 00 e8 04 b9 ff ff 0f 1f 40 00 f3 0f 1e fa 48 85 ff 0f 84 bb 00 00 00 55 48 8d 77 f0 53 48 83 ec 18 48 8b 1d 92 39 17 00 <48> 8b 47 f8 64 8b 2b a8 02 75 57 48 8b 15 18 39 17 00 64 48 83 3a [231787.247374] dnsmasq[402863]: segfault at 7318 ip 00007f3940b9147e sp 00007ffc8df4f010 error 4 in libc.so.6[7f3940b14000+195000] [231787.247392] Code: 98 13 00 e8 04 b9 ff ff 0f 1f 40 00 f3 0f 1e fa 48 85 ff 0f 84 bb 00 00 00 55 48 8d 77 f0 53 48 83 ec 18 48 8b 1d 92 39 17 00 <48> 8b 47 f8 64 8b 2b a8 02 75 57 48 8b 15 18 39 17 00 64 48 83 3a [231844.886399] dnsmasq[405182]: segfault at dc58 ip 00007f32a29e147e sp 00007ffddedd7480 error 4 in libc.so.6[7f32a2964000+195000] [231844.886420] Code: 98 13 00 e8 04 b9 ff ff 0f 1f 40 00 f3 0f 1e fa 48 85 ff 0f 84 bb 00 00 00 55 48 8d 77 f0 53 48 83 ec 18 48 8b 1d 92 39 17 00 <48> 8b 47 f8 64 8b 2b a8 02 75 57 48 8b 15 18 39 17 00 64 48 83 3a [234692.482154] dnsmasq[405289]: segfault at 67d8 ip 00007fab0c5c447e sp 00007fffd6fd8fa0 error 4 in libc.so.6[7fab0c547000+195000] [234692.482173] Code: 98 13 00 e8 04 b9 ff ff 0f 1f 40 00 f3 0f 1e fa 48 85 ff 0f 84 bb 00 00 00 55 48 8d 77 f0 53 48 83 ec 18 48 8b 1d 92 39 17 00 <48> 8b 47 f8 64 8b 2b a8 02 75 57 48 8b 15 18 39 17 00 64 48 83 3a ** Kernel log entries from Ubuntu Lunar package ** [234724.842339] dnsmasq[409843]: segfault at fffffffffffffffd ip 00007f35a147647e sp 00007ffd536038c0 error 5 in libc.so.6[7f35a13f9000+195000] [234724.842368] Code: 98 13 00 e8 04 b9 ff ff 0f 1f 40 00 f3 0f 1e fa 48 85 ff 0f 84 bb 00 00 00 55 48 8d 77 f0 53 48 83 ec 18 48 8b 1d 92 39 17 00 <48> 8b 47 f8 64 8b 2b a8 02 75 57 48 8b 15 18 39 17 00 64 48 83 3a [234784.918116] dnsmasq[410019]: segfault at fffffffffffffffd ip 00007f634233947e sp 00007fff33877f20 error 5 in libc.so.6[7f63422bc000+195000] [234784.918133] Code: 98 13 00 e8 04 b9 ff ff 0f 1f 40 00 f3 0f 1e fa 48 85 ff 0f 84 bb 00 00 00 55 48 8d 77 f0 53 48 83 ec 18 48 8b 1d 92 39 17 00 <48> 8b 47 f8 64 8b 2b a8 02 75 57 48 8b 15 18 39 17 00 64 48 83 3a [235022.163339] dnsmasq[410151]: segfault at fffffffffffffffd ip 00007f21dd37f47e sp 00007fff9bf416d0 error 5 in libc.so.6[7f21dd302000+195000] [235022.163362] Code: 98 13 00 e8 04 b9 ff ff 0f 1f 40 00 f3 0f 1e fa 48 85 ff 0f 84 bb 00 00 00 55 48 8d 77 f0 53 48 83 ec 18 48 8b 1d 92 39 17 00 <48> 8b 47 f8 64 8b 2b a8 02 75 57 48 8b 15 18 39 17 00 64 48 83 3a [235024.831325] dnsmasq[410445]: segfault at fffffffffffffffd ip 00007f7edf02147e sp 00007ffc4fb19cd0 error 5 in libc.so.6[7f7edefa4000+195000] [235024.831354] Code: 98 13 00 e8 04 b9 ff ff 0f 1f 40 00 f3 0f 1e fa 48 85 ff 0f 84 bb 00 00 00 55 48 8d 77 f0 53 48 83 ec 18 48 8b 1d 92 39 17 00 <48> 8b 47 f8 64 8b 2b a8 02 75 57 48 8b 15 18 39 17 00 64 48 83 3a [236052.793683] dnsmasq[410630]: segfault at fffffffffffffffd ip 00007f3046ca147e sp 00007ffe5583df50 error 5 in libc.so.6[7f3046c24000+195000] [236052.793704] Code: 98 13 00 e8 04 b9 ff ff 0f 1f 40 00 f3 0f 1e fa 48 85 ff 0f 84 bb 00 00 00 55 48 8d 77 f0 53 48 83 ec 18 48 8b 1d 92 39 17 00 <48> 8b 47 f8 64 8b 2b a8 02 75 57 48 8b 15 18 39 17 00 64 48 83 3a [236105.451351] dnsmasq[412107]: segfault at fffffffffffffffd ip 00007f4425bcd47e sp 00007fffd5337560 error 5 in libc.so.6[7f4425b50000+195000] [236105.451368] Code: 98 13 00 e8 04 b9 ff ff 0f 1f 40 00 f3 0f 1e fa 48 85 ff 0f 84 bb 00 00 00 55 48 8d 77 f0 53 48 83 ec 18 48 8b 1d 92 39 17 00 <48> 8b 47 f8 64 8b 2b a8 02 75 57 48 8b 15 18 39 17 00 64 48 83 3a ** The command line the process is launched with ** dnsmasq --no-hosts --pid- file=/opt/stack/data/neutron/dhcp/c1ca059e-350d-4d78-9330-600f7315c380/pid --dhcp- hostsfile=/opt/stack/data/neutron/dhcp/c1ca059e-350d-4d78-9330-600f7315c380/host --addn- hosts=/opt/stack/data/neutron/dhcp/c1ca059e-350d-4d78-9330-600f7315c380/addn_hosts --dhcp- optsfile=/opt/stack/data/neutron/dhcp/c1ca059e-350d-4d78-9330-600f7315c380/opts --dhcp- leasefile=/opt/stack/data/neutron/dhcp/c1ca059e-350d-4d78-9330-600f7315c380/leases --dhcp-match=set:ipxe,175 --dhcp-userclass=set:ipxe6,iPXE --local- service--bind-dynamic --dhcp- range=set:subnet-3c1445e7-6f7d-4e62-997f-627bc53da72c,10.1.0.0,static,255.255.255.192,86400s --dhcp-option-force=option:mtu,1380 --dhcp-lease-max=64 --conf- file=/dev/null --domain=openstacklocal ** Neutron Logging ** Jul 10 15:26:01 np0034614991 neutron-dhcp-agent[60941]: DEBUG neutron.agent.dhcp.agent [-] neutron.agent.dhcp.agent.DhcpAgentWithStateReport method _port_delete called with arguments ({'port_id': 'bdeaa43c-687c-4e60-a24e-3725d6353828', 'network_id': 'c1ca059e-350d-4d78-9330-600f7315c380', 'fixed_ips': [{'subnet_id': '3c1445e7-6f7d-4e62-997f-627bc53da72c', 'ip_address': '10.1.0.14'}, {'subnet_id': '54bc71f6-bff5-417d-9e4b-1f5f58ed6318', 'ip_address': 'fdd9:92b1:9e2c:0:5054:ff:fe44:5c9f'}], 'priority': 6},) {} {{(pid=60941) wrapper /usr/local/lib/python3.10/dist-packages/oslo_log/helpers.py:65}} Jul 10 15:26:01 np0034614991 neutron-dhcp-agent[60941]: DEBUG neutron.agent.dhcp.agent [-] Calling driver for network: c1ca059e-350d-4d78-9330-600f7315c380/seg=None action: reload_allocations {{(pid=60941) _call_driver /opt/stack/neutron/neutron/agent/dhcp/agent.py:246}} Jul 10 15:26:01 np0034614991 neutron-dhcp-agent[60941]: DEBUG oslo_concurrency.processutils [-] Running cmd (subprocess): ip netns exec qdhcp-c1ca059e-350d-4d78-9330-600f7315c380 dhcp_release tapbb6348d9-39 10.1.0.14 52:54:00:44:5c:9f {{(pid=78114) execute /usr/local/lib/python3.10/dist-packages/oslo_concurrency/processutils.py:384}} Jul 10 15:26:01 np0034614991 neutron-dhcp-agent[60941]: DEBUG oslo_concurrency.processutils [-] CMD "ip netns exec qdhcp-c1ca059e-350d-4d78-9330-600f7315c380 dhcp_release tapbb6348d9-39 10.1.0.14 52:54:00:44:5c:9f" returned: 0 in 0.011s {{(pid=78114) execute /usr/local/lib/python3.10/dist-packages/oslo_concurrency/processutils.py:422}} Jul 10 15:26:01 np0034614991 neutron-dhcp-agent[60941]: DEBUG oslo.privsep.daemon [-] privsep: reply[8a4f2794-3b63-4f8d-9604-53dd6a4a868c]: (4, ('', '')) {{(pid=78114) _call_back /usr/local/lib/python3.10/dist-packages/oslo_privsep/daemon.py:501}} Jul 10 15:26:01 np0034614991 neutron-dhcp-agent[60941]: DEBUG oslo_concurrency.processutils [-] Running cmd (subprocess): ip netns exec qdhcp-c1ca059e-350d-4d78-9330-600f7315c380 dhcp_release tapbb6348d9-39 10.1.0.14 52:54:00:44:5c:9f 01:52:54:00:44:5c:9f {{(pid=78114) execute /usr/local/lib/python3.10/dist-packages/oslo_concurrency/processutils.py:384}} Jul 10 15:26:01 np0034614991 neutron-dhcp-agent[60941]: DEBUG oslo_concurrency.processutils [-] CMD "ip netns exec qdhcp-c1ca059e-350d-4d78-9330-600f7315c380 dhcp_release tapbb6348d9-39 10.1.0.14 52:54:00:44:5c:9f 01:52:54:00:44:5c:9f" returned: 0 in 0.011s {{(pid=78114) execute /usr/local/lib/python3.10/dist-packages/oslo_concurrency/processutils.py:422}} Jul 10 15:26:01 np0034614991 neutron-dhcp-agent[60941]: DEBUG oslo.privsep.daemon [-] privsep: reply[33a91aed-bc58-48dd-b673-d4a4d5da54f6]: (4, ('', '')) {{(pid=78114) _call_back /usr/local/lib/python3.10/dist-packages/oslo_privsep/daemon.py:501}} Jul 10 15:26:02 np0034614991 neutron-dhcp-agent[60941]: DEBUG neutron.agent.linux.dhcp [-] Building host file: /opt/stack/data/neutron/dhcp/c1ca059e-350d-4d78-9330-600f7315c380/host {{(pid=60941) _output_hosts_file /opt/stack/neutron/neutron/agent/linux/dhcp.py:956}} Jul 10 15:26:02 np0034614991 neutron-dhcp-agent[60941]: DEBUG neutron.agent.linux.utils [-] Running command: ['env', 'LC_ALL=C', 'PATH=/sbin:/usr/sbin', 'dnsmasq', '--test', '--dhcp-host=tag:foo'] {{(pid=60941) create_process /opt/stack/neutron/neutron/agent/linux/utils.py:84}} Jul 10 15:26:02 np0034614991 neutron-dhcp-agent[60941]: DEBUG neutron.agent.linux.dhcp [-] Done building host file /opt/stack/data/neutron/dhcp/c1ca059e-350d-4d78-9330-600f7315c380/host {{(pid=60941) _output_hosts_file /opt/stack/neutron/neutron/agent/linux/dhcp.py:997}} Jul 10 15:26:02 np0034614991 neutron-dhcp-agent[60941]: DEBUG oslo.privsep.daemon [-] privsep: reply[f3dd1224-fe8c-4fb0-8113-699e779df64e]: (4, ('', '', 0)) {{(pid=62248) _call_back /usr/local/lib/python3.10/dist-packages/oslo_privsep/daemon.py:501}} Jul 10 15:27:00 np0034614991 neutron-dhcp-agent[60941]: DEBUG oslo_concurrency.lockutils [-] Acquiring lock "_check_child_processes" by "neutron.agent.linux.external_process.ProcessMonitor._check_child_processes" {{(pid=60941) inner /usr/local/lib/python3.10/dist-packages/oslo_concurrency/lockutils.py:404}} Jul 10 15:27:00 np0034614991 neutron-dhcp-agent[60941]: DEBUG oslo_concurrency.lockutils [-] Lock "_check_child_processes" acquired by "neutron.agent.linux.external_process.ProcessMonitor._check_child_processes" :: waited 0.001s {{(pid=60941) inner /usr/local/lib/python3.10/dist-packages/oslo_concurrency/lockutils.py:409}} Jul 10 15:27:00 np0034614991 neutron-dhcp-agent[60941]: ERROR neutron.agent.linux.external_process [-] dnsmasq for dhcp with uuid c1ca059e-350d-4d78-9330-600f7315c380 not found. The process should not have died Jul 10 15:27:00 np0034614991 neutron-dhcp-agent[60941]: WARNING neutron.agent.linux.external_process [-] Respawning dnsmasq for uuid c1ca059e-350d-4d78-9330-600f7315c380 Jul 10 15:27:00 np0034614991 neutron-dhcp-agent[60941]: DEBUG neutron.agent.linux.utils [-] Running command (rootwrap daemon): ['ip', 'netns', 'exec', 'qdhcp-c1ca059e-350d-4d78-9330-600f7315c380', 'env', 'PROCESS_TAG=dnsmasq-c1ca059e-350d-4d78-9330-600f7315c380', 'dnsmasq', '--no-hosts', '', '--pid-file=/opt/stack/data/neutron/dhcp/c1ca059e-350d-4d78-9330-600f7315c380/pid', '--dhcp-hostsfile=/opt/stack/data/neutron/dhcp/c1ca059e-350d-4d78-9330-600f7315c380/host', '--addn-hosts=/opt/stack/data/neutron/dhcp/c1ca059e-350d-4d78-9330-600f7315c380/addn_hosts', '--dhcp-optsfile=/opt/stack/data/neutron/dhcp/c1ca059e-350d-4d78-9330-600f7315c380/opts', '--dhcp-leasefile=/opt/stack/data/neutron/dhcp/c1ca059e-350d-4d78-9330-600f7315c380/leases', '--dhcp-match=set:ipxe,175', '--dhcp-userclass=set:ipxe6,iPXE', '--local-service', '--bind-dynamic', '--dhcp-range=set:subnet-3c1445e7-6f7d-4e62-997f-627bc53da72c,10.1.0.0,static,255.255.255.192,86400s', '--dhcp-option-force=option:mtu,1380', '--dhcp-l ease-max=64', '--conf-file=/dev/null', '--domain=openstacklocal'] {{(pid=60941) execute_rootwrap_daemon /opt/stack/neutron/neutron/agent/linux/utils.py:108}} We don't believe this is a neutron bug, at least outright, but suspect neutron is also likely encountering this issue as well, at least with any sort of exhaustive test jobs. Most of Ironic's one job tests would pass with this dnsmasq, it was only where we continually ran new test scenarios that we would see this issue crop up and cause failures. In the mean time, the ironic project will likely downgrade dnsmasq to unblock it's CI. [0]: https://lists.thekelleys.org.uk/pipermail/dnsmasq- discuss/2022q3/016562.html To manage notifications about this bug go to: https://bugs.launchpad.net/ironic/+bug/2026757/+subscriptions -- Mailing list: https://launchpad.net/~touch-packages Post to : touch-packages@lists.launchpad.net Unsubscribe : https://launchpad.net/~touch-packages More help : https://help.launchpad.net/ListHelp