Indeed, and unfortunately it seems like I've been able to identify the
root cause which is still present in 2.90. What changed, we're no longer
seeing a segfault which was the tell-tale sign we were looking for, but
instead we just see it quietly exit with no trace. This led me to
getting a setup where I could reproduce the issue, which we've seen
trigger on Ironic's "standalone" jobs as they exercise a number of
different scenarios involving port/address updates quite a bit, and I
just sat with strace attached to the dnsmasq process.
alarm(77377) = 77377
poll([{fd=4, events=POLLIN}, {fd=5, events=POLLIN}, {fd=6, events=POLLIN},
{fd=7, events=POLLIN}, {fd=8, events=POLLIN}, {fd=9, events=POLLIN}, {fd=10,
events=POLLIN}, {fd=11, events=POLLIN}, {fd=12, events=POLLIN}, {fd=13,
events=POLLIN}, {fd=14, events=POLLIN}, {fd=15, events=POLLIN}, {fd=16,
events=POLLIN}, {fd=17, events=POLLIN}], 14, -1) = ? ERESTART_RESTARTBLOCK
(Interrupted by signal)
--- SIGHUP {si_signo=SIGHUP, si_code=SI_USER, si_pid=236696, si_uid=0} ---
getpid() = 235993
writev(18, [{iov_base="\1\0\0\0\0\0\0\0\0\0\0\0", iov_len=12}], 1) = 12
rt_sigreturn({mask=[]}) = -1 EINTR (Interrupted system call)
poll([{fd=4, events=POLLIN}, {fd=5, events=POLLIN}, {fd=6, events=POLLIN},
{fd=7, events=POLLIN}, {fd=8, events=POLLIN}, {fd=9, events=POLLIN}, {fd=10,
events=POLLIN}, {fd=11, events=POLLIN}, {fd=12, events=POLLIN}, {fd=13,
events=POLLIN}, {fd=14, events=POLLIN}, {fd=15, events=POLLIN}, {fd=16,
events=POLLIN}, {fd=17, events=POLLIN}], 14, -1) = 1 ([{fd=17, revents=POLLIN}])
read(17, "\1\0\0\0\0\0\0\0\0\0\0\0", 12) = 12
newfstatat(AT_FDCWD,
"/opt/stack/data/neutron/dhcp/77cabae0-26bf-4374-997a-781947f2e5b2/addn_hosts",
{st_mode=S_IFREG|0644, st_size=1268, ...}, 0) = 0
openat(AT_FDCWD,
"/opt/stack/data/neutron/dhcp/77cabae0-26bf-4374-997a-781947f2e5b2/addn_hosts",
O_RDONLY) = 20
newfstatat(20, "", {st_mode=S_IFREG|0644, st_size=1268, ...}, AT_EMPTY_PATH) = 0
read(20, "10.1.0.1\thost-10-1-0-1.openstack"..., 4096) = 1268
read(20, "", 4096) = 0
close(20) = 0
getpid() = 235993
newfstatat(AT_FDCWD, "/etc/localtime", {st_mode=S_IFREG|0644, st_size=114,
...}, 0) = 0
getpid() = 235993
write(19, "<30>Feb 29 22:35:30 dnsmasq[2359"..., 133) = 133
writev(2, [{iov_base="free(): invalid pointer", iov_len=23}, {iov_base="\n",
iov_len=1}], 2) = 24
mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) =
0x7fec31710000
rt_sigprocmask(SIG_UNBLOCK, [ABRT], NULL, 8) = 0
gettid() = 235993
getpid() = 235993
tgkill(235993, 235993, SIGABRT) = 0
--- SIGABRT {si_signo=SIGABRT, si_code=SI_TKILL, si_pid=235993, si_uid=65534}
---
+++ killed by SIGABRT +++
The above was captured from a dnsmasq install with
https://thekelleys.org.uk/gitweb/?p=dnsmasq.git;a=commit;h=10d8b5f001a34ff46b3a72575f3af64b065f8637
where as running the commit before it, I don't crash out dnsmasq. The result
with 2.90 is basically identical (https://paste.openstack.org/show/823338/).
In terms of the actual logging which results, all we get is along the
lines of https://paste.opendev.org/show/823337/, at least with the
install on jammy, where neutron is directly launching the dnsmasq
processes. Looking at the contents of the commit where this starts
(https://thekelleys.org.uk/gitweb/?p=dnsmasq.git;a=blobdiff;f=src/option.c;h=23cf058e8ad9fdcffe621ce0b92d626578d31f82;hp=e8926a49f46050eea9cba7cc90e12d67821f3a2f;hb=10d8b5f001a34ff46b3a72575f3af64b065f8637;hpb=ffd3ceb856eb3e1ff923cae95a2ddb889beab7c1),
it is clear that the logic has been changed into a different which
explains the error we are able to capture with stracce.
I think ironic is going to have to pin to dnsmasq 2.85 for the time
being, until we're able to get this fixed and fully understood in
upstream dnsmasq since this is impacting numerous CI branches and test
jobs.
--
You received this bug notification because you are a member of Ubuntu
Touch seeded packages, which is subscribed to dnsmasq in Ubuntu.
https://bugs.launchpad.net/bugs/2026757
Title:
dnsmasq on Ubuntu Jammy crashes on neutron-dhcp-agent updates
Status in Ironic:
Triaged
Status in neutron:
New
Status in dnsmasq package in Ubuntu:
Invalid
Status in dnsmasq source package in Jammy:
Incomplete
Status in dnsmasq source package in Kinetic:
Won't Fix
Status in dnsmasq source package in Lunar:
Invalid
Status in dnsmasq source package in Mantic:
Invalid
Bug description:
The Ironic project's CI has been having major blocking issues moving
to utilizing Ubuntu Jammy and with some investigation we were able to
isolate the issues down to the dhcp updates causing dnsmasq to crash
on Ubuntu Jammy, which ships with dnsmasq 2.86. This issue sounds
similar to an issue known about to the dnsmasq maintainers, where
dnsmasq would crash with updates occurring due to configuration
refresh[0].
This resulted in us upgrading dnsmasq to the version which ships with
Ubuntu Lunar.
Which was no better. Dnsmasq still crashed upon record updates for
addresses and ports getting configuration added/changed/removed.
We later downgraded to the version of dnsmasq shipped in Ubuntu Focal,
and dnsmasq stopped crashing and appeared stable enough to utilize for
CI purposes.
** Kernel log from Ubuntu Jammy Package **
[229798.876726] dnsmasq[81586]: segfault at 7c28 ip 00007f6e8313147e sp
00007fffb3d6f830 error 4 in libc.so.6[7f6e830b4000+195000]
[229798.876745] Code: 98 13 00 e8 04 b9 ff ff 0f 1f 40 00 f3 0f 1e fa 48 85
ff 0f 84 bb 00 00 00 55 48 8d 77 f0 53 48 83 ec 18 48 8b 1d 92 39 17 00 <48> 8b
47 f8 64 8b 2b a8 02 75 57 48 8b 15 18 39 17 00 64 48 83 3a
[229805.444912] dnsmasq[401428]: segfault at dce8 ip 00007fe63bf6a47e sp
00007ffdb105b440 error 4 in libc.so.6[7fe63beed000+195000]
[229805.444933] Code: 98 13 00 e8 04 b9 ff ff 0f 1f 40 00 f3 0f 1e fa 48 85
ff 0f 84 bb 00 00 00 55 48 8d 77 f0 53 48 83 ec 18 48 8b 1d 92 39 17 00 <48> 8b
47 f8 64 8b 2b a8 02 75 57 48 8b 15 18 39 17 00 64 48 83 3a
[230414.213448] dnsmasq[401538]: segfault at 78b8 ip 00007f12160e447e sp
00007ffed6ef2190 error 4 in libc.so.6[7f1216067000+195000]
[230414.213467] Code: 98 13 00 e8 04 b9 ff ff 0f 1f 40 00 f3 0f 1e fa 48 85
ff 0f 84 bb 00 00 00 55 48 8d 77 f0 53 48 83 ec 18 48 8b 1d 92 39 17 00 <48> 8b
47 f8 64 8b 2b a8 02 75 57 48 8b 15 18 39 17 00 64 48 83 3a
[230465.098989] dnsmasq[402665]: segfault at c378 ip 00007f81458f047e sp
00007fff0db334a0 error 4 in libc.so.6[7f8145873000+195000]
[230465.099005] Code: 98 13 00 e8 04 b9 ff ff 0f 1f 40 00 f3 0f 1e fa 48 85
ff 0f 84 bb 00 00 00 55 48 8d 77 f0 53 48 83 ec 18 48 8b 1d 92 39 17 00 <48> 8b
47 f8 64 8b 2b a8 02 75 57 48 8b 15 18 39 17 00 64 48 83 3a
[231787.247374] dnsmasq[402863]: segfault at 7318 ip 00007f3940b9147e sp
00007ffc8df4f010 error 4 in libc.so.6[7f3940b14000+195000]
[231787.247392] Code: 98 13 00 e8 04 b9 ff ff 0f 1f 40 00 f3 0f 1e fa 48 85
ff 0f 84 bb 00 00 00 55 48 8d 77 f0 53 48 83 ec 18 48 8b 1d 92 39 17 00 <48> 8b
47 f8 64 8b 2b a8 02 75 57 48 8b 15 18 39 17 00 64 48 83 3a
[231844.886399] dnsmasq[405182]: segfault at dc58 ip 00007f32a29e147e sp
00007ffddedd7480 error 4 in libc.so.6[7f32a2964000+195000]
[231844.886420] Code: 98 13 00 e8 04 b9 ff ff 0f 1f 40 00 f3 0f 1e fa 48 85
ff 0f 84 bb 00 00 00 55 48 8d 77 f0 53 48 83 ec 18 48 8b 1d 92 39 17 00 <48> 8b
47 f8 64 8b 2b a8 02 75 57 48 8b 15 18 39 17 00 64 48 83 3a
[234692.482154] dnsmasq[405289]: segfault at 67d8 ip 00007fab0c5c447e sp
00007fffd6fd8fa0 error 4 in libc.so.6[7fab0c547000+195000]
[234692.482173] Code: 98 13 00 e8 04 b9 ff ff 0f 1f 40 00 f3 0f 1e fa 48 85
ff 0f 84 bb 00 00 00 55 48 8d 77 f0 53 48 83 ec 18 48 8b 1d 92 39 17 00 <48> 8b
47 f8 64 8b 2b a8 02 75 57 48 8b 15 18 39 17 00 64 48 83 3a
** Kernel log entries from Ubuntu Lunar package **
[234724.842339] dnsmasq[409843]: segfault at fffffffffffffffd ip
00007f35a147647e sp 00007ffd536038c0 error 5 in libc.so.6[7f35a13f9000+195000]
[234724.842368] Code: 98 13 00 e8 04 b9 ff ff 0f 1f 40 00 f3 0f 1e fa 48 85
ff 0f 84 bb 00 00 00 55 48 8d 77 f0 53 48 83 ec 18 48 8b 1d 92 39 17 00 <48> 8b
47 f8 64 8b 2b a8 02 75 57 48 8b 15 18 39 17 00 64 48 83 3a
[234784.918116] dnsmasq[410019]: segfault at fffffffffffffffd ip
00007f634233947e sp 00007fff33877f20 error 5 in libc.so.6[7f63422bc000+195000]
[234784.918133] Code: 98 13 00 e8 04 b9 ff ff 0f 1f 40 00 f3 0f 1e fa 48 85
ff 0f 84 bb 00 00 00 55 48 8d 77 f0 53 48 83 ec 18 48 8b 1d 92 39 17 00 <48> 8b
47 f8 64 8b 2b a8 02 75 57 48 8b 15 18 39 17 00 64 48 83 3a
[235022.163339] dnsmasq[410151]: segfault at fffffffffffffffd ip
00007f21dd37f47e sp 00007fff9bf416d0 error 5 in libc.so.6[7f21dd302000+195000]
[235022.163362] Code: 98 13 00 e8 04 b9 ff ff 0f 1f 40 00 f3 0f 1e fa 48 85
ff 0f 84 bb 00 00 00 55 48 8d 77 f0 53 48 83 ec 18 48 8b 1d 92 39 17 00 <48> 8b
47 f8 64 8b 2b a8 02 75 57 48 8b 15 18 39 17 00 64 48 83 3a
[235024.831325] dnsmasq[410445]: segfault at fffffffffffffffd ip
00007f7edf02147e sp 00007ffc4fb19cd0 error 5 in libc.so.6[7f7edefa4000+195000]
[235024.831354] Code: 98 13 00 e8 04 b9 ff ff 0f 1f 40 00 f3 0f 1e fa 48 85
ff 0f 84 bb 00 00 00 55 48 8d 77 f0 53 48 83 ec 18 48 8b 1d 92 39 17 00 <48> 8b
47 f8 64 8b 2b a8 02 75 57 48 8b 15 18 39 17 00 64 48 83 3a
[236052.793683] dnsmasq[410630]: segfault at fffffffffffffffd ip
00007f3046ca147e sp 00007ffe5583df50 error 5 in libc.so.6[7f3046c24000+195000]
[236052.793704] Code: 98 13 00 e8 04 b9 ff ff 0f 1f 40 00 f3 0f 1e fa 48 85
ff 0f 84 bb 00 00 00 55 48 8d 77 f0 53 48 83 ec 18 48 8b 1d 92 39 17 00 <48> 8b
47 f8 64 8b 2b a8 02 75 57 48 8b 15 18 39 17 00 64 48 83 3a
[236105.451351] dnsmasq[412107]: segfault at fffffffffffffffd ip
00007f4425bcd47e sp 00007fffd5337560 error 5 in libc.so.6[7f4425b50000+195000]
[236105.451368] Code: 98 13 00 e8 04 b9 ff ff 0f 1f 40 00 f3 0f 1e fa 48 85
ff 0f 84 bb 00 00 00 55 48 8d 77 f0 53 48 83 ec 18 48 8b 1d 92 39 17 00 <48> 8b
47 f8 64 8b 2b a8 02 75 57 48 8b 15 18 39 17 00 64 48 83 3a
** The command line the process is launched with **
dnsmasq --no-hosts --pid-
file=/opt/stack/data/neutron/dhcp/c1ca059e-350d-4d78-9330-600f7315c380/pid
--dhcp-
hostsfile=/opt/stack/data/neutron/dhcp/c1ca059e-350d-4d78-9330-600f7315c380/host
--addn-
hosts=/opt/stack/data/neutron/dhcp/c1ca059e-350d-4d78-9330-600f7315c380/addn_hosts
--dhcp-
optsfile=/opt/stack/data/neutron/dhcp/c1ca059e-350d-4d78-9330-600f7315c380/opts
--dhcp-
leasefile=/opt/stack/data/neutron/dhcp/c1ca059e-350d-4d78-9330-600f7315c380/leases
--dhcp-match=set:ipxe,175 --dhcp-userclass=set:ipxe6,iPXE --local-
service--bind-dynamic --dhcp-
range=set:subnet-3c1445e7-6f7d-4e62-997f-627bc53da72c,10.1.0.0,static,255.255.255.192,86400s
--dhcp-option-force=option:mtu,1380 --dhcp-lease-max=64 --conf-
file=/dev/null --domain=openstacklocal
** Neutron Logging **
Jul 10 15:26:01 np0034614991 neutron-dhcp-agent[60941]: DEBUG
neutron.agent.dhcp.agent [-] neutron.agent.dhcp.agent.DhcpAgentWithStateReport
method _port_delete called with arguments ({'port_id':
'bdeaa43c-687c-4e60-a24e-3725d6353828', 'network_id':
'c1ca059e-350d-4d78-9330-600f7315c380', 'fixed_ips': [{'subnet_id':
'3c1445e7-6f7d-4e62-997f-627bc53da72c', 'ip_address': '10.1.0.14'},
{'subnet_id': '54bc71f6-bff5-417d-9e4b-1f5f58ed6318', 'ip_address':
'fdd9:92b1:9e2c:0:5054:ff:fe44:5c9f'}], 'priority': 6},) {} {{(pid=60941)
wrapper /usr/local/lib/python3.10/dist-packages/oslo_log/helpers.py:65}}
Jul 10 15:26:01 np0034614991 neutron-dhcp-agent[60941]: DEBUG
neutron.agent.dhcp.agent [-] Calling driver for network:
c1ca059e-350d-4d78-9330-600f7315c380/seg=None action: reload_allocations
{{(pid=60941) _call_driver /opt/stack/neutron/neutron/agent/dhcp/agent.py:246}}
Jul 10 15:26:01 np0034614991 neutron-dhcp-agent[60941]: DEBUG
oslo_concurrency.processutils [-] Running cmd (subprocess): ip netns exec
qdhcp-c1ca059e-350d-4d78-9330-600f7315c380 dhcp_release tapbb6348d9-39
10.1.0.14 52:54:00:44:5c:9f {{(pid=78114) execute
/usr/local/lib/python3.10/dist-packages/oslo_concurrency/processutils.py:384}}
Jul 10 15:26:01 np0034614991 neutron-dhcp-agent[60941]: DEBUG
oslo_concurrency.processutils [-] CMD "ip netns exec
qdhcp-c1ca059e-350d-4d78-9330-600f7315c380 dhcp_release tapbb6348d9-39
10.1.0.14 52:54:00:44:5c:9f" returned: 0 in 0.011s {{(pid=78114) execute
/usr/local/lib/python3.10/dist-packages/oslo_concurrency/processutils.py:422}}
Jul 10 15:26:01 np0034614991 neutron-dhcp-agent[60941]: DEBUG
oslo.privsep.daemon [-] privsep: reply[8a4f2794-3b63-4f8d-9604-53dd6a4a868c]:
(4, ('', '')) {{(pid=78114) _call_back
/usr/local/lib/python3.10/dist-packages/oslo_privsep/daemon.py:501}}
Jul 10 15:26:01 np0034614991 neutron-dhcp-agent[60941]: DEBUG
oslo_concurrency.processutils [-] Running cmd (subprocess): ip netns exec
qdhcp-c1ca059e-350d-4d78-9330-600f7315c380 dhcp_release tapbb6348d9-39
10.1.0.14 52:54:00:44:5c:9f 01:52:54:00:44:5c:9f {{(pid=78114) execute
/usr/local/lib/python3.10/dist-packages/oslo_concurrency/processutils.py:384}}
Jul 10 15:26:01 np0034614991 neutron-dhcp-agent[60941]: DEBUG
oslo_concurrency.processutils [-] CMD "ip netns exec
qdhcp-c1ca059e-350d-4d78-9330-600f7315c380 dhcp_release tapbb6348d9-39
10.1.0.14 52:54:00:44:5c:9f 01:52:54:00:44:5c:9f" returned: 0 in 0.011s
{{(pid=78114) execute
/usr/local/lib/python3.10/dist-packages/oslo_concurrency/processutils.py:422}}
Jul 10 15:26:01 np0034614991 neutron-dhcp-agent[60941]: DEBUG
oslo.privsep.daemon [-] privsep: reply[33a91aed-bc58-48dd-b673-d4a4d5da54f6]:
(4, ('', '')) {{(pid=78114) _call_back
/usr/local/lib/python3.10/dist-packages/oslo_privsep/daemon.py:501}}
Jul 10 15:26:02 np0034614991 neutron-dhcp-agent[60941]: DEBUG
neutron.agent.linux.dhcp [-] Building host file:
/opt/stack/data/neutron/dhcp/c1ca059e-350d-4d78-9330-600f7315c380/host
{{(pid=60941) _output_hosts_file
/opt/stack/neutron/neutron/agent/linux/dhcp.py:956}}
Jul 10 15:26:02 np0034614991 neutron-dhcp-agent[60941]: DEBUG
neutron.agent.linux.utils [-] Running command: ['env', 'LC_ALL=C',
'PATH=/sbin:/usr/sbin', 'dnsmasq', '--test', '--dhcp-host=tag:foo']
{{(pid=60941) create_process
/opt/stack/neutron/neutron/agent/linux/utils.py:84}}
Jul 10 15:26:02 np0034614991 neutron-dhcp-agent[60941]: DEBUG
neutron.agent.linux.dhcp [-] Done building host file
/opt/stack/data/neutron/dhcp/c1ca059e-350d-4d78-9330-600f7315c380/host
{{(pid=60941) _output_hosts_file
/opt/stack/neutron/neutron/agent/linux/dhcp.py:997}}
Jul 10 15:26:02 np0034614991 neutron-dhcp-agent[60941]: DEBUG
oslo.privsep.daemon [-] privsep: reply[f3dd1224-fe8c-4fb0-8113-699e779df64e]:
(4, ('', '', 0)) {{(pid=62248) _call_back
/usr/local/lib/python3.10/dist-packages/oslo_privsep/daemon.py:501}}
Jul 10 15:27:00 np0034614991 neutron-dhcp-agent[60941]: DEBUG
oslo_concurrency.lockutils [-] Acquiring lock "_check_child_processes" by
"neutron.agent.linux.external_process.ProcessMonitor._check_child_processes"
{{(pid=60941) inner
/usr/local/lib/python3.10/dist-packages/oslo_concurrency/lockutils.py:404}}
Jul 10 15:27:00 np0034614991 neutron-dhcp-agent[60941]: DEBUG
oslo_concurrency.lockutils [-] Lock "_check_child_processes" acquired by
"neutron.agent.linux.external_process.ProcessMonitor._check_child_processes" ::
waited 0.001s {{(pid=60941) inner
/usr/local/lib/python3.10/dist-packages/oslo_concurrency/lockutils.py:409}}
Jul 10 15:27:00 np0034614991 neutron-dhcp-agent[60941]: ERROR
neutron.agent.linux.external_process [-] dnsmasq for dhcp with uuid
c1ca059e-350d-4d78-9330-600f7315c380 not found. The process should not have died
Jul 10 15:27:00 np0034614991 neutron-dhcp-agent[60941]: WARNING
neutron.agent.linux.external_process [-] Respawning dnsmasq for uuid
c1ca059e-350d-4d78-9330-600f7315c380
Jul 10 15:27:00 np0034614991 neutron-dhcp-agent[60941]: DEBUG
neutron.agent.linux.utils [-] Running command (rootwrap daemon): ['ip',
'netns', 'exec', 'qdhcp-c1ca059e-350d-4d78-9330-600f7315c380', 'env',
'PROCESS_TAG=dnsmasq-c1ca059e-350d-4d78-9330-600f7315c380', 'dnsmasq',
'--no-hosts', '',
'--pid-file=/opt/stack/data/neutron/dhcp/c1ca059e-350d-4d78-9330-600f7315c380/pid',
'--dhcp-hostsfile=/opt/stack/data/neutron/dhcp/c1ca059e-350d-4d78-9330-600f7315c380/host',
'--addn-hosts=/opt/stack/data/neutron/dhcp/c1ca059e-350d-4d78-9330-600f7315c380/addn_hosts',
'--dhcp-optsfile=/opt/stack/data/neutron/dhcp/c1ca059e-350d-4d78-9330-600f7315c380/opts',
'--dhcp-leasefile=/opt/stack/data/neutron/dhcp/c1ca059e-350d-4d78-9330-600f7315c380/leases',
'--dhcp-match=set:ipxe,175', '--dhcp-userclass=set:ipxe6,iPXE',
'--local-service', '--bind-dynamic',
'--dhcp-range=set:subnet-3c1445e7-6f7d-4e62-997f-627bc53da72c,10.1.0.0,static,255.255.255.192,86400s',
'--dhcp-option-force=option:mtu,1380', '--dhcp-l
ease-max=64', '--conf-file=/dev/null', '--domain=openstacklocal']
{{(pid=60941) execute_rootwrap_daemon
/opt/stack/neutron/neutron/agent/linux/utils.py:108}}
We don't believe this is a neutron bug, at least outright, but suspect
neutron is also likely encountering this issue as well, at least with any sort
of exhaustive test jobs. Most of Ironic's one job tests would pass with this
dnsmasq, it was only where we continually ran new test scenarios that we would
see this issue crop up and cause failures.
In the mean time, the ironic project will likely downgrade dnsmasq to
unblock it's CI.
[0]: https://lists.thekelleys.org.uk/pipermail/dnsmasq-
discuss/2022q3/016562.html
To manage notifications about this bug go to:
https://bugs.launchpad.net/ironic/+bug/2026757/+subscriptions
--
Mailing list: https://launchpad.net/~touch-packages
Post to : [email protected]
Unsubscribe : https://launchpad.net/~touch-packages
More help : https://help.launchpad.net/ListHelp