Hi all,
We have started getting ‘Address already in use’ errors when starting dnsmasq
after the previous instance of the process was killed with kill -9. Armando
spotted it today in the logs for https://review.openstack.org/#/c/377626/, but
according to logstash it is an error we have seen before (the earliest
occurrence I see is 9/20). For example:
http://logs.openstack.org/26/377626/1/check/gate-tempest-dsvm-neutron-full-ubuntu-xenial/b6953d4/logs/screen-q-dhcp.txt.gz
Assuming I understand the flow of the failure, it runs as follows:
- sync_state starts dnsmasq per network;
- after the agent lock is released, some other notification event
(port_update/subnet_update/...) triggers a restart of one of the processes;
- the restart is done not via reload_allocations (-SIGHUP) but through
restart/disable (kill -9);
- once the old dnsmasq is killed with -9, we attempt to start a new process
with new config files generated and fail with: “dnsmasq: failed to create
listening socket for 10.1.15.242: Address already in use”
- surprisingly, after several failed attempts, the new process starts
successfully a few seconds later and runs fine.
It looks like once we kill the process with -9, it may hold on to the socket
resource for some time, which clashes with the new process we try to spawn.
It’s a bit weird because dnsmasq should have set SO_REUSEADDR on the socket,
so a new process should have started just fine.
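To illustrate the symptom (not a diagnosis of the gate failure), here is a
minimal Python sketch, assuming Linux TCP socket semantics. The names try_bind
and demo are made up for the example and are not neutron or dnsmasq code. It
shows that SO_REUSEADDR alone does not let a second listener bind while another
live socket still listens on the same address and port, and that the bind
succeeds again once the first socket is really closed:

```python
import errno
import socket


def try_bind(addr):
    """Try to bind and listen on addr with SO_REUSEADDR set.

    Returns (socket, None) on success or (None, errno) on failure.
    Hypothetical helper for illustration only.
    """
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    try:
        sock.bind(addr)
        sock.listen(1)
    except OSError as exc:
        sock.close()
        return None, exc.errno
    return sock, None


def demo():
    """Return the errno seen while the old listener is alive, and after."""
    # Port 0 asks the kernel for a free port; this stands in for the
    # address the "old" process is listening on.
    old, _ = try_bind(("127.0.0.1", 0))
    addr = old.getsockname()

    # The "new" process tries the same address while the old socket is open.
    new, err_while_alive = try_bind(addr)
    if new is not None:
        new.close()

    # Once the old socket is closed, the same bind goes through.
    old.close()
    new, err_after_close = try_bind(addr)
    if new is not None:
        new.close()
    return err_while_alive, err_after_close


if __name__ == "__main__":
    alive, after = demo()
    print("while old listener alive:", errno.errorcode.get(alive, alive))
    print("after old listener closed:", after)
```

If something similar is happening in the gate, it would suggest the old
dnsmasq socket is still open when the new process starts, but that is only a
guess at this point.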
Lately, we landed several patches that touched the reload logic that the DHCP
agent applies on notifications. The suspicious ones in this context are:
- https://review.openstack.org/#/c/372595/ - note it requests ‘disable’
(-9) where it was using ‘reload_allocations’ (-SIGHUP) before, and it also
does not unplug the port on lease release (maybe once we rip out the
device, the address clash with the old dnsmasq state goes away even though
the ‘new’ port will use the same address?).
- https://review.openstack.org/#/c/372236/6 - we were requesting
reload_allocations in some cases before, and now we put the network into
the resync queue.
There were other related changes lately; you can check the history of Kevin’s
changes for the branch, which should capture most of them.
I wonder whether we hit some long-standing restart issue with dnsmasq here
that was just never triggered before because we were not calling kill -9 as
eagerly as we do now.
Note: Jakub Libosvar validated that running ‘kill -9 && dnsmasq’ in a loop
does NOT result in the failure we see in gate logs.
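For anyone who wants to repeat that kind of check, a loop along these lines
could be used. This is only a sketch of the idea, not the script Jakub ran;
kill_and_respawn and its parameters are hypothetical names. It starts a
command, sends it SIGKILL, starts it again without waiting for the old process
to be reaped, and counts how often the freshly started process exits on its
own shortly afterwards (which is what a failed bind would look like for a
foreground process):

```python
import signal
import subprocess
import sys
import time


def kill_and_respawn(cmd, iterations=10, settle=0.2):
    """Repeatedly SIGKILL cmd and restart it at once; count early exits.

    Hypothetical helper for illustration only. cmd must stay in the
    foreground, otherwise the early-exit check is meaningless.
    """
    failures = 0
    proc = subprocess.Popen(cmd)
    try:
        for _ in range(iterations):
            time.sleep(settle)  # give the process time to set up
            if proc.poll() is not None:
                # The process exited by itself, e.g. it could not bind.
                failures += 1
            else:
                proc.send_signal(signal.SIGKILL)  # same as kill -9
            # Start the replacement before reaping the old process, to
            # keep the window between kill and start as small as possible.
            old, proc = proc, subprocess.Popen(cmd)
            old.wait()
    finally:
        proc.kill()
        proc.wait()
    return failures


if __name__ == "__main__" and len(sys.argv) > 1:
    # Usage: python respawn.py <command> [args...]
    print("early exits:", kill_and_respawn(sys.argv[1:]))
```

The command to pass is an assumption on my side: something like dnsmasq with
--keep-in-foreground, --bind-interfaces and --listen-address set to a test
address, run as root in a scratch namespace. It does not reproduce whatever
else the agent does around the restart (config regeneration, port plugging),
which may be exactly why the simple loop does not fail.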
We need to understand what’s going on with the failure, and come up with a
plan for Newton. Either we revert the suspected patches, as I believe Armando
proposed before, though it’s not clear how far back we would have to go; or we
come up with some smart fix, which I don’t immediately see.
I will be on vacation tomorrow, though I will check the email thread to see
if we have a plan to act on. I really hope folks give the issue priority,
since it seems we have buried ourselves under a pile of interleaved patches
and no longer have a clear view of how to dig ourselves out.
Cheers,
Ihar
__________________________________________________________________________
OpenStack Development Mailing List (not for usage questions)