[openstack-dev] dhcp 'Address already in use' errors when trying to start a dnsmasq

Ihar Hrachyshka Tue, 27 Sep 2016 11:25:29 -0700

Hi all,

so we started getting ‘Address already in use’ when trying to start dnsmasq after the previous instance of the process is killed with kill -9. Armando spotted it today in logs for: https://review.openstack.org/#/c/377626/ but as per logstash it seems like an error we saw before (the earliest I see is 9/20), e.g.:


http://logs.openstack.org/26/377626/1/check/gate-tempest-dsvm-neutron-full-ubuntu-xenial/b6953d4/logs/screen-q-dhcp.txt.gz

Assuming I understand the flow of the failure, it runs as follows:

- sync_state starts dnsmasq per network;

- after agent lock is freed, some other notification event (port_update/subnet_update/...) triggers restart for one of the processes;

- the restart is done not via reload_allocations (-SIGHUP) but through restart/disable (kill -9);

- once the old dnsmasq is killed with -9, we attempt to start a new process with new config files generated and fail with: “dnsmasq: failed to create listening socket for 10.1.15.242: Address already in use”;

- surprisingly, after several failed attempts to start the process, it succeeds after a bunch of seconds and runs fine.

It looks like once we kill the process with -9, it may hold on to the socket resource for some time and may clash with the new process we try to spawn. It’s a bit weird because dnsmasq should have set SO_REUSEADDR for the socket, so a new process should have started just fine.
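To make the socket clash concrete, here is a minimal sketch using plain Python sockets on the loopback address. It is not dnsmasq or agent code; the helper names (make_listener, rebind_demo) are made up for illustration, and it assumes Linux semantics. It shows that SO_REUSEADDR alone does not let a second TCP listener bind the same address and port while the first one still holds it, but that the bind goes through as soon as the first socket is closed.

```python
import errno
import socket


def make_listener(port=0):
    """Bind a TCP listener on loopback with SO_REUSEADDR set (hypothetical helper)."""
    s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    s.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    try:
        s.bind(("127.0.0.1", port))
        s.listen(1)
    except OSError:
        s.close()
        raise
    return s


def rebind_demo():
    """Return (clash_while_old_alive, bind_ok_after_close)."""
    old = make_listener()            # port 0: let the kernel pick a free port
    port = old.getsockname()[1]

    # Second listener on the same address/port while the first still listens.
    try:
        make_listener(port).close()
        clash_while_old_alive = False
    except OSError as exc:
        clash_while_old_alive = exc.errno == errno.EADDRINUSE

    # Release the old socket, then try again.
    old.close()
    try:
        make_listener(port).close()
        bind_ok_after_close = True
    except OSError:
        bind_ok_after_close = False

    return clash_while_old_alive, bind_ok_after_close


if __name__ == "__main__":
    print(rebind_demo())
```

If something like this is what happens in the gate, the new dnsmasq would fail only for as long as the old socket is still held, which would at least be consistent with the start succeeding after a few retries. That is a guess, not something the logs confirm.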

Lately, we landed several patches that touched reload logic for DHCP agent on notifications. Among those suspicious in the context are:

- https://review.openstack.org/#/c/372595/ - note it requests ‘disable’ (-9) where it was using ‘reload_allocations’ (-SIGHUP) before, and it also does not unplug the port on lease release (maybe after we rip out the device, the address clash with the old dnsmasq state is gone even though the ‘new’ port will use the same address?).

- https://review.openstack.org/#/c/372236/6 - we were requesting reload_allocations in some cases before, and now we put the network into the resync queue.

There were other related changes lately, you can check history of Kevin’s changes for the branch, it should capture most of them.

I wonder whether we hit some long-standing restart issue with dnsmasq here that was just never triggered before because we were not calling kill -9 as eagerly as we do now.

Note: Jakub Libosvar validated that 'kill -9 && dnsmasq' in a loop does NOT result in the failure we see in gate logs.
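For anyone who wants to play with a similar loop without a dnsmasq setup, below is a rough stand-in in Python. It is not Jakub's test: the child process is a tiny hypothetical listener that only binds a TCP socket with SO_REUSEADDR on loopback, and, unlike a bare 'kill -9 && dnsmasq', the parent waits for the killed child to be reaped before spawning the next one. All names here (CHILD, spawn_listener, kill_and_restart_loop) are made up for the sketch.

```python
import signal
import socket
import subprocess
import sys

# Stand-in for dnsmasq: bind a listener with SO_REUSEADDR, report status, block.
CHILD = r"""
import socket, sys, time
s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
s.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
try:
    s.bind(("127.0.0.1", int(sys.argv[1])))
    s.listen(1)
    print("ready", flush=True)
except OSError as exc:
    print("failed: %s" % exc, flush=True)
    sys.exit(1)
time.sleep(60)
"""


def pick_free_port():
    """Ask the kernel for a currently unused loopback TCP port."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        s.bind(("127.0.0.1", 0))
        return s.getsockname()[1]


def spawn_listener(port):
    """Start the stand-in process and return it with its first status line."""
    proc = subprocess.Popen(
        [sys.executable, "-c", CHILD, str(port)],
        stdout=subprocess.PIPE,
        text=True,
    )
    status = proc.stdout.readline().strip()
    return proc, status


def kill_and_restart_loop(iterations=5):
    """Kill the listener with SIGKILL and respawn it on the same port, repeatedly."""
    port = pick_free_port()
    proc, status = spawn_listener(port)
    results = [status]
    for _ in range(iterations):
        proc.send_signal(signal.SIGKILL)   # the 'kill -9' step
        proc.wait()                        # assumption: wait until the old process is gone
        proc.stdout.close()
        proc, status = spawn_listener(port)
        results.append(status)
    # Clean up the last listener.
    proc.kill()
    proc.wait()
    proc.stdout.close()
    return results


if __name__ == "__main__":
    print(kill_and_restart_loop())
```

Each entry in the returned list is either "ready" or a "failed: ..." line from the child, so a clash would show up as a failed entry. Dropping the proc.wait() call would bring the loop closer to the plain kill-then-start sequence, at the cost of making the outcome timing dependent.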

We need to understand what’s going on with the failure, and come up with some plan for Newton. We either revert the suspected patches, as I believe Armando proposed before, but then it’s not clear up to which point to do it; or we come up with some smart fix for that, one that I don’t immediately grasp.

I will be on vacation tomorrow, though I will check the email thread to see if we have a plan to act on. I really hope folks give the issue a priority since it seems like we buried ourselves under a pile of interleaved patches and now we don’t have a clear view of how to get out of the pile.


Cheers,
Ihar

__________________________________________________________________________
OpenStack Development Mailing List (not for usage questions)
Unsubscribe: [email protected]?subject:unsubscribe
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev
