There is no side effect other than log noise and a delayed reload? I don't see why a revert would be appropriate.
I looked at the logs and the issue seems to be that the process isn't tracked correctly the first time it starts. grep for the following: ea141299-ce07-4ff7-9a03-7a1b7a75a371', 'dnsmasq' in http://logs.openstack.org/26/377626/1/check/gate-tempest-dsvm-neutron-full-ubuntu-xenial/b6953d4/logs/screen-q-dhcp.txt.gz The first time dnsmasq is called it gives a 0 return code but the agent doesn't seem to get a pid for it. So the next time it is called it conflicts with the running proc. On Sep 27, 2016 11:22, "Ihar Hrachyshka" <ihrac...@redhat.com> wrote: > Hi all, > > so we started getting ‘Address already in use’ when trying to start > dnsmasq after the previous instance of the process is killed with kill -9. > Armando spotted it today in logs for: https://review.openstack.org/# > /c/377626/ but as per logstash it seems like an error we saw before (the > earliest I see is 9/20), f.e.: > > http://logs.openstack.org/26/377626/1/check/gate-tempest-dsv > m-neutron-full-ubuntu-xenial/b6953d4/logs/screen-q-dhcp.txt.gz > > Assuming I understand the flow of the failure, it runs as follows: > > - sync_state starts dnsmasq per network; > - after agent lock is freed, some other notification event > (port_update/subnet_update/...) triggers restart for one of the processes; > - the restart is done not via reload_allocations (-SIGHUP) but thru > restart/disable (kill -9); > - once the old dnsmasq is killed with -9, we attempt to start a new > process with new config files generated and fail with: “dnsmasq: failed to > create listening socket for 10.1.15.242: Address already in use” > - surprisingly, after several failing attempts to start the process, it > succeeds to start it after a bunch of seconds and runs fine. > > It looks like once we kill the process with -9, it may hold for the socket > resource for some time and may clash with the new process we try to spawn. > It’s a bit weird because dnsmasq should have set REUSEADDR for the socket, > so a new process should have started just fine. > > Lately, we landed several patches that touched reload logic for DHCP agent > on notifications. Among those suspicious in the context are: > > - https://review.openstack.org/#/c/372595/ - note it requests ‘disable’ > (-9) where it was using ‘reload_allocations’ (-SIGHUP) before, and it also > does not unplug the port on lease release (maybe after we rip of the > device, the address clash with the old dnsmasq state is gone even though > the ’new’ port will use the same address?). > - https://review.openstack.org/#/c/372236/6 - we were requesting > reload_allocations in some cases before, and now we put the network into > resync queue > > There were other related changes lately, you can check history of Kevin’s > changes for the branch, it should capture most of them. > > I wonder whether we hit some long standing restart issue with dnsmasq here > that was just never triggered before because we were not calling kill -9 so > eagerly as we do now. > > Note: Jakub Libosvar validated that 'kill -9 && dnsmasq’ in loop does NOT > result in the failure we see in gate logs. > > We need to understand what’s going with the failure, and come up with some > plan for Newton. We either revert suspected patches as I believe Armando > proposed before, but then it’s not clear until which point to do it; or we > come up with some smart fix for that, that I don’t immediately grasp. > > I will be on vacation tomorrow, though I will check the email thread to > see if we have a plan to act on. I really hope folks give the issue a > priority since it seems like we buried ourselves under a pile of > interleaved patches and now we don’t have a clear view of how to get out of > the pile. > > Cheers, > Ihar >
__________________________________________________________________________ OpenStack Development Mailing List (not for usage questions) Unsubscribe: openstack-dev-requ...@lists.openstack.org?subject:unsubscribe http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev