Reviewed:  https://review.openstack.org/581648
Committed: https://git.openstack.org/cgit/openstack/neutron/commit/?id=b803195a9979f7b3b0fd9cea41699b33fc8cf2bb
Submitter: Zuul
Branch:    master

commit b803195a9979f7b3b0fd9cea41699b33fc8cf2bb
Author: Yuki Nishiwaki <uckey.1...@gmail.com>
Date:   Wed Jul 25 16:05:30 2018 +0900

    Dont use dict.get() to know certain key is in dict

    The CommonAgentLoop class detects whether a tap device was changed
    locally by comparing its timestamp with the one from the previous
    iteration. Depending on timing, the timestamp value can be None
    (see bug/1781129).

    The current _get_devices_locally_modified logic cannot detect a
    local change from None to some value, because it skips the
    comparison whenever the previous timestamp was None.

    To avoid missing updated devices, it is better not to use
    dict.get() to check whether the previous iteration had a timestamp.

    Change-Id: Ib0361ad5c281f88558e8e048cfec588b9f9b1de4
    Closes-Bug: #1781129

** Changed in: neutron
       Status: In Progress => Fix Released

https://bugs.launchpad.net/bugs/1781129

Title:
  linuxbridge-agent missed updated device sometimes

Status in neutron:
  Fix Released

Bug description:
  * Version: master branch head as of 11 July 2018
    https://github.com/openstack/neutron/commit/5db397958160ea3dc952794fb7f6ec68a2da2055

  * Summary:
  When an operation such as rebuilding a VM makes a tap interface
  disappear and reappear within a short interval, linuxbridge-agent can
  miss the device-updated event, depending on when the tap device
  disappeared. This eventually causes the following
  "VirtualInterfaceCreateException" error in nova-compute, because
  neutron-server did not send the vif_plugged event to Nova.
  ---
    File "/usr/lib/python2.7/site-packages/nova/virt/libvirt/driver.py", line 4946, in _create_domain_and_network
  2018-07-11 01:07:49.632 56453 ERROR nova.compute.manager [instance: a58186f4-3b2e-46ba-acfe-d24432d117aa]     raise exception.VirtualInterfaceCreateException()
  2018-07-11 01:07:49.632 56453 ERROR nova.compute.manager [instance: a58186f4-3b2e-46ba-acfe-d24432d117aa] VirtualInterfaceCreateException: Virtual Interface creation failed
  2018-07-11 01:07:49.632 56453 ERROR nova.compute.manager [instance: a58186f4-3b2e-46ba-acfe-d24432d117aa]
  ---

  * Reproducing:
  This is very difficult to reproduce, because the precondition depends
  heavily on the running state of linuxbridge-agent. So I will explain
  the state transitions of the updated-device detection logic step by
  step.
  ---
  Suppose the hypervisor has one tap device, "tapA", for one VM, tapA's
  interface index is 1, and the user has just requested a rebuild of
  this VM.

  0. Previous device info is as follows:
     {'added': set(), 'current': set("tapA"), 'updated': set(), 'removed': set(), 'timestamps': {"tapA": 1}}

  1. Get current_devices
     https://github.com/openstack/neutron/blob/master/neutron/plugins/ml2/drivers/agent/_common_agent.py#L377
     -> current_devices is "tapA"

  __ tapA disappears due to the VM rebuild __

  2. Get timestamp (the interface index, in the case of linuxbridge-agent)
     https://github.com/openstack/neutron/blob/5db397958160ea3dc952794fb7f6ec68a2da2055/neutron/plugins/ml2/drivers/agent/_common_agent.py#L395
     -> current timestamp is {"tapA": None}, because getting the
        interface information failed at this point

  3. Check whether the device changed locally
     https://github.com/openstack/neutron/blob/5db397958160ea3dc952794fb7f6ec68a2da2055/neutron/plugins/ml2/drivers/agent/_common_agent.py#L397
     -> "tapA" is detected as a locally changed device, because the
        timestamp differs from the previous one (1 != None)

  4. Generate device_info as follows:
     {'added': set(), 'current': set("tapA"), 'updated': set("tapA"), 'removed': set(), 'timestamps': {"tapA": None}}

  5. Run the linuxbridge-agent interface plugging logic for tapA. The
     device existence check fails because there is no such device.
     Note that even when this check fails, the function does not raise
     an exception, so no re-sync happens.
     https://github.com/openstack/neutron/blob/5db397958160ea3dc952794fb7f6ec68a2da2055/neutron/plugins/ml2/drivers/linuxbridge/agent/linuxbridge_neutron_agent.py#L521-L531

  -- tapA appears again due to the VM rebuild --
  -- next scan_device iteration starts --

  6. Get current_devices
     -> current_devices is "tapA"

  7. Get timestamp
     -> current timestamp is {"tapA": 2}

  8. Check whether the device changed locally
     -> no locally changed device is detected, because of this line
     https://github.com/openstack/neutron/blob/5db397958160ea3dc952794fb7f6ec68a2da2055/neutron/plugins/ml2/drivers/agent/_common_agent.py#L369

  9. Generate device_info as follows:
     {'added': set(), 'current': set("tapA"), 'updated': set(), 'removed': set(), 'timestamps': {"tapA": 2}}

  This second iteration was expected to detect the device as updated,
  but it did not. So the detection logic for locally changed devices
  has to be improved; otherwise reboot/rebuild operations will
  occasionally fail (although this is a really rare case).
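
The two iterations described in steps 0-9 can be replayed with a small standalone sketch. This is not the neutron source: the function names locally_modified_with_get and locally_modified_with_in are hypothetical, and the bodies are only an assumed simplification of a truthiness check via dict.get() versus a membership check, as the commit message describes.

```python
def locally_modified_with_get(timestamps, previous_timestamps):
    """Hypothetical pre-fix check: uses the truthiness of dict.get().

    A previous timestamp stored as None is indistinguishable from a
    missing key, so the comparison is skipped for that device.
    """
    return {
        device
        for device, timestamp in timestamps.items()
        if previous_timestamps.get(device)
        and timestamp != previous_timestamps.get(device)
    }


def locally_modified_with_in(timestamps, previous_timestamps):
    """Hypothetical post-fix check: tests key membership explicitly.

    A previous timestamp of None still counts as a previous entry, so
    a change from None to a real value is reported.
    """
    return {
        device
        for device, timestamp in timestamps.items()
        if device in previous_timestamps
        and timestamp != previous_timestamps[device]
    }


# Iteration 1 (steps 1-4): tapA vanished, so its timestamp became None.
first_previous, first_current = {"tapA": 1}, {"tapA": None}
# Iteration 2 (steps 6-9): tapA is back with a new interface index.
second_previous, second_current = {"tapA": None}, {"tapA": 2}

for check in (locally_modified_with_get, locally_modified_with_in):
    print(check.__name__,
          check(first_current, first_previous),
          check(second_current, second_previous))
```

Both variants report tapA in the first iteration; only the membership-based variant reports it again in the second, which is the update that step 8 misses. A device absent from the previous timestamps is reported by neither variant, since such a device would belong to 'added' rather than 'updated'.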