Public bug reported: * Version: master branch head as of 11 July 2018 https://github.com/openstack/neutron/commit/5db397958160ea3dc952794fb7f6ec68a2da2055
* Summary: When the operation that make tap interface disappeared/appeared in short interval executed like rebuilding VM, linuxbridge-agent can miss updated device events depending on when tap device disappeared. this cause eventually following "VirtualInterfaceCreateException" error in nova-compute because neutron-server didn't send vif_plugged event to Nova. --- File "/usr/lib/python2.7/site-packages/nova/virt/libvirt/driver.py", line 4946, in _create_domain_and_network 2018-07-11 01:07:49.632 56453 ERROR nova.compute.manager [instance: a58186f4-3b2e-46ba-acfe-d24432d117aa] raise exception.VirtualInterfaceCreateException() 2018-07-11 01:07:49.632 56453 ERROR nova.compute.manager [instance: a58186f4-3b2e-46ba-acfe-d24432d117aa] VirtualInterfaceCreateException: Virtual Interface creation failed 2018-07-11 01:07:49.632 56453 ERROR nova.compute.manager [instance: a58186f4-3b2e-46ba-acfe-d24432d117aa] --- * Reproducing: Actually this is very difficult to reproduce because the pre-condition to make it reproduce strongly depending on the running state in linuxbridge-agent, so I'm gonna explain the state transition for the logic of detection to updated device step by step --- let's say Hypervisor have 1 tap device for 1 VM which is "tapA" and tapA's interface index is 1 and User just requested rebuilding this VM. 0. Previous device info is like following {'added': set(), 'current': set("tapA"), 'updated': set(), 'removed': set(), 'timestamps': {"tapA": 1}} 1. Get current_devices https://github.com/openstack/neutron/blob/master/neutron/plugins/ml2/drivers/agent/_common_agent.py#L377 -> current_devices is "tapA" __ disappeared tapA due to rebuilding VM __ 2. Get timestamp(interface index in the case of linuxbridge-agent) https://github.com/openstack/neutron/blob/5db397958160ea3dc952794fb7f6ec68a2da2055/neutron/plugins/ml2/drivers/agent/_common_agent.py#L395 -> current timestamp is {"tapA": None}. this is because we failed to get interface information here 3. Check device locally changed or not https://github.com/openstack/neutron/blob/5db397958160ea3dc952794fb7f6ec68a2da2055/neutron/plugins/ml2/drivers/agent/_common_agent.py#L397 -> locally "tapA" is detected as locally changed device. because timestamp information is change from before (1 != None) 4. Generate device_info like following {'added': set(), 'current': set("tapA"), 'updated': set("tapA"), 'removed': set(), 'timestamps': {"tapA": None}} 5. Process linuxbridge-agent interface plugging logic for tapA, but checking device existence failed because there is no such a device. here note even if check for device existence failed, this function won't raise exception and re-sync won't happen https://github.com/openstack/neutron/blob/5db397958160ea3dc952794fb7f6ec68a2da2055/neutron/plugins/ml2/drivers/linuxbridge/agent/linuxbridge_neutron_agent.py#L521-L531 -- appeared tapA again due to rebuilding VM -- -- next scan_device iteration start -- 6. Get current_devices -> current_devices is "tapA" 7. Get timestamp -> current timestamp is {"tapA":2}. 8. Check device locally changed or not -> no locally device is detected because of this line https://github.com/openstack/neutron/blob/5db397958160ea3dc952794fb7f6ec68a2da2055/neutron/plugins/ml2/drivers/agent/_common_agent.py#L369 9. Generate device_info like following {'added': set(), 'current': set("tapA"), 'updated': set(), 'removed': set(), 'timestamps': {"tapA": 2}} next iteration is expected to detect device updated but didn't in this case. So we have to improve this locally changed device detection logic. otherwise rebooting/rebuilding operation would fail sometimes(it's really rare case though) ** Affects: neutron Importance: Undecided Status: New -- You received this bug notification because you are a member of Yahoo! Engineering Team, which is subscribed to neutron. https://bugs.launchpad.net/bugs/1781129 Title: linuxbridge-agent missed updated device sometimes Status in neutron: New Bug description: * Version: master branch head as of 11 July 2018 https://github.com/openstack/neutron/commit/5db397958160ea3dc952794fb7f6ec68a2da2055 * Summary: When the operation that make tap interface disappeared/appeared in short interval executed like rebuilding VM, linuxbridge-agent can miss updated device events depending on when tap device disappeared. this cause eventually following "VirtualInterfaceCreateException" error in nova-compute because neutron-server didn't send vif_plugged event to Nova. --- File "/usr/lib/python2.7/site-packages/nova/virt/libvirt/driver.py", line 4946, in _create_domain_and_network 2018-07-11 01:07:49.632 56453 ERROR nova.compute.manager [instance: a58186f4-3b2e-46ba-acfe-d24432d117aa] raise exception.VirtualInterfaceCreateException() 2018-07-11 01:07:49.632 56453 ERROR nova.compute.manager [instance: a58186f4-3b2e-46ba-acfe-d24432d117aa] VirtualInterfaceCreateException: Virtual Interface creation failed 2018-07-11 01:07:49.632 56453 ERROR nova.compute.manager [instance: a58186f4-3b2e-46ba-acfe-d24432d117aa] --- * Reproducing: Actually this is very difficult to reproduce because the pre-condition to make it reproduce strongly depending on the running state in linuxbridge-agent, so I'm gonna explain the state transition for the logic of detection to updated device step by step --- let's say Hypervisor have 1 tap device for 1 VM which is "tapA" and tapA's interface index is 1 and User just requested rebuilding this VM. 0. Previous device info is like following {'added': set(), 'current': set("tapA"), 'updated': set(), 'removed': set(), 'timestamps': {"tapA": 1}} 1. Get current_devices https://github.com/openstack/neutron/blob/master/neutron/plugins/ml2/drivers/agent/_common_agent.py#L377 -> current_devices is "tapA" __ disappeared tapA due to rebuilding VM __ 2. Get timestamp(interface index in the case of linuxbridge-agent) https://github.com/openstack/neutron/blob/5db397958160ea3dc952794fb7f6ec68a2da2055/neutron/plugins/ml2/drivers/agent/_common_agent.py#L395 -> current timestamp is {"tapA": None}. this is because we failed to get interface information here 3. Check device locally changed or not https://github.com/openstack/neutron/blob/5db397958160ea3dc952794fb7f6ec68a2da2055/neutron/plugins/ml2/drivers/agent/_common_agent.py#L397 -> locally "tapA" is detected as locally changed device. because timestamp information is change from before (1 != None) 4. Generate device_info like following {'added': set(), 'current': set("tapA"), 'updated': set("tapA"), 'removed': set(), 'timestamps': {"tapA": None}} 5. Process linuxbridge-agent interface plugging logic for tapA, but checking device existence failed because there is no such a device. here note even if check for device existence failed, this function won't raise exception and re-sync won't happen https://github.com/openstack/neutron/blob/5db397958160ea3dc952794fb7f6ec68a2da2055/neutron/plugins/ml2/drivers/linuxbridge/agent/linuxbridge_neutron_agent.py#L521-L531 -- appeared tapA again due to rebuilding VM -- -- next scan_device iteration start -- 6. Get current_devices -> current_devices is "tapA" 7. Get timestamp -> current timestamp is {"tapA":2}. 8. Check device locally changed or not -> no locally device is detected because of this line https://github.com/openstack/neutron/blob/5db397958160ea3dc952794fb7f6ec68a2da2055/neutron/plugins/ml2/drivers/agent/_common_agent.py#L369 9. Generate device_info like following {'added': set(), 'current': set("tapA"), 'updated': set(), 'removed': set(), 'timestamps': {"tapA": 2}} next iteration is expected to detect device updated but didn't in this case. So we have to improve this locally changed device detection logic. otherwise rebooting/rebuilding operation would fail sometimes(it's really rare case though) To manage notifications about this bug go to: https://bugs.launchpad.net/neutron/+bug/1781129/+subscriptions -- Mailing list: https://launchpad.net/~yahoo-eng-team Post to : yahoo-eng-team@lists.launchpad.net Unsubscribe : https://launchpad.net/~yahoo-eng-team More help : https://help.launchpad.net/ListHelp