** Description changed:

+ [Impact]
+ 
+ Nova suffers from a race condition when it live migrates VMs with
+ SRIOV ports: a pre-check of available ports and their capabilities
+ can error if one or more ports become unavailable during the check.
+ The fix backported here simply ignores libvirt errors when checking
+ device capabilities, so devices that throw an error are skipped.
+ 
+ [Test Plan]
+ 
+ Since the bug is a race condition it can be hard to reproduce, but a
+ succession of live migrations between SRIOV capable nodes with a
+ reasonably large quantity of VFs should be a reasonable test.
+ 
+   * deploy OpenStack Yoga with SRIOV capable hardware
+   * create 10 VMs with e.g. 5 SRIOV ports
+   * live migrate the VMs between the hosts and check for the Traceback in /var/log/nova/nova-compute.log
+ 
+ [Regression Potential]
+ This patch is not anticipated to introduce any regressions.
+ 
+ -------------------------------------------------
+ 
  At the moment, the `_get_pci_passthrough_devices` function is prone to race conditions.
  This specific code calls `listCaps()`; however, the device may have disappeared by the time the method is called:

  https://github.com/openstack/nova/blob/master/nova/virt/libvirt/driver.py#L7949-L7959

  Which would result in the following traceback:

  2022-05-06 20:16:16.110 4053032 ERROR nova.compute.manager [req-51b7c1c4-2b4a-46cc-9baa-8bf61801c48d - - - - -] Error updating resources for node <snip>.: libvirt.libvirtError: Node device not found: no node device with matching name 'net_tap8b08ec90_e5_fe_16_3e_0f_0a_d4'
  2022-05-06 20:16:16.110 4053032 ERROR nova.compute.manager Traceback (most recent call last):
  2022-05-06 20:16:16.110 4053032 ERROR nova.compute.manager   File "/var/lib/openstack/lib/python3.8/site-packages/nova/compute/manager.py", line 9946, in _update_available_resource_for_node
  2022-05-06 20:16:16.110 4053032 ERROR nova.compute.manager     self.rt.update_available_resource(context, nodename,
  2022-05-06 20:16:16.110 4053032 ERROR nova.compute.manager   File "/var/lib/openstack/lib/python3.8/site-packages/nova/compute/resource_tracker.py", line 879, in update_available_resource
  2022-05-06 20:16:16.110 4053032 ERROR nova.compute.manager     resources = self.driver.get_available_resource(nodename)
  2022-05-06 20:16:16.110 4053032 ERROR nova.compute.manager   File "/var/lib/openstack/lib/python3.8/site-packages/nova/virt/libvirt/driver.py", line 8937, in get_available_resource
  2022-05-06 20:16:16.110 4053032 ERROR nova.compute.manager     data['pci_passthrough_devices'] = self._get_pci_passthrough_devices()
  2022-05-06 20:16:16.110 4053032 ERROR nova.compute.manager   File "/var/lib/openstack/lib/python3.8/site-packages/nova/virt/libvirt/driver.py", line 7663, in _get_pci_passthrough_devices
  2022-05-06 20:16:16.110 4053032 ERROR nova.compute.manager     vdpa_devs = [
  2022-05-06 20:16:16.110 4053032 ERROR nova.compute.manager   File "/var/lib/openstack/lib/python3.8/site-packages/nova/virt/libvirt/driver.py", line 7664, in <listcomp>
  2022-05-06 20:16:16.110 4053032 ERROR nova.compute.manager     dev for dev in devices.values() if "vdpa" in dev.listCaps()
  2022-05-06 20:16:16.110 4053032 ERROR nova.compute.manager   File "/var/lib/openstack/lib/python3.8/site-packages/libvirt.py", line 6276, in listCaps
  2022-05-06 20:16:16.110 4053032 ERROR nova.compute.manager     raise libvirtError('virNodeDeviceListCaps() failed')
  2022-05-06 20:16:16.110 4053032 ERROR nova.compute.manager libvirt.libvirtError: Node device not found: no node device with matching name 'net_tap8b08ec90_e5_fe_16_3e_0f_0a_d4'
- 2022-05-06 20:16:16.110 4053032 ERROR nova.compute.manager
+ 2022-05-06 20:16:16.110 4053032 ERROR nova.compute.manager

  I think the cleaner way is to loop over all the items and skip a device if it raises an error that the device is not found.
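
A minimal sketch of the "loop and skip" idea, not the actual Nova patch. To keep it runnable without libvirt installed, `NodeDeviceError` and `FakeNodeDevice` are hypothetical stand-ins for `libvirt.libvirtError` and a libvirt node device, and `filter_devices_by_cap` is an illustrative helper name that does not exist in Nova. Only `listCaps()` is taken from the traceback.

```python
class NodeDeviceError(Exception):
    """Hypothetical stand-in for libvirt.libvirtError."""


class FakeNodeDevice:
    """Hypothetical stand-in for a libvirt node device object."""

    def __init__(self, name, caps, gone=False):
        self.name = name
        self._caps = caps
        self._gone = gone  # simulates a device removed mid-check

    def listCaps(self):
        if self._gone:
            raise NodeDeviceError(
                "Node device not found: no node device with "
                "matching name '%s'" % self.name
            )
        return list(self._caps)


def filter_devices_by_cap(devices, cap, error_cls=NodeDeviceError):
    """Return devices advertising `cap`, skipping any that vanish."""
    matching = []
    for dev in devices.values():
        try:
            caps = dev.listCaps()
        except error_cls:
            # The device disappeared between enumeration and this call;
            # skip it instead of failing the whole resource update.
            continue
        if cap in caps:
            matching.append(dev)
    return matching


devices = {
    "pci_a": FakeNodeDevice("pci_a", ["pci", "vdpa"]),
    "net_tap": FakeNodeDevice("net_tap", ["net"], gone=True),
    "pci_b": FakeNodeDevice("pci_b", ["pci"]),
}
vdpa_devs = filter_devices_by_cap(devices, "vdpa")
print([dev.name for dev in vdpa_devs])
```

Against real libvirt one would presumably catch `libvirt.libvirtError` and could narrow the skip to the "not found" case by comparing `get_error_code()` with `libvirt.VIR_ERR_NO_NODE_DEVICE`; whether the backported fix does this or ignores all libvirt errors here is not shown in this report.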
** Summary changed:

- _get_pci_passthrough_devices prone to race condition
+ [SRU] _get_pci_passthrough_devices prone to race condition

https://bugs.launchpad.net/bugs/1972028

Title:
  [SRU] _get_pci_passthrough_devices prone to race condition