** Changed in: nova (Ubuntu)
       Status: New => Fix Released

-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to OpenStack Compute (nova).
https://bugs.launchpad.net/bugs/1972028

Title:
  [SRU] _get_pci_passthrough_devices prone to race condition

Status in Ubuntu Cloud Archive:
  New
Status in Ubuntu Cloud Archive yoga series:
  New
Status in Ubuntu Cloud Archive zed series:
  Fix Released
Status in OpenStack Compute (nova):
  Fix Released
Status in nova package in Ubuntu:
  Fix Released
Status in nova source package in Jammy:
  Fix Committed
Status in nova source package in Noble:
  Fix Released

Bug description:
  [Impact]

  Nova suffers from a race condition when it live-migrates VMs with SRIOV
  ports: a pre-check of available ports and their capabilities can fail
  if one or more ports become unavailable during the check. The fix
  backported here ignores libvirt errors raised while checking device
  capabilities, so any device that throws an error is skipped.

  [Test Plan]

  Since the bug is a race condition it can be hard to reproduce, but a
  succession of live migrations between SRIOV-capable nodes with a
  reasonably large number of VFs should be a reasonable test.

  * deploy OpenStack Yoga with SRIOV-capable hardware
  * create 10 VMs with e.g. 5 SRIOV ports
  * live migrate the VMs between the hosts and check for the Traceback
    in /var/log/nova/nova-compute.log

  [Regression Potential]

  This patch is not anticipated to introduce any regressions.

  -------------------------------------------------

  At the moment, the `_get_pci_passthrough_devices` function is prone to
  race conditions.

  This specific code here calls `listCaps()`; however, it is possible
  that the device has disappeared by the time the method is called:

  https://github.com/openstack/nova/blob/master/nova/virt/libvirt/driver.py#L7949-L7959

  Which would result in the following traceback:

  2022-05-06 20:16:16.110 4053032 ERROR nova.compute.manager [req-51b7c1c4-2b4a-46cc-9baa-8bf61801c48d - - - - -] Error updating resources for node <snip>.: libvirt.libvirtError: Node device not found: no node device with matching name 'net_tap8b08ec90_e5_fe_16_3e_0f_0a_d4'
  2022-05-06 20:16:16.110 4053032 ERROR nova.compute.manager Traceback (most recent call last):
  2022-05-06 20:16:16.110 4053032 ERROR nova.compute.manager   File "/var/lib/openstack/lib/python3.8/site-packages/nova/compute/manager.py", line 9946, in _update_available_resource_for_node
  2022-05-06 20:16:16.110 4053032 ERROR nova.compute.manager     self.rt.update_available_resource(context, nodename,
  2022-05-06 20:16:16.110 4053032 ERROR nova.compute.manager   File "/var/lib/openstack/lib/python3.8/site-packages/nova/compute/resource_tracker.py", line 879, in update_available_resource
  2022-05-06 20:16:16.110 4053032 ERROR nova.compute.manager     resources = self.driver.get_available_resource(nodename)
  2022-05-06 20:16:16.110 4053032 ERROR nova.compute.manager   File "/var/lib/openstack/lib/python3.8/site-packages/nova/virt/libvirt/driver.py", line 8937, in get_available_resource
  2022-05-06 20:16:16.110 4053032 ERROR nova.compute.manager     data['pci_passthrough_devices'] = self._get_pci_passthrough_devices()
  2022-05-06 20:16:16.110 4053032 ERROR nova.compute.manager   File "/var/lib/openstack/lib/python3.8/site-packages/nova/virt/libvirt/driver.py", line 7663, in _get_pci_passthrough_devices
  2022-05-06 20:16:16.110 4053032 ERROR nova.compute.manager     vdpa_devs = [
  2022-05-06 20:16:16.110 4053032 ERROR nova.compute.manager   File "/var/lib/openstack/lib/python3.8/site-packages/nova/virt/libvirt/driver.py", line 7664, in <listcomp>
  2022-05-06 20:16:16.110 4053032 ERROR nova.compute.manager     dev for dev in devices.values() if "vdpa" in dev.listCaps()
  2022-05-06 20:16:16.110 4053032 ERROR nova.compute.manager   File "/var/lib/openstack/lib/python3.8/site-packages/libvirt.py", line 6276, in listCaps
  2022-05-06 20:16:16.110 4053032 ERROR nova.compute.manager     raise libvirtError('virNodeDeviceListCaps() failed')
  2022-05-06 20:16:16.110 4053032 ERROR nova.compute.manager libvirt.libvirtError: Node device not found: no node device with matching name 'net_tap8b08ec90_e5_fe_16_3e_0f_0a_d4'
  2022-05-06 20:16:16.110 4053032 ERROR nova.compute.manager

  I think the cleaner way is to loop over all the items and skip a
  device if it raises an error that the device is not found.
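
A minimal sketch of the "loop and skip" idea from the bug description, not the actual nova patch. To keep it runnable without libvirt, `NodeDeviceNotFound` and `FakeNodeDevice` are hypothetical stand-ins for `libvirt.libvirtError` and a libvirt node device object; `devices_with_cap` is likewise an illustrative name, and the device names are made up.

```python
class NodeDeviceNotFound(Exception):
    """Hypothetical stand-in for libvirt.libvirtError ("Node device not found")."""


class FakeNodeDevice:
    """Hypothetical stand-in for a libvirt node device object."""

    def __init__(self, name, caps, vanished=False):
        self._name = name
        self._caps = caps
        self._vanished = vanished

    def name(self):
        return self._name

    def listCaps(self):
        # Simulate the device disappearing between enumeration and the query.
        if self._vanished:
            raise NodeDeviceNotFound(
                "no node device with matching name '%s'" % self._name)
        return list(self._caps)


def devices_with_cap(devices, cap):
    """Return devices exposing `cap`, skipping any that error when queried."""
    matching = []
    for dev in devices.values():
        try:
            caps = dev.listCaps()
        except NodeDeviceNotFound:
            # The device went away mid-check; ignore it instead of aborting
            # the whole resource update.
            continue
        if cap in caps:
            matching.append(dev)
    return matching


devices = {
    "pci_0000_03_00_0": FakeNodeDevice("pci_0000_03_00_0", ["pci", "vdpa"]),
    "net_tap_gone": FakeNodeDevice("net_tap_gone", ["net"], vanished=True),
    "pci_0000_03_00_1": FakeNodeDevice("pci_0000_03_00_1", ["pci"]),
}
vdpa_names = [dev.name() for dev in devices_with_cap(devices, "vdpa")]
print(vdpa_names)
```

In nova itself the exception would be `libvirt.libvirtError`, as seen in the traceback. Whether to skip on every libvirt error (as the [Impact] section describes) or only on the "Node device not found" case (as suggested above) is a design choice this sketch does not settle.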