** Description changed:

+ [Impact]
+ 
+ Nova suffers from a race condition when live migrating VMs with SRIOV
+ ports: a pre-check of the available ports and their capabilities can
+ fail if one or more ports become unavailable during the check. The fix
+ backported here ignores libvirt errors raised while checking device
+ capabilities, so any device that throws an error is skipped.
+ 
+ [Test Plan]
+ 
+ Since the bug is a race condition, it can be hard to reproduce, but a
+ succession of live migrations between SRIOV-capable nodes with a
+ reasonably large number of VFs should exercise it.
+ 
+ * deploy OpenStack Yoga with SRIOV-capable hardware
+ * create 10 VMs with e.g. 5 SRIOV ports
+ * live migrate the VMs between the hosts and check for the Traceback in
+   /var/log/nova/nova-compute.log
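
The log check in the last step could be scripted along these lines. This is only a sketch: the helper names `find_race_errors` and `scan_log` are made up for illustration, and the pattern assumes the error appears on a single log line in the same form as in the traceback quoted in this bug.

```python
import re

# Matches the libvirt error message quoted in this bug report.
RACE_ERROR = re.compile(r"libvirt\.libvirtError: Node device not found")


def find_race_errors(lines):
    """Return the log lines that contain the 'Node device not found' error."""
    return [line.rstrip("\n") for line in lines if RACE_ERROR.search(line)]


def scan_log(path="/var/log/nova/nova-compute.log"):
    """Hypothetical helper: scan a nova-compute log file on a compute host."""
    with open(path, encoding="utf-8", errors="replace") as log:
        return find_race_errors(log)


# Demo on in-memory lines so the sketch runs without a real log file.
sample = [
    "2022-05-06 20:16:16.110 4053032 ERROR nova.compute.manager "
    "libvirt.libvirtError: Node device not found: no node device with "
    "matching name 'net_tap8b08ec90_e5_fe_16_3e_0f_0a_d4'\n",
    "2022-05-06 20:16:17.000 4053032 INFO nova.compute.manager "
    "unrelated message\n",
]
hits = find_race_errors(sample)
print(len(hits))
```

On a compute host, calling `scan_log()` after each round of migrations and getting an empty list back would suggest the traceback did not occur in that run.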
+ 
+ [Regression Potential]
+ This patch is not anticipated to introduce any regressions.
+ -------------------------------------------------
+ 
  At the moment, the `_get_pci_passthrough_devices` function is prone to
  race conditions.
  
  This specific code calls `listCaps()`; however, it is possible that
  the device has disappeared by the time the method is called:
  
  
https://github.com/openstack/nova/blob/master/nova/virt/libvirt/driver.py#L7949-L7959
  
  Which would result in the following traceback:
  
  2022-05-06 20:16:16.110 4053032 ERROR nova.compute.manager 
[req-51b7c1c4-2b4a-46cc-9baa-8bf61801c48d - - - - -] Error updating resources 
for node <snip>.: libvirt.libvirtError: Node device not found: no node device 
with matching name 'net_tap8b08ec90_e5_fe_16_3e_0f_0a_d4'
  2022-05-06 20:16:16.110 4053032 ERROR nova.compute.manager Traceback (most 
recent call last):
  2022-05-06 20:16:16.110 4053032 ERROR nova.compute.manager   File 
"/var/lib/openstack/lib/python3.8/site-packages/nova/compute/manager.py", line 
9946, in _update_available_resource_for_node
  2022-05-06 20:16:16.110 4053032 ERROR nova.compute.manager     
self.rt.update_available_resource(context, nodename,
  2022-05-06 20:16:16.110 4053032 ERROR nova.compute.manager   File 
"/var/lib/openstack/lib/python3.8/site-packages/nova/compute/resource_tracker.py",
 line 879, in update_available_resource
  2022-05-06 20:16:16.110 4053032 ERROR nova.compute.manager     resources = 
self.driver.get_available_resource(nodename)
  2022-05-06 20:16:16.110 4053032 ERROR nova.compute.manager   File 
"/var/lib/openstack/lib/python3.8/site-packages/nova/virt/libvirt/driver.py", 
line 8937, in get_available_resource
  2022-05-06 20:16:16.110 4053032 ERROR nova.compute.manager     
data['pci_passthrough_devices'] = self._get_pci_passthrough_devices()
  2022-05-06 20:16:16.110 4053032 ERROR nova.compute.manager   File 
"/var/lib/openstack/lib/python3.8/site-packages/nova/virt/libvirt/driver.py", 
line 7663, in _get_pci_passthrough_devices
  2022-05-06 20:16:16.110 4053032 ERROR nova.compute.manager     vdpa_devs = [
  2022-05-06 20:16:16.110 4053032 ERROR nova.compute.manager   File 
"/var/lib/openstack/lib/python3.8/site-packages/nova/virt/libvirt/driver.py", 
line 7664, in <listcomp>
  2022-05-06 20:16:16.110 4053032 ERROR nova.compute.manager     dev for dev in 
devices.values() if "vdpa" in dev.listCaps()
  2022-05-06 20:16:16.110 4053032 ERROR nova.compute.manager   File 
"/var/lib/openstack/lib/python3.8/site-packages/libvirt.py", line 6276, in 
listCaps
  2022-05-06 20:16:16.110 4053032 ERROR nova.compute.manager     raise 
libvirtError('virNodeDeviceListCaps() failed')
  2022-05-06 20:16:16.110 4053032 ERROR nova.compute.manager 
libvirt.libvirtError: Node device not found: no node device with matching name 
'net_tap8b08ec90_e5_fe_16_3e_0f_0a_d4'
- 2022-05-06 20:16:16.110 4053032 ERROR nova.compute.manager 
+ 2022-05-06 20:16:16.110 4053032 ERROR nova.compute.manager
  
  I think the cleaner way is to loop over all the items and skip a device
  if it raises an error that the device is not found.
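
A minimal sketch of that loop follows. It is not the actual Nova patch. So that it runs without libvirt installed, `FakeLibvirtError` and `FakeNodeDevice` are hypothetical stand-ins for `libvirt.libvirtError` and a libvirt node device, and `NO_NODE_DEVICE` is a placeholder for the libvirt "no node device" error code. The helper name `devices_with_cap` is also made up for illustration.

```python
# Stand-ins so the sketch runs without libvirt; none of these names are Nova's.
NO_NODE_DEVICE = 67  # placeholder for libvirt's "no node device" error code


class FakeLibvirtError(Exception):
    """Hypothetical stand-in for libvirt.libvirtError."""

    def __init__(self, message, code):
        super().__init__(message)
        self._code = code

    def get_error_code(self):
        return self._code


class FakeNodeDevice:
    """Hypothetical stand-in for a libvirt node device."""

    def __init__(self, name, caps, gone=False):
        self._name = name
        self._caps = caps
        self._gone = gone

    def name(self):
        return self._name

    def listCaps(self):
        if self._gone:
            # Simulates the device vanishing before listCaps() is called.
            raise FakeLibvirtError(
                "Node device not found: no node device with matching name "
                f"'{self._name}'", NO_NODE_DEVICE)
        return list(self._caps)


def devices_with_cap(devices, cap):
    """Return devices exposing `cap`, skipping any that have disappeared."""
    matched = []
    for dev in devices:
        try:
            caps = dev.listCaps()
        except FakeLibvirtError as exc:
            if exc.get_error_code() == NO_NODE_DEVICE:
                continue  # device went away between listing and the query
            raise  # any other failure is still reported
        if cap in caps:
            matched.append(dev)
    return matched


devices = [
    FakeNodeDevice("vdpa_0", ["vdpa"]),
    FakeNodeDevice("net_tap8b08ec90_e5_fe_16_3e_0f_0a_d4", ["net"], gone=True),
    FakeNodeDevice("pci_0000_81_00_0", ["pci"]),
]
vdpa_names = [dev.name() for dev in devices_with_cap(devices, "vdpa")]
print(vdpa_names)
```

Only the "not found" case is skipped here; other errors are re-raised so that unrelated libvirt failures stay visible.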

** Summary changed:

- _get_pci_passthrough_devices prone to race condition
+ [SRU] _get_pci_passthrough_devices prone to race condition

-- 
You received this bug notification because you are a member of Ubuntu
Bugs, which is subscribed to Ubuntu.
https://bugs.launchpad.net/bugs/1972028

Title:
  [SRU] _get_pci_passthrough_devices prone to race condition

To manage notifications about this bug go to:
https://bugs.launchpad.net/cloud-archive/+bug/1972028/+subscriptions


